Commit Graph

2275 Commits

Jerry Zhang
7ddf212f33 [quant][fx] Fully align convert with the reference model design and simplify the implementation (#73863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73863

This PR fully aligns the convert function with the design: https://github.com/pytorch/rfcs/blob/master/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md
and simplifies the implementation of the convert function by always producing a reference quantized model (with reference patterns) first,
and then lowering the model to a quantized model that is runnable with the PyTorch native backend (fbgemm/qnnpack).

This PR makes convert.py much easier to understand than the previous implementation, and we are able to remove the majority of the code
in quantization_patterns.py as well (in follow-up PRs).

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```
and other internal/oss regression tests

Imported from OSS

Reviewed By: andrewor14

Differential Revision: D34778506

fbshipit-source-id: 0678b66addf736039a8749b352f6f569caca962b
(cherry picked from commit 33ec9caf23f3ab373d827117efbd9db0668b2437)
2022-03-11 17:11:30 +00:00
Junjie Wang (PyTorch)
616b36e437 [PT-D][FSDP] Implement _clip_grad_norm_ for FSDP (#73405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73405

Implement the `_clip_grad_norm_` for FSDP, issue: https://github.com/pytorch/pytorch/issues/72548
ghstack-source-id: 151059433

Test Plan: CI

Reviewed By: rohan-varma

Differential Revision: D34230605

fbshipit-source-id: bbac7a6e49276e0f0502e2f4466c984aee2629fa
(cherry picked from commit f10d090cd11489608ab3f67f52e3e950cd9f7dea)
2022-03-11 00:41:07 +00:00
Xiao Wang
5b805a6eec Disable TF32 in some linalg tests; Disable TF32 in svd_lowrank forward (#73614)
Summary:
Follow up of https://github.com/pytorch/pytorch/pull/73460, https://github.com/pytorch/pytorch/issues/73461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73614

Reviewed By: malfet

Differential Revision: D34772822

Pulled By: ngimel

fbshipit-source-id: 4e2bea0173d1b6b01e857ef63ef5c2d8c3802544
(cherry picked from commit 599486314370c5d2c771724139c0186ce190990b)
2022-03-10 19:12:02 +00:00
Natalia Gimelshein
967606124a port torch cov tests to error inputs (#73977)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73977

Reviewed By: malfet

Differential Revision: D34779552

Pulled By: ngimel

fbshipit-source-id: b4191101a029981eb27c75e1b56d739db046f819
(cherry picked from commit 2c2af726ffdba68f358a4ff0ee07580609bccc34)
2022-03-10 19:04:44 +00:00
Samantha Andow
78e17eaadc expanded weights: conv faster rule (#73692)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73692

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D34719302

Pulled By: samdow

fbshipit-source-id: 2288320a5f5d6a442da78e9fbe722f300b844be9
(cherry picked from commit a4cf23383c16d3c61d53e9d21f426259d2dc2d37)
2022-03-10 04:06:08 +00:00
Thiago Crepaldi
1fbc08c70c Add Autocast support for Einsum (#71916)
Summary:
ONNX spec for Einsum requires all inputs to be the same dtype.

PyTorch runtime does not allow executing aten::einsum with
mismatching types by default, so the export would never succeed.

However, when the model is wrapped by `torch.autocast()`,
the run succeeds and the ONNX converter will create an Einsum ONNX node
with mismatched types as input, which is not allowed by the aforementioned schema.

This PR adds onnx::Einsum to the Autocast enabled list, which outputs lower-resolution tensors.
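
As a minimal sketch (assuming a CUDA device; `EinsumModel` is a hypothetical module, not part of this PR), the mixed-dtype situation described above looks roughly like this in eager mode:

```python
import torch

class EinsumModel(torch.nn.Module):
    def forward(self, x, y):
        return torch.einsum("bij,bjk->bik", x, y)

model = EinsumModel().cuda()
x = torch.randn(2, 3, 4, device="cuda", dtype=torch.float16)
y = torch.randn(2, 4, 5, device="cuda", dtype=torch.float32)

# Without autocast this einsum would fail on the dtype mismatch; under
# autocast both inputs are cast to the lower-precision dtype and the run
# succeeds, which is the case the ONNX exporter has to handle.
with torch.autocast("cuda"):
    out = model(x, y)
print(out.dtype)
```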

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71916

Reviewed By: ngimel

Differential Revision: D34629666

Pulled By: malfet

fbshipit-source-id: ec757bb87190a5b7512969e10a32450e9e1f87a1
(cherry picked from commit 7f2b5a6408ae34a6b9f858c3e9f5970b64ca1b4b)
2022-03-08 22:04:30 +00:00
Natalia Gimelshein
e47a5a64bb Back out "Revert D34524207: [pytorch][PR] remove _s_where" (#73579)
Summary:
Original commit changeset: 87b1220d851c

Original Phabricator Diff: D34524207 (4eb2482568)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73579

Test Plan:
OSS tests
tested with canary https://www.internalfb.com/intern/ads/canary/441912928798660873

Reviewed By: ezyang

Differential Revision: D34688237

Pulled By: ngimel

fbshipit-source-id: 32f3a0046053ef52e95ab45a26bfc1de17e7e061
(cherry picked from commit d1c0acbe3e0ff884c429072923a468ee1d3d447d)
2022-03-08 19:15:30 +00:00
Andrew Gu
9012e8d65a [ZeRO][BE] Clean up ZeRO tests (#73842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842

**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for the heavy formatting changes mixed in with the actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.

The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
    - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
    - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
    - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available (see the sketch after this list). This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
    - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPUs if enough are available and falls back to CPU otherwise, which is safe since the test uses the Gloo backend.
    - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
    - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
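
A minimal sketch (hypothetical test class; names are illustrative, not the actual test code) of the `parametrize` + `skip_if_no_gpu` pattern adopted above:

```python
from torch.testing._internal.common_distributed import MultiProcessTestCase, skip_if_no_gpu
from torch.testing._internal.common_utils import instantiate_parametrized_tests, parametrize

class ZeroSketchTest(MultiProcessTestCase):
    # parametrize replaces hand-written for-loops over argument values, and
    # skip_if_no_gpu replaces the manual world_size/device_count guards.
    @skip_if_no_gpu
    @parametrize("static_graph", [False, True])
    def test_something(self, static_graph):
        self.assertIn(static_graph, (False, True))

instantiate_parametrized_tests(ZeroSketchTest)
```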

**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34675709

Pulled By: awgu

fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
2022-03-08 13:15:20 +00:00
Peter Bell
9ef5c679ef record_function: add torchbind alternative API (#72301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72301

First step in resolving #35026.

This adds `PythonRecordFunction`, which is a `torch::CustomClassHolder`
for `at::RecordFunction`, to keep the ATen code free of torch includes.
It also adds a new, currently unused internal API function,
`_record_function_enter_new`, which returns the torchbind object.

Once the FC period has expired, `torch.profiler.record_function` will
be updated to use this new internal API. Then, once the BC period has
expired, the cpp_custom_type_hack-based API can be removed.
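
The public API stays the same throughout; a minimal usage sketch of `torch.profiler.record_function` (the entry point that will eventually route through `_record_function_enter_new`):

```python
import torch
from torch.profiler import profile, record_function

# Label a region of work so it shows up as "my_block" in the profiler output.
with profile() as prof:
    with record_function("my_block"):
        torch.randn(128, 128).mm(torch.randn(128, 128))

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```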

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D34586311

Pulled By: robieta

fbshipit-source-id: d3eb9ffad7b348548a2b22c75203a92d1cb5115b
(cherry picked from commit 92d2ca808e5fbd20c9d6645dcabc3f059f9ef2d3)
2022-03-08 03:26:27 +00:00
soulitzer
de73f9a558 Add forward AD support for logsumexp, log_softmax, softmax, nll_loss, and cross_entropy (#73741)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73741

There are probably more perf improvements that can be made, for example reusing more quantities from the forward pass and doing more things in place, but in the spirit of improving coverage, this is probably OK for now.

Note: I didn't do anything with half_to_float, but CUDA (locally) hasn't complained yet

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34690141

Pulled By: soulitzer

fbshipit-source-id: fe934e191fee2c8e956d7a5f4b553923adf1b33f
(cherry picked from commit ae49aff7f7c8496e04a3ce7667d8f068ca0a52ec)
2022-03-08 00:46:27 +00:00
anjali411
086645ad77 Update __torch_dispatch__ to return op overload instead of the opoverload packet function (#72673)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72673
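
For illustration (not from the PR), the distinction between the two objects a `__torch_dispatch__` handler might receive:

```python
import torch

# torch.ops.aten.mul is an OpOverloadPacket grouping every overload of the op,
# while torch.ops.aten.mul.Tensor is one specific OpOverload. After this
# change, __torch_dispatch__ receives the specific overload as `func`.
packet = torch.ops.aten.mul
overload = torch.ops.aten.mul.Tensor
print(packet)    # e.g. aten.mul
print(overload)  # e.g. aten.mul.Tensor
```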

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D34627164

Pulled By: anjali411

fbshipit-source-id: 3cb6406a392d530bf9da36b4d8e0a62b30e6497e
(cherry picked from commit 65b85a0a67df4d0f16ac8964e2b685d478a610fb)
2022-03-07 22:38:42 +00:00
Andrew Gu
4a06b8d36c [FSDP] Add grad accumulation without no_sync() (#73535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73535

**Overview**
- This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker.
- This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not propagating to the `FullyShardedDataParallel` constructor.
- This adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like:
```
<built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error
```

NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading.
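
A minimal sketch of the two accumulation modes (assuming `fsdp_model`, `optimizer`, `loss_fn`, and `microbatches` are provided by the caller; not the code from this PR):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def accumulate_without_no_sync(fsdp_model: FSDP, optimizer, loss_fn, microbatches):
    # Gradients are reduce-scattered on every microbatch, so each rank only
    # holds its gradient shard (less memory, more communication).
    for batch in microbatches:
        loss_fn(fsdp_model(batch)).backward()
    optimizer.step()
    optimizer.zero_grad()

def accumulate_with_no_sync(fsdp_model: FSDP, optimizer, loss_fn, microbatches):
    # Communication is skipped for the first k-1 microbatches; each rank
    # accumulates the full unsharded gradient locally (more memory, less
    # communication), then synchronizes on the last microbatch.
    with fsdp_model.no_sync():
        for batch in microbatches[:-1]:
            loss_fn(fsdp_model(batch)).backward()
    loss_fn(fsdp_model(microbatches[-1])).backward()
    optimizer.step()
    optimizer.zero_grad()
```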

**Test Plan**
I augmented the tests to cover gradient accumulation that interleaves iterations with and without `no_sync()`.

After this diff:
- QPS (ResNet): f328439897
- QPS (RoBERTa): f328440141
- Accuracy: f328442119

Before this diff (trunk):
- QPS (ResNet): f328432756
- QPS (RoBERTa): f328436766
- Accuracy: f328437896

Test Plan: Imported from OSS

Reviewed By: zhaojuanmao

Differential Revision: D34533546

Pulled By: awgu

fbshipit-source-id: 821d762dfad5f2b1e59adcb8e5cb7c277399040c
(cherry picked from commit 746a5ea2720dcf87c376229b405a318396fe5769)
2022-03-07 20:33:22 +00:00
Pritam Damania
aca4d02d12 Use higher timeout for test_tensorpipe_set_default_timeout (#73771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73771

The runtime for this test doesn't actually depend on the timeout value
specified here. As a result, we increase the timeout to avoid flakiness.

https://ossci-raw-job-status.s3.amazonaws.com/log/4666724994 is an example of
where this test failed due to a small timeout as reported in
https://github.com/pytorch/pytorch/issues/70546
ghstack-source-id: 150507765

Test Plan:
1) waitforbuildbot
2) run the unit test

Reviewed By: mrshenli

Differential Revision: D34632204

fbshipit-source-id: ffe0f40d08f7a36f90f30f493a189608897bbb4c
(cherry picked from commit a4920a4bfcbd26967567b55ee8417e994d53df49)
2022-03-04 23:29:18 +00:00
Natalia Gimelshein
55525632ab Revert D34554432: Back out "Revert D34524207: [pytorch][PR] remove _s_where"
Test Plan: revert-hammer

Differential Revision:
D34554432 (9c03c6163f)

Original commit changeset: 2f3601d3d426

Original Phabricator Diff: D34554432 (9c03c6163f)

fbshipit-source-id: db434750f44c6e6ec545a248c462d8fdcbefbaf8
(cherry picked from commit 866d4d0c795edd7ef519925683b5e57dd9b116ad)
2022-03-04 20:32:39 +00:00
Natalia Gimelshein
9c03c6163f Back out "Revert D34524207: [pytorch][PR] remove _s_where" (#73579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73579

Original commit changeset: 87b1220d851c

Original Phabricator Diff: D34524207 (4eb2482568)

Test Plan: OSS tests

Reviewed By: malfet

Differential Revision: D34554432

fbshipit-source-id: 2f3601d3d4261ebcebb05b4b1aec0c9a8a00ea04
(cherry picked from commit b9cad3f2bc54e12b275567454336cf4d9dcb78c4)
2022-03-04 19:35:41 +00:00
wayi1
0bb3b0652c [Model Averaging] Support hierarchical model averaging (#73285)
Summary:
Implement hierarchical model averaging proposed in https://github.com/pytorch/pytorch/issues/71325.

Unit tests are added. Since I don't have access to 4-GPU machines in the open-source environment, I expect that the branch with the `ci-all` prefix can run the test that requires 4 GPUs.

In the future, the internals of `PeriodicModelAverager` can be simplified as an implementation of a specialized hierarchical model averaging, where `period_group_size_dict` only has a single pair of period and world size.
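
A minimal usage sketch (assuming the default process group is already initialized with 16 ranks, and that the class is exposed as `HierarchicalModelAverager` under `torch.distributed.algorithms.model_averaging`):

```python
from collections import OrderedDict

import torch.distributed.algorithms.model_averaging.hierarchical_model_averager as hma

# Average within 4-rank subgroups every 2 steps and across all 16 ranks every
# 8 steps, after 10 warmup steps.
averager = hma.HierarchicalModelAverager(
    period_group_size_dict=OrderedDict([(2, 4), (8, 16)]),
    warmup_steps=10,
)

# In the training loop, after optimizer.step():
#     averager.average_parameters(model.parameters())
```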

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73285

Reviewed By: mrshenli

Differential Revision: D34457792

Pulled By: rohan-varma

fbshipit-source-id: 39a6c5bf8a2852b6394a56abbad17b8a909b9fba
(cherry picked from commit 5f543d46103edb515db199dbb80db43c85665f29)
2022-03-04 18:29:36 +00:00
Shihao Xu
bcd0843bec [torch.distributed][DDP] Disable DDP bucketing for the first iteration (#72843)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72843

# [Debug Story] Training Hanging and DDP Bucketing

**What are the characteristics of the hanging training instance?**

The model uses TorchRec `PooledEmbeddingArch` and corresponding sharding solution.

The model config difference to trigger this hanging issue is turning on position weighted embedding tables.

A feature processor module, `GroupedPositionWeightedModule`, is constructed on all ranks, but `GroupedPositionWeightedModule.forward(...)` is only [called on a subset of ranks in the whole world](https://fburl.com/code/yqrmtvli).

**What was the initial manifested error?**

The training was stuck in the first iteration.

**What are useful debugging tools this time?**

After turning off [static_graph in DDP](https://fburl.com/code/4io81p5i), we saw sparse feature lengths becoming negative after all-to-all collectives; the hang became a fatal failure.

After turning on [torch.distributed DETAIL debugging mode](https://fburl.com/code/cp8e28mm), we saw 2 trainers send out mismatched collectives, one doing all-to-all and the other doing all-reduce. So we know the negative values come from an all-to-all being matched with an all-reduce; the error had happened earlier, namely the wrong timing of either the all-reduce or the all-to-all.

With more logging added inside DDP, it turned out that DDP decided to do the all-reduce at different times on different ranks.

**What is DDP bucketing?**

Once a gradient is ready on a rank, DDP uses all-reduce to synchronize the average of this gradient across all ranks.

Say we have 4 tensor ops. A, B, C, D.

In the most naive version, we could do one synchronization when all gradients in the full backward graph are ready.

The time sequence would be,

* D.grad
* C.grad
* B.grad
* A.grad
* All reduce on [D.grad, C.grad, B.grad, A.grad].

But that would be a huge waste of communication channel bandwidth.

With DDP bucketing, we can start some of the gradient synchronization earlier, bucket by bucket. The above time sequence now becomes,

* D.grad
* C.grad
* All reduce on [D.grad, C.grad].
* B.grad
* A.grad
* All reduce on [B.grad, A.grad].

Because gradient computation overlaps with communication, the bucketing technique brings better DDP execution performance.

**What exactly went wrong in this case?**

1. The bucketing doesn’t honor backward graph execution order.
2. There are other collectives comm ops in backward graph.
3. There are unused parameters (i.e., an unused sub-module) on a subset of ranks in the whole world.

Using the above example again, we have 4 tensor ops. A, B, C, D.

Say we have 2 trainers,

B is the feature processor module.

B only runs on trainer 0 (both forward and backward), but not on trainer1.

C is the All-to-all (Pooled embeddings distribution).

C sends out all-to-all collective in both its forward and backward pass.

Keep assuming all other ops run on both trainers.

trainer_0 op sequence is,

A, B (feature preproc), C (all-to-all), D | D.grad, C.grad (reverse all-to-all), B.grad (feature proc grads), A.grad

trainer_1 op sequence is,

A, C (all-to-all), D | D.grad, C.grad (reverse all-to-all), A.grad

Even though the correct bucketing should be (same bucketing for both ranks),

* bucket_0, [D.grad, C.grad]
* bucket_1, [B.grad, A.grad]

but because of 1), they end up like,

* bucket_0, [B.grad, D.grad]
* bucket_1, [C.grad, A.grad]

Plus 2) and 3), the time sequence could look like,

(check mark represents the gradient is ready)

(bucket is ready to do synchronization if all its enclosing gradients are ready)

* trainer_0
   * t0,
      * D.grad
      * bucket_0, [B.grad, D.grad ✓]
   * t1,
      * **C.grad all-to-all**
      * C.grad ✓
      * bucket_1, [C.grad ✓, A.grad]
   * t2
      * B.grad
      * bucket_0, [B.grad ✓, D.grad ✓] ✓
   * t3
      * All-reduce for bucket_0
   * t4
      * A.grad
      * bucket_1, [C.grad ✓, A.grad ✓] ✓
* trainer_1
   * t0,
      * D.grad
      * bucket_0, [B.grad ✓, D.grad ✓] ✓. (Because B is not used on trainer_1, DDP marks its gradient as ready immediately.)
   * t1,
      * **All-reduce for bucket_0**
   * t2
      * C.grad all-to-all
      * bucket_1, [C.grad ✓, A.grad]
   * t3
      * A.grad
      * bucket_1, [C.grad ✓, A.grad ✓] ✓

This is why trainer_0 all-to-all is matched up with trainer_1 all-reduce.

**What is the solution for fixing DDP?**

Disable DDP bucketing for the first iteration. D34051938

This is because after the first iteration, buckets will be built again based on real backward graph execution order.

So the slow gradient synchronization only affects the first iteration.
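
A minimal sketch of the failure setup described above (hypothetical module; the real model uses TorchRec sharding and a pooled-embedding all-to-all):

```python
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class RankDependentModel(nn.Module):
    # `b` plays the role of the feature processor: constructed on all ranks
    # but only run on rank 0, so it is an unused parameter elsewhere.
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(8, 8)
        self.b = nn.Linear(8, 8)
        self.d = nn.Linear(8, 8)

    def forward(self, x):
        x = self.a(x)
        if dist.get_rank() == 0:
            x = self.b(x)
        # (the pooled-embedding all-to-all would sit here, issuing its own
        # collectives in both forward and backward)
        return self.d(x)

def wrap(model: nn.Module) -> DDP:
    # find_unused_parameters=True is needed because `b` never gets a gradient
    # on ranks other than 0; with bucketing disabled for the first iteration,
    # the first all-reduce order can no longer interleave differently with the
    # backward all-to-all across ranks.
    return DDP(model, find_unused_parameters=True)
```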

Test Plan:
buck build mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn
BACKEND=gloo WORLD_SIZE=3 buck-out/gen/caffe2/test/distributed/distributed_gloo_spawn\#binary.par -r test_ddp_logging_data_cpu

P484179296

buck build mode/dev-nosan caffe2/test/distributed:distributed_nccl_spawn
BACKEND=nccl WORLD_SIZE=2 buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn\#binary.par -r test_ddp_logging_data_cpu -r test_ddp_get_bucket_sizes
P484177200

Reviewed By: zhaojuanmao

Differential Revision: D34051938

fbshipit-source-id: 0c7f35875687095c3199f19990e73a8349b6e5b9
(cherry picked from commit bb8f11306ea51c2bd3ffd3ab001d62ce369a08ee)
2022-03-04 18:29:36 +00:00
Khushi Agrawal
905efa82ff [fix] torch.broadcast_shapes should not handle shapes with negative dimensions. (#72999)
Summary:
Hi,
The PR fixes https://github.com/pytorch/pytorch/issues/68957. It aims to include the following:
- Fixes the code in `torch/functional.py`.
- Adds the missing tests for negative input values and non-iterable inputs.

~#### TODO~
~- [x] Add OpInfo~
EDIT: `broadcast_shapes` doesn't take any tensor inputs, so we don't need an OpInfo here. Thanks, kshitij12345, for the guidance.

#### Earlier
```python
>>> shapes = [1, -12]
>>> torch.broadcast_shapes(*shapes)
torch.Size([-12])    # MUST RAISE ERROR
```

#### Now
```python
>>> shapes = [1, -12]
>>> torch.broadcast_shapes(*shapes)
RuntimeError: Trying to create tensor with negative dimension -12: [-12]
```

#### NumPy's Output
```python
>>> shapes = [1, -12]
>>> numpy.broadcast_shapes(*shapes)
ValueError: negative dimensions are not allowed
```

#### `torch.broadcast_tensor()` Output
As mentioned in the [doc](https://pytorch.org/docs/stable/generated/torch.broadcast_shapes.html):
```python
>>> shapes = [1, -12]
>>> torch.broadcast_tensors(*map(torch.empty, shapes))[0].shape
RuntimeError: Trying to create tensor with negative dimension -12: [-12]
```

Looking forward to hearing from you and your questions. Thanks! :)

cc: mruberry kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72999

Reviewed By: albanD

Differential Revision: D34543995

Pulled By: ngimel

fbshipit-source-id: e32b1f266500a5e002c8f353b1e02f44c23d4f6e
(cherry picked from commit a6253ce6bb8455a3c89398f12b7d790a0b7e8d95)
2022-03-03 18:33:06 +00:00
Pearu Peterson
4168c87ed3 Support CSR to COO conversion in to_sparse(2). (#73642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73642

Formerly https://github.com/pytorch/pytorch/pull/73471, which was reverted due to lack of `to_sparse(sparse_dim)` support.
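
A minimal sketch of the conversion this enables:

```python
import torch

# 2x2 CSR matrix [[1, 2], [3, 4]]
csr = torch.sparse_csr_tensor(
    torch.tensor([0, 2, 4]),            # crow_indices
    torch.tensor([0, 1, 0, 1]),         # col_indices
    torch.tensor([1.0, 2.0, 3.0, 4.0]),
    size=(2, 2),
)

coo = csr.to_sparse(2)   # CSR -> COO with sparse_dim=2
print(coo.layout)        # torch.sparse_coo
```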

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D34580353

Pulled By: cpuhrsch

fbshipit-source-id: a8a4ea381daeb80d8365fe931af9f55a7e789ea1
(cherry picked from commit 5a3cf8110980e5a10dbb687e87e67d5524ebf2f5)
2022-03-02 22:33:32 +00:00
Nikita Karetnikov
eb0d370f14 Write explicit meta-kernels for normal (#70089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70089

See #69386.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34089964

Pulled By: bdhirsh

fbshipit-source-id: eb88eb7c4830545d3d43c82b6f3abb98617cee8e
(cherry picked from commit 89c9c02a0fb1c780495fee6370961104f4b1dcd1)
2022-03-01 23:28:14 +00:00
Chien-Chin Huang
6396547f9e [FSDP] Make summon_full_params a public method (#73116)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73116

Users may need summon_full_params() to get the original parameters.
ghstack-source-id: 150134237
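
A minimal usage sketch (assuming the context-manager form of the now-public method on an already-wrapped model):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def inspect_full_params(fsdp_model: FSDP) -> None:
    # Inside the context the full, unsharded parameters are materialized,
    # so their original shapes are visible.
    with fsdp_model.summon_full_params():
        for name, param in fsdp_model.named_parameters():
            print(name, tuple(param.shape))
```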

Test Plan: CI

Reviewed By: rohan-varma

Differential Revision: D34353034

fbshipit-source-id: ac69cc032da177903cd9969094f3f82dc6a61636
(cherry picked from commit 55d34fdee3778110a165a13ae987d0339e8d33c7)
2022-03-01 22:29:28 +00:00
Nikita Shulga
8ac7393565 Revert D33767740: [pytorch][PR] Sparse CSR CPU: cuSolverSP backend for linalg.solve
Test Plan: revert-hammer

Differential Revision:
D33767740 (199d9a992c)

Original commit changeset: a945f065210c

Original Phabricator Diff: D33767740 (199d9a992c)

fbshipit-source-id: b7934df18118f8d6d5f165deb5aae9887953ae43
(cherry picked from commit d3ddbb021b227e3638f6f7c22c6eadfa73695e31)
2022-03-01 18:33:23 +00:00
Nikita Shulga
dd9517cc4a Revert D34524207: [pytorch][PR] remove _s_where
Test Plan: revert-hammer

Differential Revision:
D34524207 (4eb2482568)

Original commit changeset: bc71e27b6d3f

Original Phabricator Diff: D34524207 (4eb2482568)

fbshipit-source-id: 87b1220d851c3d2b51bdd1cf2f8a493c58ab9b14
(cherry picked from commit af1f0cc9e032b00619a7979bbbd2281f69e0fdf0)
2022-03-01 17:43:16 +00:00
Rohan Varma
1cf6b34c0e [Easy][Tests] Rename module in test (#73551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73551

Rename to better indicate what it is.
ghstack-source-id: 150166352

Test Plan: CI

Reviewed By: awgu

Differential Revision: D34537964

fbshipit-source-id: 5465003c2a2fd6f1a2646c375bc7c11d297e3f9e
(cherry picked from commit 9f11bdef88c7886b59fedc939e7149872ad73453)
2022-03-01 16:37:37 +00:00
Rohan Varma
95204c4e2b Revert D34503882: Support CSR to COO conversion in to_sparse.
Test Plan: revert-hammer

Differential Revision:
D34503882 (84f4e9c10a)

Original commit changeset: 4a781647a0ae

Original Phabricator Diff: D34503882 (84f4e9c10a)

fbshipit-source-id: cf161171a3b51aa3c0f2b15501956873b1ba29dd
(cherry picked from commit 924c19071713777700087087b27b388eb057d8d9)
2022-03-01 15:33:37 +00:00
Natalia Gimelshein
4eb2482568 remove _s_where (#73468)
Summary:
Per title
Fixes https://github.com/pytorch/pytorch/issues/73135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73468

Reviewed By: albanD

Differential Revision: D34524207

Pulled By: ngimel

fbshipit-source-id: bc71e27b6d3fa50de6737533c92375266d9eadc5
(cherry picked from commit 047b925849370e6e4cbe9e3a722db52bb1e965b9)
2022-03-01 07:30:34 +00:00
Pearu Peterson
84f4e9c10a Support CSR to COO conversion in to_sparse. (#73471)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73471

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D34503882

Pulled By: cpuhrsch

fbshipit-source-id: 4a781647a0ae5d03827406b75b14acc7c48da0b0
(cherry picked from commit fa3dbdc6a8529d19f8a055494436ca1f766807be)
2022-03-01 06:31:52 +00:00
Kushashwa Ravi Shrimali
199d9a992c Sparse CSR CPU: cuSolverSP backend for linalg.solve (#71399)
Summary:
This PR introduces the `cuSolverSP` backend for `linalg.solve` with sparse CSR input matrices. The motivation comes from the issue: https://github.com/pytorch/pytorch/issues/69538.

`cuSolver` provides the [`cusolverSp<t>csrlsvluHost`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu) API; a few things to note:

1. As mentioned in the documentation: `only CPU (Host) path is provided.` From the profiling, there doesn't seem to be any GPU kernel launch for optimization; please see the profiling below.
2. Since only the `host` path is provided, the CPU path uses `csrlsvluHost` (but requires PyTorch to be installed/built with CUDA support).
3. The documentation mentions that reordering helps optimization, but it isn't clear how it affects performance. There are options for reordering, so we stick to `reorder = 0` as the default choice.

`cuSolver` has [`csrlsvqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr) function which provides a `device` path to solve the linear system. This function is used for the CUDA path in this PR.

**Gist:**

For CPU Path: we call [`csrlsvluHost` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu).
For CUDA Path: we call [`csrlsvqr` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr).

**Profiling:** (on a sparse input tensor of size 1000 x 1000, with a vector of length 1000), for the `csrlsvlu` function (to show there is no GPU optimization)

```
==3999651== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.1440us         1  2.1440us  2.1440us  2.1440us  [CUDA memcpy HtoD]
      API calls:   99.72%  1.07199s         9  119.11ms     500ns  1.07164s  cudaFree
                    0.11%  1.2182ms       398  3.0600us     140ns  137.94us  cuDeviceGetAttribute
                    0.06%  674.45us         4  168.61us  165.50us  173.64us  cuDeviceTotalMem
                    0.03%  357.07us         4  89.268us  2.7800us  201.89us  cudaMalloc
                    0.03%  309.29us         1  309.29us  309.29us  309.29us  cudaGetDeviceProperties
                    0.01%  160.47us       332     483ns     350ns  3.3300us  cudaFuncSetAttribute
                    0.01%  115.12us         4  28.780us  26.290us  33.410us  cuDeviceGetName
                    0.00%  28.591us         5  5.7180us     440ns  16.921us  cudaGetDevice
                    0.00%  22.061us         4  5.5150us     871ns  18.690us  cudaDeviceSynchronize
                    0.00%  20.370us        18  1.1310us     410ns  6.9900us  cudaEventDestroy
                    0.00%  16.390us         1  16.390us  16.390us  16.390us  cudaMemcpy
                    0.00%  11.540us         2  5.7700us  1.4900us  10.050us  cuDeviceGetPCIBusId
                    0.00%  10.510us        18     583ns     430ns  1.6200us  cudaEventCreateWithFlags
                    0.00%  7.9100us        21     376ns     290ns     700ns  cudaDeviceGetAttribute
                    0.00%  1.4300us         6     238ns     150ns     590ns  cuDeviceGet
                    0.00%  1.2200us         4     305ns     190ns     500ns  cuDeviceGetCount
                    0.00%     900ns         1     900ns     900ns     900ns  cuInit
                    0.00%     860ns         4     215ns     180ns     260ns  cuDeviceGetUuid
                    0.00%     240ns         1     240ns     240ns     240ns  cuDriverGetVersion
                    0.00%     230ns         1     230ns     230ns     230ns  cudaGetDeviceCount
```

Script:

```python
import torch

def solve(x, other, out):
    torch.linalg.solve(x, other, out=out)

if __name__ == "__main__":
    dense_inp = torch.randn((1000, 1000), dtype=torch.float64)
    # Set 50% of the values to 0 randomly
    dense_inp = torch.nn.functional.dropout(dense_inp, p=0.5)
    sparse_inp = dense_inp.to_sparse_csr()

    other = torch.randint(100, (1000,), dtype=torch.float64)
    out = torch.randint(1, (1000,), dtype=torch.float64)

    solve(sparse_inp, other, out)
```

The following error is raised when the function is used on a CPU device with PyTorch built/installed without CUDA support:

```python
/home/krshrimali/pytorch/torch/autograd/profiler.py:151: UserWarning: CUDA is not available, disabling CUDA profiling
  warn("CUDA is not available, disabling CUDA profiling")
Traceback (most recent call last):
  File "/home/krshrimali/pytorch/test_sp.py", line 17, in <module>
    solve(x, other, out)
  File "/home/krshrimali/pytorch/test_sp.py", line 5, in solve
    torch.linalg.solve(x, other, out=out)
RuntimeError: PyTorch was not built with CUDA support. Please use PyTorch built CUDA support
```

**Performance Comparison** (vs SciPy's [`scipy.sparse.linalg.spsolve`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.spsolve.html)):

Time taken by `scipy.sparse.linalg.spsolve` : 0.595 seconds

On CPU: Time taken by `torch.linalg.solve` : 4.565 seconds
On CUDA: Time taken by `torch.linalg.solve`: 1.838 seconds

The inputs are of dimensions: (17281, 17281) and (17281, 1), and were taken from https://math.nist.gov/MatrixMarket/extreme.html.

Thanks to IvanYashchuk for helping me with the PR, and guiding me through it.

cc: IvanYashchuk pearu nikitaved cpuhrsch

cc nikitaved pearu cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71399

Reviewed By: VitalyFedyunin

Differential Revision: D33767740

Pulled By: cpuhrsch

fbshipit-source-id: a945f065210cd719096eb8d7cdbf8e8937c2fce9
(cherry picked from commit f4f35c17da414e1ca6c6d91402933521857aa1ea)
2022-03-01 05:32:35 +00:00
Rohan Varma
6b424de338 [FSDP] Add state_dict() save/reload in parity test (#73366)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73366

Adds a state_dict() save/reload to the parity-with-DDP test to ensure
checkpointing doesn't cause issues with accuracy/model params.
ghstack-source-id: 150114251

Test Plan: CI

Reviewed By: fegin

Differential Revision: D34434358

fbshipit-source-id: fb0787486b383cfcbec7cc1325a486c8d9b1e2ea
(cherry picked from commit e3bcc7733cb5a497a640007044b1138dfee3a532)
2022-03-01 04:35:30 +00:00
Yanli Zhao
6b883d9933 Back out "[BE][DDP] enable rebuilt bucket when find_unused_parameters=True" (#73524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73524

Original commit changeset: 73284c3629ff

Original Phabricator Diff: D34410523 (a6c6f42c25)
ghstack-source-id: 150128700

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D34527951

fbshipit-source-id: 15c9d1b3e52b3f2e20fdbd3cf0cb9de78b824d2a
(cherry picked from commit 10ec5881fa1cb6675e13e4148b2ba157ebf39b19)
2022-03-01 04:35:30 +00:00
Howard Huang
6c8e516a80 Add pickling support for WorkerInfo (#73371)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73371

This PR allows the pybind-bound class `WorkerInfo` to be pickled. The class is pickled into a tuple of worker_name and rank in the format `(NAME, ID)`. This allows WorkerInfo to be passed as an argument for RPC calls.
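
A minimal sketch (assuming RPC is already initialized on workers named "worker0" and "worker1"):

```python
import torch.distributed.rpc as rpc

def report_caller(caller: rpc.WorkerInfo) -> str:
    # WorkerInfo exposes .name and .id; now that it pickles, it can be sent
    # directly as an RPC argument instead of a hand-rolled (name, id) tuple.
    return f"called by {caller.name} (id={caller.id})"

# On "worker0", something like:
#     rpc.rpc_sync("worker1", report_caller, args=(rpc.get_worker_info(),))
```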

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34458153

Pulled By: H-Huang

fbshipit-source-id: 7b8f99960bdc0e24021e252d8c8138bcb53f698c
(cherry picked from commit 8fb119bf760eef9f313a44e9287c9253cbb09cae)
2022-02-28 15:37:56 +00:00
Rohan Varma
540361fa53 [FSDP] full_state_dict impl (#73324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73324

Implements `state_dict` and `load_state_dict` APIs for FSDP, with the following limitations:

1. Does not support `state_dict_device` (i.e. specifying which device params should be on), which fairscale currently supports
2. Does not yet support offload of state_dict onto CPU
3. Loads state_dict on all ranks currently. In the future we could add support for loading this on only rank 0, to avoid redundancy across ranks as usually only one rank is responsible for saving/loading the model. Along with (2) this would enable larger models to have state_dict called.

As discussed in the FSDP checkpoint API proposal, `state_dict` will basically be a `full_state_dict` where the full parameters are returned on all ranks. As a result, this implies that the model must actually be able to fit on a single GPU.
ghstack-source-id: 150012240
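
A minimal usage sketch of the behavior and limitations above (assuming `fsdp_model` is an FSDP-wrapped module):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def save_and_reload(fsdp_model: FSDP, rank: int, path: str = "checkpoint.pt") -> None:
    # state_dict() gathers the full, unsharded parameters on every rank, so
    # the whole model must fit on a single GPU.
    full_sd = fsdp_model.state_dict()
    if rank == 0:
        torch.save(full_sd, path)
    # load_state_dict() is likewise called on all ranks for now (limitation 3).
    fsdp_model.load_state_dict(full_sd)
```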

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D34433514

fbshipit-source-id: 3eb1d679b2236264f9f423e761d1720f9aaec73a
(cherry picked from commit a451d5a08ebfa14a229a25fea35b9ca59fe91a59)
2022-02-27 19:32:22 +00:00
Sergii Dymchenko
285272f399 Fix undefined variable errors (#72838)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72838

Reviewed By: george-qi

Differential Revision: D34406757

Pulled By: kit1980

fbshipit-source-id: b7ab8b431eb5715fe2278ca0979542c332f1deab
(cherry picked from commit fd0cbebb16e4b8eb50103f6859c5f1f1e2a52968)
2022-02-25 11:28:53 +00:00
Yanli Zhao
a6c6f42c25 [BE][DDP] enable rebuilt bucket when find_unused_parameters=True (#73276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73276

There are two major cases when find_unused_parameters=True:
1. The grad-ready order does not change over iterations; in this case, enabling bucket rebuilding after the first iteration can potentially improve performance.
2. The grad-ready order changes over iterations; in this case, whether the first iteration uses a static or a dynamic bucket order does not matter much, as the order changes per iteration.
ghstack-source-id: 149820812

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D34410523

fbshipit-source-id: 73284c3629ff2696de76681f070b74ad2bb01f1b
(cherry picked from commit fa3a54bdd659669b776439190039ad889cf3371f)
2022-02-25 07:28:37 +00:00
Philip Meier
c6f1bbc0ac promote torch.testing to stable (#73348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73348

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34457727

Pulled By: mruberry

fbshipit-source-id: 2cc812b643e0d1e753bead2751ee79b3f03fde20
(cherry picked from commit bcdaca1a019a679b8b274e2fb5f19bfd08874ce9)
2022-02-25 06:30:31 +00:00
Philip Meier
14bcd3f681 cleanup torch.testing namespace (#72708)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72708

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34457728

Pulled By: mruberry

fbshipit-source-id: 8e017d2a1fd45f69533d1cdfd906d242b6b3ee68
(cherry picked from commit 8a2333a5668e64b45ab8cbc195e5e06383d49c0a)
2022-02-25 06:30:31 +00:00
Philip Meier
0415a64f3e deprecate torch.testing.make_non_contiguous (#72705)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72705

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34457731

Pulled By: mruberry

fbshipit-source-id: 3b9da1740248dd4dc0a799b91f94dfbd2034abad
(cherry picked from commit e71c35e0a561ddd26a6843938837982f07fd27e4)
2022-02-25 06:30:31 +00:00
Philip Meier
0973c5a1cc align signature of make_tensor with other creation ops (#72702)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72702

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34457729

Pulled By: mruberry

fbshipit-source-id: 83d580c4201eef946dc9cf4b9e28a3d36be55609
(cherry picked from commit aa4cf20fbeb4b795595729b8ac2e6ba7707d8283)
2022-02-25 06:30:31 +00:00
Pearu Peterson
5f310c5e27 Testing of masked reductions on mixed layout inputs. (#72398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72398

The design of this feature is discussed in https://github.com/pytorch/pytorch/pull/71239#discussion_r787292751

Test Plan: Imported from OSS

Reviewed By: george-qi

Differential Revision: D34408881

Pulled By: cpuhrsch

fbshipit-source-id: a362d4220957ea38b7e442df4ecf260ffe682eab
(cherry picked from commit 7fb3611130c08f1aa6ea708ca838708c13b0f01c)
2022-02-25 05:32:47 +00:00
Digant Desai
b2054d3025 Prepare for an update to the XNNPACK submodule (#72642)
Summary:
- Target Sha1: ae108ef49aa5623b896fc93d4298c49d1750d9ba
- Make USE_XNNPACK a dependent option on cmake minimum version 3.12
- Print USE_XNNPACK under the cmake options summary, and print the
  availability from collect_env.py
- Skip XNNPACK based tests when XNNPACK is not available
    - Add SkipIfNoXNNPACK wrapper to skip tests
- Update cmake version for xenial-py3.7-gcc5.4 image to 3.12.4
    - This is required for the backwards compatibility test.
      The PyTorch op schema is XNNPACK dependent. See,
      aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp for
      example. The nightly version is assumed to have USE_XNNPACK=ON,
      so with this change we ensure that the test build can also
      have XNNPACK.
- HACK: skipping test_xnnpack_integration tests on ROCM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72642

Reviewed By: kimishpatel

Differential Revision: D34456794

Pulled By: digantdesai

fbshipit-source-id: 85dbfe0211de7846d8a84321b14fdb061cd6c037
(cherry picked from commit 6cf48e7b64d6979962d701b5d493998262cc8bfa)
2022-02-25 00:39:15 +00:00
Rohan Varma
199d1cb9dd [FSDP][BE] remove get_full_params() from test code (#73242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73242

Can use summon_full_params instead.
ghstack-source-id: 149800364

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34399789

fbshipit-source-id: 8552cdf3ed003aba1316f554f4ec457fdada5dbe
(cherry picked from commit a397e2dfd3750afe1d21cdee3aa4c2d525ed837e)
2022-02-24 19:39:32 +00:00
Rohan Varma
e10cd88648 [FSDP] summon_full_params fix (#73314)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73314

Needs to synchronize the all_gather stream. The added test fails without this
fix.
ghstack-source-id: 149800363

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34430602

fbshipit-source-id: 4ce07e2d098a4f07ac640285db1d0ff64fd42232
(cherry picked from commit 24c756e7bba69017b9358bf824589b2aeb366b5e)
2022-02-24 19:39:32 +00:00
Jongsoo Park
fffb97f3cb [torch] do not fold bmm into mm when tensor1 dim==3 but not contiguous (#73115)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73115

matmul for [B, M, K] x [K, N] was mapped to mm by folding the first 2 dims of tensor1, giving [BxM, K] x [K, N], but when M and K are transposed it's better to use bmm to avoid data movement.

We could generalize the condition under which we don't fold (see more details in the comment), but we are being conservative here to be cautious about potential unintended regressions.

Test Plan:
In the following simple test case, before this diff

0.00652953577041626 0.003044447898864746
Permutation takes about the same time as the GEMM

After this diff
0.002983328104019165 0.0030336639881134034
Permutation overhead essentially went away.

```
B = 128
M = 1024
N = 128
K = 1024

X = torch.rand(B, K, M).cuda()
b = torch.rand(N).cuda()
W = torch.rand(N, K).cuda()
X = X.permute(0, 2, 1)
Y = F.linear(X, W, b)

X_contiguous = X.contiguous()
Y_ref = F.linear(X_contiguous, W, b)

torch.testing.assert_close(Y, Y_ref)

t1, _ = benchmark_torch_function(F.linear, X, W, b, 0)

t2, _ = benchmark_torch_function(F.linear, X_contiguous, W, b, 0)

print(t1, t2)
```

Reviewed By: ngimel

Differential Revision: D34350990

fbshipit-source-id: 73e99f785a405cf7a92b909b16f2022b48b1660f
(cherry picked from commit bec995b899710991bb2a304a8009a67f38244114)
2022-02-24 06:29:22 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566
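
A minimal sketch of the new APIs (assuming they are exposed under `torch.distributed`, as described above):

```python
import torch.distributed as dist

# Set the debug level programmatically instead of only via the
# TORCH_DISTRIBUTED_DEBUG environment variable.
dist.set_debug_level(dist.DebugLevel.DETAIL)
print(dist.get_debug_level())   # DebugLevel.DETAIL

# Or re-read it from TORCH_DISTRIBUTED_DEBUG.
dist.set_debug_level_from_env()
```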

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Nikita Karetnikov
75db05c3fd Check if the iterator is valid before dereferencing it (#72405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72405

Fixes #71674.

This shouldn't segfault now:

```
import torch
d = torch.complex64
torch.set_default_dtype(d)
```

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D34423660

Pulled By: anjali411

fbshipit-source-id: cac92a6f56846f2c0727a120b5f568aa75baa21e
(cherry picked from commit eaab813a0fddced24303b3bd50e4fcdba1516e46)
2022-02-23 18:33:46 +00:00
Rohan Varma
50efa3a6e8 Skip optimizer overlap tests that have issues with NCCL async error handling (#73261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73261

Skip these tests, which sometimes have issues on unrelated PRs such as
https://github.com/pytorch/pytorch/runs/5291461671?check_suite_focus=true. See
https://github.com/pytorch/pytorch/issues/73259 for additional details.
ghstack-source-id: 149707988

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34404857

fbshipit-source-id: 7889a66730679133100ad022ad9a9934fc5bc9b1
(cherry picked from commit c2b65df9c8b81eab6a31d2827c09cea304f714f6)
2022-02-23 16:31:21 +00:00
Pruthvi Madugundu
595a51b951 [ROCm] Enable sort operator BF16 support (#72854)
Summary:
The changes add BF16 dtype support for the sort operator on ROCm.

Relates - https://github.com/pytorch/pytorch/pull/58196

Relanding the change - https://github.com/pytorch/pytorch/pull/71226

jeffdaily jithunnair-amd dllehr-amd Please review this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72854

Reviewed By: zou3519

Differential Revision: D34284313

Pulled By: malfet

fbshipit-source-id: abcfea84ea53874008d56416425849e990ebf15b
(cherry picked from commit e9e7e3e047)
2022-02-23 15:28:15 +00:00
Xiao Wang
2051068233 Change how cuda available memory is calculated in largeTensorTest decorator (#72207)
Summary:
Related PR https://github.com/pytorch/pytorch/issues/45332

Related discussion https://github.com/pytorch/pytorch/pull/45332#issuecomment-985996064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72207

Reviewed By: ngimel

Differential Revision: D34387921

Pulled By: mruberry

fbshipit-source-id: 2d842a25a5d3d1fc48917ba8fb29ff96d7bc2650
(cherry picked from commit 01a9e980c7)
2022-02-23 02:31:42 +00:00
Samantha Andow
53faf78143 expanded weights without fast rules (#70140)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70140

[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights

Introduces the ExpandedWeights mechanism and user-facing API without any custom-implemented, faster rules.
 - User facing API is in `_stateless.py` (with documentation)
 - Testing is in test_expanded_weights
 - The rest is the implementation of the erroring fallback + the mechanism for being able to register faster per sample grad rules. Only linear is implemented here, but they are all implemented in #70141

Test Plan: Imported from OSS

Reviewed By: mikaylagawarecki

Differential Revision: D34350950

Pulled By: samdow

fbshipit-source-id: 69c664b0bc3dff6951358d79d7e5d94882f7aef2
(cherry picked from commit ae1620d3b6)
2022-02-22 20:35:16 +00:00
Nikita Shulga
cfb6c942fe scatter_reduce documentation (#73125)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/68580 (which was milestoned for 1.11) plus a partial revert of https://github.com/pytorch/pytorch/pull/72543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73125

Reviewed By: bdhirsh

Differential Revision: D34355217

Pulled By: malfet

fbshipit-source-id: 325ecdeaf53183d653b44ee5e6e8839ceefd9200
(cherry picked from commit 71db31748a)
2022-02-22 19:33:46 +00:00