Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73863
This PR fully aligns the convert function with the design: https://github.com/pytorch/rfcs/blob/master/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md
and simplifies the implementation of the convert function by always producing a reference quantized model (with reference patterns) first,
and then lowering the model to a quantized model that is runnable with the PyTorch native backends (fbgemm/qnnpack).
This PR makes convert.py much easier to understand than the previous implementation, and we are able to remove the majority of the code
in quantization_patterns.py as well (in follow-up PRs).
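For context, a minimal sketch of the two-step flow this PR standardizes (produce a reference quantized model, then lower it for the native backend); exact API names and signatures vary across PyTorch versions, so treat this as illustrative rather than the PR's exact interface:
```python
import copy
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx, convert_to_reference_fx

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

prepared = prepare_fx(model, get_default_qconfig_mapping("fbgemm"), example_inputs)
prepared(*example_inputs)  # calibration

# Step 1: reference quantized model (quantize/dequantize ops + reference patterns).
reference = convert_to_reference_fx(copy.deepcopy(prepared))
# Step 2: lowering to a model runnable with the native fbgemm/qnnpack backend
# (after this PR, convert_fx performs both steps internally).
quantized = convert_fx(prepared)
```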
Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestFXNumericSuiteCoreAPIs
python test/test_quantization.py TestFXNumericSuiteCoreAPIsModels
```
and other internal/oss regression tests
Imported from OSS
Reviewed By: andrewor14
Differential Revision: D34778506
fbshipit-source-id: 0678b66addf736039a8749b352f6f569caca962b
(cherry picked from commit 33ec9caf23f3ab373d827117efbd9db0668b2437)
Summary:
The ONNX spec for Einsum requires all inputs to have the same dtype.
The PyTorch runtime does not allow executing aten::einsum with
mismatched types by default, so the export would never succeed.
However, when the model is wrapped in `torch.autocast()`,
the run succeeds and the ONNX converter creates an Einsum ONNX node
with mismatched input types, which is not allowed by the aforementioned schema.
This PR adds onnx::Einsum to the Autocast enabled list, so that it outputs lower-precision tensors.
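A hypothetical repro sketch (assumes a CUDA device is available); it illustrates einsum receiving mixed-precision inputs under autocast, which is the situation the export hits:
```python
import torch

a = torch.randn(4, 8, device="cuda")          # float32
b = torch.randn(8, 16, device="cuda").half()  # float16
with torch.autocast("cuda"):
    # With einsum on the autocast list, both inputs are cast to the lower-precision
    # dtype, so the exported ONNX Einsum node sees a single dtype.
    out = torch.einsum("ij,jk->ik", a, b)
print(out.dtype)
```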
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71916
Reviewed By: ngimel
Differential Revision: D34629666
Pulled By: malfet
fbshipit-source-id: ec757bb87190a5b7512969e10a32450e9e1f87a1
(cherry picked from commit 7f2b5a6408ae34a6b9f858c3e9f5970b64ca1b4b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842
**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for the sweeping formatting changes mixed in with the actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.
The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values (see the sketch after this list)
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
- For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
- The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
- A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
- The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.
- There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
- This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
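As a reference for the `parametrize` change above, here is a minimal sketch (hypothetical test, assuming `torch.testing._internal.common_utils`) of how parametrized tests replace manual `for` loops:
```python
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class TestExample(TestCase):
    # Each (maximize, momentum) combination becomes its own generated test case.
    @parametrize("maximize", [False, True])
    @parametrize("momentum", [0.0, 0.9])
    def test_something(self, maximize, momentum):
        self.assertIsInstance(maximize, bool)
        self.assertIsInstance(momentum, float)

instantiate_parametrized_tests(TestExample)

if __name__ == "__main__":
    run_tests()
```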
**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34675709
Pulled By: awgu
fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72301
First step in resolving #35026.
This adds `PythonRecordFunction`, which is a `torch::CustomClassHolder`
for `at::RecordFunction`, to keep the ATen code free of torch includes.
It also adds a new, currently unused internal API function,
`_record_function_enter_new`, which returns the torchbind object.
Once the FC (forward compatibility) period expires, `torch.profiler.record_function` will
be updated to use this new internal API. Then, once the BC (backward compatibility) period
expires, the cpp_custom_type_hack-based API can be removed.
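For reference, a minimal sketch of the user-facing `torch.profiler.record_function` API whose internals this stack migrates (illustrative usage only):
```python
import torch
from torch.profiler import profile, record_function

with profile() as prof:
    with record_function("my_block"):  # annotates this region in the profiler trace
        torch.randn(64, 64).mm(torch.randn(64, 64))

print(prof.key_averages().table(sort_by="cpu_time_total"))
```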
Test Plan: Imported from OSS
Reviewed By: dagitses
Differential Revision: D34586311
Pulled By: robieta
fbshipit-source-id: d3eb9ffad7b348548a2b22c75203a92d1cb5115b
(cherry picked from commit 92d2ca808e5fbd20c9d6645dcabc3f059f9ef2d3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73741
There are probably more perf improvements that could be made, for example reusing more quantities from the forward pass and doing more things in place, but in the spirit of improving coverage, this is probably OK for now.
Note: I didn't do anything with half_to_float, but CUDA (locally) hasn't complained yet.
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D34690141
Pulled By: soulitzer
fbshipit-source-id: fe934e191fee2c8e956d7a5f4b553923adf1b33f
(cherry picked from commit ae49aff7f7c8496e04a3ce7667d8f068ca0a52ec)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73535
**Overview**
- This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker.
- This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not propagating to the `FullyShardedDataParallel` constructor.
- This adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like:
```
<built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error
```
NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading.
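A minimal sketch (hypothetical training loop, assuming an initialized process group and an FSDP-wrapped model) contrasting the two accumulation modes:
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def accumulate(model: FSDP, optimizer, batches, use_no_sync: bool):
    optimizer.zero_grad()
    if use_no_sync:
        # Less communication, more GPU memory: gradients stay unsharded locally
        # until the final micro-batch triggers the reduce-scatter.
        with model.no_sync():
            for batch in batches[:-1]:
                model(batch).sum().backward()
        model(batches[-1]).sum().backward()
    else:
        # More communication, less memory: every backward reduce-scatters gradients.
        for batch in batches:
            model(batch).sum().backward()
    optimizer.step()
```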
**Test Plan**
I augmented the tests to cover gradient accumulation that interleaves iterations accumulating with and without `no_sync()`.
After this diff:
- QPS (ResNet): f328439897
- QPS (RoBERTa): f328440141
- Accuracy: f328442119
Before this diff (trunk):
- QPS (ResNet): f328432756
- QPS (RoBERTa): f328436766
- Accuracy: f328437896
Test Plan: Imported from OSS
Reviewed By: zhaojuanmao
Differential Revision: D34533546
Pulled By: awgu
fbshipit-source-id: 821d762dfad5f2b1e59adcb8e5cb7c277399040c
(cherry picked from commit 746a5ea2720dcf87c376229b405a318396fe5769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73771
The runtime for this test doesn't actually depend on the timeout value
specified here. As a result, we increase the timeout to avoid flakiness.
https://ossci-raw-job-status.s3.amazonaws.com/log/4666724994 is an example of
where this test failed due to a small timeout as reported in
https://github.com/pytorch/pytorch/issues/70546
ghstack-source-id: 150507765
Test Plan:
1) waitforbuildbot
2) run the unit test
Reviewed By: mrshenli
Differential Revision: D34632204
fbshipit-source-id: ffe0f40d08f7a36f90f30f493a189608897bbb4c
(cherry picked from commit a4920a4bfcbd26967567b55ee8417e994d53df49)
Summary:
Implement hierarchical model averaging proposed in https://github.com/pytorch/pytorch/issues/71325.
Unit tests are added. Since I don't have access to 4-GPU machines in the open-source environment, I expect that a branch with the `ci-all` prefix can run the test that requires 4 GPUs.
In the future, the internals of `PeriodicModelAverager` can be simplified into an implementation of a specialized hierarchical model averaging, where `period_group_size_dict` has only a single pair of period and world size.
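A minimal sketch of how the hierarchical averager might be used (assuming it is exposed under `torch.distributed.algorithms.model_averaging.hierarchical_model_averager`; names and signatures are illustrative):
```python
from collections import OrderedDict

import torch.distributed.algorithms.model_averaging.hierarchical_model_averager as hma

# Assumption: 16 ranks total. Average within groups of 4 ranks every 2 steps,
# and across all 16 ranks every 8 steps.
averager = hma.HierarchicalModelAverager(
    period_group_size_dict=OrderedDict([(2, 4), (8, 16)]),
    warmup_steps=10,
)

# Inside the training loop, after optimizer.step():
#     averager.average_parameters(model.parameters())
```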
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73285
Reviewed By: mrshenli
Differential Revision: D34457792
Pulled By: rohan-varma
fbshipit-source-id: 39a6c5bf8a2852b6394a56abbad17b8a909b9fba
(cherry picked from commit 5f543d46103edb515db199dbb80db43c85665f29)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72843
# [Debug Story] Training Hanging and DDP Bucketing
**What are the characteristics of the hanging training instance?**
The model uses TorchRec `PooledEmbeddingArch` and corresponding sharding solution.
The model config difference to trigger this hanging issue is turning on position weighted embedding tables.
A feature processor module, `GroupedPositionWeightedModule`, is constructed on all ranks, but `GroupedPositionWeightedModule.forward(...)` is only [called on a subset of the ranks of the whole world](https://fburl.com/code/yqrmtvli).
**What was the initial manifested error?**
The training was stuck in the first iteration.
**What are useful debugging tools this time?**
After turning off [static_graph in DDP](https://fburl.com/code/4io81p5i), we saw sparse feature lengths becoming negative after all-to-all collectives; the hang became a fatal failure.
After turning on [torch.distributed DETAIL debugging mode](https://fburl.com/code/cp8e28mm), we saw 2 trainers send out mismatched collectives, one doing an all-to-all, the other doing an all-reduce. So we knew the negative values came from an all-to-all being matched with an all-reduce; the real error had happened earlier, namely the wrong timing of either the all-reduce or the all-to-all.
With more logging added inside DDP, it turned out that DDP decided to do the all-reduce at different times on different ranks.
**What is DDP bucketing?**
Once a gradient is ready on a rank, DDP uses all-reduce to synchronize the average of this gradient across all ranks.
Say we have 4 tensor ops. A, B, C, D.
In the most naive version, we could do one synchronization when all gradients in the full backward graph are ready.
The time sequence would be,
* D.grad
* C.grad
* B.grad
* A.grad
* All reduce on [D.grad, C.grad, B.grad, A.grad].
But that would be a huge waste of communication channel bandwidth.
With DDP bucketing, we can start some gradient synchronization earlier, bucket by bucket. The above time sequence now becomes:
* D.grad
* C.grad
* All reduce on [D.grad, C.grad].
* B.grad
* A.grad
* All reduce on [B.grad, A.grad].
Because gradient computation now overlaps with communication, the bucketing technique gives better DDP execution performance.
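For concreteness, a minimal sketch (hypothetical model, assuming an already-initialized process group) of the DDP settings that interact with bucketing in this story:
```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Sequential(             # stand-ins for ops A, B, C, D
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
)
ddp_model = DDP(
    model,
    bucket_cap_mb=25,              # controls how many gradients share one all-reduce
    find_unused_parameters=True,   # needed when some ranks skip sub-modules, as here
)
```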
**What exactly went wrong in this case?**
1. The bucketing doesn’t honor backward graph execution order.
2. There are other collectives comm ops in backward graph.
3. There are unused parameters (i.e unused sub-module) in subset ranks of the whole world.
Using the above example again, we have 4 tensor ops. A, B, C, D.
Say we have 2 trainers, trainer_0 and trainer_1.
B is the feature processor module.
B only runs on trainer_0 (both forward and backward), but not on trainer_1.
C is the all-to-all (pooled embeddings distribution).
C sends out an all-to-all collective in both its forward and backward pass.
Assume all other ops run on both trainers.
trainer_0 op sequence is,
A, B (feature preproc), C (all-to-all), D | D.grad, C.grad (reverse all-to-all), B.grad (feature proc grads), A.grad
trainer_1 op sequence is,
A, C (all-to-all), D | D.grad, C.grad (reverse all-to-all), A.grad
Even though the correct bucketing should be (same bucketing for both ranks),
* bucket_0, [D.grad, C.grad]
* bucket_1, [B.grad, A.grad]
but because of 1), they end up like,
* bucket_0, [B.grad, D.grad]
* bucket_1, [C.grad, A.grad]
Combined with 2) and 3), the time sequence could look like:
(check mark represents the gradient is ready)
(bucket is ready to do synchronization if all its enclosing gradients are ready)
* trainer_0
* t0,
* D.grad
* bucket_0, [B.grad, D.grad ✓]
* t1,
* **C.grad all-to-all**
* C.grad ✓
* bucket_1, [C.grad ✓, A.grad]
* t2
* B.grad
* bucket_0, [B.grad ✓, D.grad ✓] ✓
* t3
* All-reduce for bucket_0
* t4
* A.grad
* bucket_1, [C.grad ✓, A.grad ✓] ✓
* trainer_1
* t0,
* D.grad
* bucket_0, [B.grad ✓, D.grad ✓] ✓. (Because B is not used on trainer_1, DDP marks its gradient as ready immediately.)
* t1,
* **All-reduce for bucket_0**
* t2
* C.grad all-to-all
* bucket_1, [C.grad ✓, A.grad]
* t3
* A.grad
* bucket_1, [C.grad ✓, A.grad ✓] ✓
This is why trainer_0 all-to-all is matched up with trainer_1 all-reduce.
**What is the solution for fixing DDP?**
Disable DDP bucketing for the first iteration. D34051938
This works because, after the first iteration, buckets are rebuilt based on the real backward-graph execution order.
So the slower gradient synchronization only affects the first iteration.
Test Plan:
buck build mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn
BACKEND=gloo WORLD_SIZE=3 buck-out/gen/caffe2/test/distributed/distributed_gloo_spawn\#binary.par -r test_ddp_logging_data_cpu
P484179296
buck build mode/dev-nosan caffe2/test/distributed:distributed_nccl_spawn
BACKEND=nccl WORLD_SIZE=2 buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn\#binary.par -r test_ddp_logging_data_cpu -r test_ddp_get_bucket_sizes
P484177200
Reviewed By: zhaojuanmao
Differential Revision: D34051938
fbshipit-source-id: 0c7f35875687095c3199f19990e73a8349b6e5b9
(cherry picked from commit bb8f11306ea51c2bd3ffd3ab001d62ce369a08ee)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73642
Re-land of https://github.com/pytorch/pytorch/pull/73471, which was reverted due to lack of `to_sparse(sparse_dim)` support.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D34580353
Pulled By: cpuhrsch
fbshipit-source-id: a8a4ea381daeb80d8365fe931af9f55a7e789ea1
(cherry picked from commit 5a3cf8110980e5a10dbb687e87e67d5524ebf2f5)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73116
Users may need summon_full_params() to get the original parameters.
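A minimal sketch (assuming an FSDP-wrapped `fsdp_model`; in recent versions `summon_full_params` is exposed as a context manager on `FullyShardedDataParallel`):
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Inside the context, parameters are gathered back to their full, unsharded form.
with FSDP.summon_full_params(fsdp_model):
    for name, param in fsdp_model.named_parameters():
        print(name, tuple(param.shape))
```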
ghstack-source-id: 150134237
Test Plan: CI
Reviewed By: rohan-varma
Differential Revision: D34353034
fbshipit-source-id: ac69cc032da177903cd9969094f3f82dc6a61636
(cherry picked from commit 55d34fdee3778110a165a13ae987d0339e8d33c7)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73551
Rename to better indicate what it is.
ghstack-source-id: 150166352
Test Plan: CI
Reviewed By: awgu
Differential Revision: D34537964
fbshipit-source-id: 5465003c2a2fd6f1a2646c375bc7c11d297e3f9e
(cherry picked from commit 9f11bdef88c7886b59fedc939e7149872ad73453)
Summary:
This PR introduces the `cuSolverSP` backend for `linalg.solve` with sparse CSR input matrices. The motivation comes from the issue: https://github.com/pytorch/pytorch/issues/69538.
`cuSolver` provides the [`cusolverSp<t>csrlsvluHost`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu) API; a few things to note:
1. As mentioned in the documentation: `only CPU (Host) path is provided.` The profiling below confirms that no GPU kernels are launched for acceleration.
2. Since only a `host` path is provided, the CPU path uses `csrlsvluHost` (but requires PyTorch to be installed/built with CUDA support).
3. The documentation mentions that reordering helps performance, but it isn't clear how much it affects it. There are several reordering options; we stick with `reorder = 0` as the default choice.
`cuSolver` also has the [`csrlsvqr`](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr) function, which provides a `device` path to solve the linear system. This function is used for the CUDA path in this PR.
**Gist:**
For CPU Path: we call [`csrlsvluHost` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu).
For CUDA Path: we call [`csrlsvqr` function of cuSolver](https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvqr).
**Profiling:** (on a sparse input tensor of size 1000 x 1000, with a vector of length 1000), for the `csrlsvlu` function (showing that there is no GPU acceleration)
```cpp
==3999651== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 2.1440us 1 2.1440us 2.1440us 2.1440us [CUDA memcpy HtoD]
API calls: 99.72% 1.07199s 9 119.11ms 500ns 1.07164s cudaFree
0.11% 1.2182ms 398 3.0600us 140ns 137.94us cuDeviceGetAttribute
0.06% 674.45us 4 168.61us 165.50us 173.64us cuDeviceTotalMem
0.03% 357.07us 4 89.268us 2.7800us 201.89us cudaMalloc
0.03% 309.29us 1 309.29us 309.29us 309.29us cudaGetDeviceProperties
0.01% 160.47us 332 483ns 350ns 3.3300us cudaFuncSetAttribute
0.01% 115.12us 4 28.780us 26.290us 33.410us cuDeviceGetName
0.00% 28.591us 5 5.7180us 440ns 16.921us cudaGetDevice
0.00% 22.061us 4 5.5150us 871ns 18.690us cudaDeviceSynchronize
0.00% 20.370us 18 1.1310us 410ns 6.9900us cudaEventDestroy
0.00% 16.390us 1 16.390us 16.390us 16.390us cudaMemcpy
0.00% 11.540us 2 5.7700us 1.4900us 10.050us cuDeviceGetPCIBusId
0.00% 10.510us 18 583ns 430ns 1.6200us cudaEventCreateWithFlags
0.00% 7.9100us 21 376ns 290ns 700ns cudaDeviceGetAttribute
0.00% 1.4300us 6 238ns 150ns 590ns cuDeviceGet
0.00% 1.2200us 4 305ns 190ns 500ns cuDeviceGetCount
0.00% 900ns 1 900ns 900ns 900ns cuInit
0.00% 860ns 4 215ns 180ns 260ns cuDeviceGetUuid
0.00% 240ns 1 240ns 240ns 240ns cuDriverGetVersion
0.00% 230ns 1 230ns 230ns 230ns cudaGetDeviceCount
```
Script:
```python
import torch

def solve(x, other, out):
    torch.linalg.solve(x, other, out=out)

if __name__ == "__main__":
    dense_inp = torch.randn((1000, 1000), dtype=torch.float64)
    # Set 50% of the values to 0 randomly
    dense_inp = torch.nn.functional.dropout(dense_inp, p=0.5)
    sparse_inp = dense_inp.to_sparse_csr()
    other = torch.randint(100, (1000,), dtype=torch.float64)
    out = torch.randint(1, (1000,), dtype=torch.float64)
    solve(sparse_inp, other, out)
```
The following error is raised when the function is used on a CPU device with PyTorch built/installed without CUDA support:
```python
/home/krshrimali/pytorch/torch/autograd/profiler.py:151: UserWarning: CUDA is not available, disabling CUDA profiling
warn("CUDA is not available, disabling CUDA profiling")
Traceback (most recent call last):
File "/home/krshrimali/pytorch/test_sp.py", line 17, in <module>
solve(x, other, out)
File "/home/krshrimali/pytorch/test_sp.py", line 5, in solve
torch.linalg.solve(x, other, out=out)
RuntimeError: PyTorch was not built with CUDA support. Please use PyTorch built CUDA support
```
**Performance Comparison** (vs SciPy's [`scipy.sparse.linalg.spsolve`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.spsolve.html)):
Time taken by `scipy.sparse.linalg.spsolve` : 0.595 seconds
On CPU: Time taken by `torch.linalg.solve` : 4.565 seconds
On CUDA: Time taken by `torch.linalg.solve`: 1.838 seconds
The inputs are of dimensions: (17281, 17281) and (17281, 1), and were taken from https://math.nist.gov/MatrixMarket/extreme.html.
Thanks to IvanYashchuk for helping me with the PR, and guiding me through it.
cc: IvanYashchuk pearu nikitaved cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71399
Reviewed By: VitalyFedyunin
Differential Revision: D33767740
Pulled By: cpuhrsch
fbshipit-source-id: a945f065210cd719096eb8d7cdbf8e8937c2fce9
(cherry picked from commit f4f35c17da414e1ca6c6d91402933521857aa1ea)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73366
Adds a state_dict() save/reload to the parity-with-DDP test to ensure
checkpointing doesn't cause issues with accuracy/model params.
ghstack-source-id: 150114251
Test Plan: CI
Reviewed By: fegin
Differential Revision: D34434358
fbshipit-source-id: fb0787486b383cfcbec7cc1325a486c8d9b1e2ea
(cherry picked from commit e3bcc7733cb5a497a640007044b1138dfee3a532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73371
This PR allows the pybind-ed class `WorkerInfo` to be pickled. The class is pickled into a tuple of worker name and rank in the format `(NAME, ID)`. This allows `WorkerInfo` to be passed as an argument to RPC calls.
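A minimal sketch (hypothetical two-worker setup, assuming `rpc.init_rpc` has already been called) of passing a `WorkerInfo` as an RPC argument, which relies on it being picklable as `(NAME, ID)`:
```python
import torch.distributed.rpc as rpc

def greet(info: rpc.WorkerInfo) -> str:
    return f"hello from {info.name} (id={info.id})"

# On an initialized worker:
#     me = rpc.get_worker_info()
#     reply = rpc.rpc_sync("worker1", greet, args=(me,))
```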
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D34458153
Pulled By: H-Huang
fbshipit-source-id: 7b8f99960bdc0e24021e252d8c8138bcb53f698c
(cherry picked from commit 8fb119bf760eef9f313a44e9287c9253cbb09cae)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73324
Implements `state_dict` and `load_state_dict` APIs for FSDP, with the following limitations:
1. Does not support `state_dict_device` (i.e. specifying which device params should be on), which fairscale currently supports
2. Does not yet support offload of state_dict onto CPU
3. Loads state_dict on all ranks currently. In the future we could add support for loading this on only rank 0, to avoid redundancy across ranks, as usually only one rank is responsible for saving/loading the model. Along with (2), this would enable `state_dict` to be called on larger models.
As discussed in the FSDP checkpoint API proposal, `state_dict` will basically be a `full_state_dict` where full parameters are returned on all ranks. This implies that the model must actually be able to fit on a single GPU.
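A minimal usage sketch (assuming an FSDP-wrapped `fsdp_model`); since `state_dict()` returns full parameters on all ranks, the full model must fit on a single GPU:
```python
# Save: every rank gets the full, unsharded state dict.
state = fsdp_model.state_dict()

# Load: called on all ranks (per limitation 3 above).
fsdp_model.load_state_dict(state)
```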
ghstack-source-id: 150012240
Test Plan: ci
Reviewed By: zhaojuanmao
Differential Revision: D34433514
fbshipit-source-id: 3eb1d679b2236264f9f423e761d1720f9aaec73a
(cherry picked from commit a451d5a08ebfa14a229a25fea35b9ca59fe91a59)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73276
There are two major cases when find_unused_parameters=True:
1. The grad-ready order does not change over iterations; in this case, enabling bucket rebuilding after the first iteration can potentially improve performance.
2. The grad-ready order changes over iterations; in this case, whether we use a static or a dynamic bucket order in the first iteration does not matter much, since the order changes every iteration.
ghstack-source-id: 149820812
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D34410523
fbshipit-source-id: 73284c3629ff2696de76681f070b74ad2bb01f1b
(cherry picked from commit fa3a54bdd659669b776439190039ad889cf3371f)
Summary:
- Target Sha1: ae108ef49aa5623b896fc93d4298c49d1750d9ba
- Make USE_XNNPACK a dependent option on cmake minimum version 3.12
- Print USE_XNNPACK under the cmake options summary, and print its
availability from collect_env.py
- Skip XNNPACK based tests when XNNPACK is not available
- Add SkipIfNoXNNPACK wrapper to skip tests
- Update cmake version for xenial-py3.7-gcc5.4 image to 3.12.4
- This is required for the backwards compatibility test.
The PyTorch op schema is XNNPACK-dependent; see
aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp for an
example. The nightly version is assumed to have USE_XNNPACK=ON,
so with this change we ensure that the test build can also
have XNNPACK.
- HACK: skipping test_xnnpack_integration tests on ROCM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72642
Reviewed By: kimishpatel
Differential Revision: D34456794
Pulled By: digantdesai
fbshipit-source-id: 85dbfe0211de7846d8a84321b14fdb061cd6c037
(cherry picked from commit 6cf48e7b64d6979962d701b5d493998262cc8bfa)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73314
We need to synchronize the all_gather stream. The added test fails without this
fix.
ghstack-source-id: 149800363
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34430602
fbshipit-source-id: 4ce07e2d098a4f07ac640285db1d0ff64fd42232
(cherry picked from commit 24c756e7bba69017b9358bf824589b2aeb366b5e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73115
matmul for [B, M, K] x [K, N] was mapped to mm by folding the first two dims of tensor1, giving [BxM, K] x [K, N], but when M and K are transposed it's better to use bmm to avoid data movement.
We could generalize the condition under which we don't fold (see the comment for more details), but we are being conservative here to be cautious about potential unintended regressions.
Test Plan:
In the following simple test case, before this diff:
0.00652953577041626 0.003044447898864746
The permutation takes about the same time as the GEMM itself.
After this diff:
0.002983328104019165 0.0030336639881134034
The permutation overhead essentially went away.
```
import torch
import torch.nn.functional as F
# Note: benchmark_torch_function is a local benchmarking helper (not defined in
# this snippet) that returns the measured time and the call's result.

B = 128
M = 1024
N = 128
K = 1024
X = torch.rand(B, K, M).cuda()
b = torch.rand(N).cuda()
W = torch.rand(N, K).cuda()
X = X.permute(0, 2, 1)  # transposed, non-contiguous view: matmul should use bmm instead of folding to mm
Y = F.linear(X, W, b)
X_contiguous = X.contiguous()
Y_ref = F.linear(X_contiguous, W, b)
torch.testing.assert_close(Y, Y_ref)
t1, _ = benchmark_torch_function(F.linear, X, W, b, 0)
t2, _ = benchmark_torch_function(F.linear, X_contiguous, W, b, 0)
print(t1, t2)
```
Reviewed By: ngimel
Differential Revision: D34350990
fbshipit-source-id: 73e99f785a405cf7a92b909b16f2022b48b1660f
(cherry picked from commit bec995b899710991bb2a304a8009a67f38244114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
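A minimal sketch of the new APIs (assuming they are exposed under `torch.distributed`, mirroring the `TORCH_DISTRIBUTED_DEBUG` environment variable):
```python
import torch.distributed as dist

dist.set_debug_level(dist.DebugLevel.DETAIL)   # same effect as TORCH_DISTRIBUTED_DEBUG=DETAIL
print(dist.get_debug_level())                  # DebugLevel.DETAIL

# Re-read the level from the environment variable:
dist.set_debug_level_from_env()
```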
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70140
[Design Doc for Expanded Weights](https://gist.github.com/samdow/fa0a164fec7963f93ff45284989cfc55) <-- gives an overview of the design for Expanded Weights
Introduces the ExpandedWeights mechanism and user-facing API without any custom-implemented, faster rules.
- User facing API is in `_stateless.py` (with documentation)
- Testing is in test_expanded_weights
- The rest is the implementation of the erroring fallback plus the mechanism for registering faster per-sample grad rules. Only linear is implemented here, but they are all implemented in #70141
Test Plan: Imported from OSS
Reviewed By: mikaylagawarecki
Differential Revision: D34350950
Pulled By: samdow
fbshipit-source-id: 69c664b0bc3dff6951358d79d7e5d94882f7aef2
(cherry picked from commit ae1620d3b6)