Summary:
Reland of https://github.com/pytorch/pytorch/pull/72623, which was reverted because the TLS cleanup was removed.
From closer inspection of how the number of available keys is counted, I think there is one more than expected, since the guard is actually one past the last usable key. With this updated assert, the last usable key will still be <= 63, which fits just fine.
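As an illustration of the off-by-one reasoning only (the enum below is a hypothetical stand-in, not the real C++ DispatchKey):
```
# Illustration only: the guard value sits one past the last usable key, so
# asserting guard <= 64 still guarantees every usable key index is <= 63 and
# therefore fits in a 64-bit bitset.
import enum

class Key(enum.IntEnum):
    CPU = 0
    CUDA = 1
    NumKeys = 2   # guard: one past the last usable key

assert Key.NumKeys <= 64       # the updated assert described above
assert Key.NumKeys - 1 <= 63   # so the last usable key still fits
```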
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72832
Reviewed By: H-Huang
Differential Revision: D34228571
Pulled By: albanD
fbshipit-source-id: ce5e10a841ea87386727346cfc8d9327252574c4
(cherry picked from commit 59d3b86353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72620
Clarify how LoggingTensor works with autograd.
The updated comment should cover the semantic changes.
Test Plan: Imported from OSS
Reviewed By: samdow
Differential Revision: D34214956
Pulled By: albanD
fbshipit-source-id: 730d0a68f4228d2a84758e6807d869a34cbc1b31
(cherry picked from commit 66110bf16b)
Summary:
A number of ROCm tests were skipped via the skipCUDAIfRocm flag.
The majority of these test cases are now supported on the ROCm platform. This fix enables all of the test_ops tests for ROCm and enables most operators in common_methods_invocations.py, except for the SpectralFuncInfo class, which still has some FFT issues.
Partially Fixes https://github.com/pytorch/pytorch/issues/51303
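Roughly, the skip pattern being removed looks like the hedged stand-in below, which uses plain `unittest` and `torch.version.hip` instead of PyTorch's internal `skipCUDAIfRocm` decorator; the test bodies are made up:
```
import unittest
import torch

IS_ROCM = torch.version.hip is not None

class TestExample(unittest.TestCase):
    # Before: skipped on ROCm builds (roughly what skipCUDAIfRocm does).
    @unittest.skipIf(IS_ROCM, "does not work on ROCm")
    def test_add_old(self):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        torch.testing.assert_close(torch.ones(2, device=device) + 1,
                                   torch.full((2,), 2.0, device=device))

    # After this change: the decorator is simply dropped, so ROCm runs it too.
    def test_add_new(self):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        torch.testing.assert_close(torch.ones(2, device=device) + 1,
                                   torch.full((2,), 2.0, device=device))
```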
cc jeffdaily sunway513 jithunnair-amd ROCmSupport KyleCZH amathews-amd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67706
Reviewed By: seemethere, janeyx99
Differential Revision: D34153457
Pulled By: malfet
fbshipit-source-id: 95f4420f306ca7580cd438d3b5cc0b24efbfae99
(cherry picked from commit 0d178fffd3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72491
Retry of #71794; the base revision in that stack had been reverted.
Test Plan: Imported from OSS
Reviewed By: mikaylagawarecki
Differential Revision: D34062260
Pulled By: davidberard98
fbshipit-source-id: 40fbb2d2de3b10000645e25e7fe89f3ce929f0a2
(cherry picked from commit 917676f076)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72499
Pull Request resolved: https://github.com/pytorch/benchmark/pull/740
Move fx2trt out of tree to reduce the bloat of PyTorch core.
This is the first and major step. Next, we will move acc_tracer out of the tree and rearrange some FX passes.
Reviewed By: suo
Differential Revision: D34065866
fbshipit-source-id: c72b7ad752d0706abd9a63caeef48430e85ec56d
(cherry picked from commit 91647adbca)
Summary:
This PR ports the `index_copy` implementation to structured kernels and adds an `out` variant.
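For reference, a minimal usage sketch of the functional op and the new `out=` variant (shapes and values below are illustrative only):
```
import torch

x = torch.zeros(5, 3)
index = torch.tensor([0, 4, 2])
source = torch.arange(9, dtype=torch.float).reshape(3, 3)

y = torch.index_copy(x, 0, index, source)        # functional variant

out = torch.empty_like(x)
torch.index_copy(x, 0, index, source, out=out)   # new out= variant
torch.testing.assert_close(out, y)
```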
~Note to the reviewers: This is in draft mode, waiting for the tests from the CI, and I'll give a final look before requesting the review.~
Issue tracker: https://github.com/pytorch/pytorch/issues/55070
cc: bdhirsh ysiraichi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67329
Reviewed By: ejguan
Differential Revision: D34077219
Pulled By: bdhirsh
fbshipit-source-id: 6accda33957f654b753261c5c3d765a27a64d2c0
(cherry picked from commit f3ac83217a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71794
mvlgamma(inp, p) requires that all the elements of inp are > (p-1)/2.
The OpInfo test was occasionally producing inputs with elements == (p-1)/2, which would generate errors like:
```
ERROR: test_nnc_correctness_mvlgamma_mvlgamma_p_5_cpu_bfloat16 (__main__.TestNNCOpInfoCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/path/pytorch/torch/testing/_internal/common_device_type.py", line 381, in instantiated_test
raise rte
File "/path/pytorch/torch/testing/_internal/common_device_type.py", line 376, in instantiated_test
result = test(self, **param_kwargs)
File "/path/pytorch/torch/testing/_internal/common_device_type.py", line 753, in test_wrapper
return test(*args, **kwargs)
File "/path/pytorch/torch/testing/_internal/common_device_type.py", line 907, in only_fn
return fn(slf, *args, **kwargs)
File "/path/pytorch/test/test_jit_fuser_te.py", line 2293, in test_nnc_correctness
ref = variant(*clone_inputs((sample.input, *sample.args)), **sample.kwargs)
RuntimeError: All elements must be greater than (p-1)/2
```
repro example: https://gist.github.com/davidberard98/9da688e31cdfbaed7e990746b28a4ba2
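As a rough sketch (not the actual OpInfo sampling code; the 0.1 margin is an arbitrary illustration), the sampler needs to draw elements strictly above the bound:
```
import torch
from torch.testing import make_tensor

p = 5
low = (p - 1) / 2
# Sampling strictly above (p-1)/2 avoids the boundary case that triggers the error.
inp = make_tensor(3, 4, dtype=torch.float64, device="cpu", low=low + 0.1, high=low + 5)
out = torch.mvlgamma(inp, p)   # every element satisfies inp > (p-1)/2
```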
Test Plan: Imported from OSS
Reviewed By: qihqi
Differential Revision: D33780905
Pulled By: davidberard98
fbshipit-source-id: c9afd443bc90ce68f33b97498921b447e4f7d1d8
(cherry picked from commit a974b03f07)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70465
These tests check that:
(a) the result after NNC fusion (of a single op) is the same as the result of the
unfused op, and
(b) for certain ops where fusion is expected to occur, fusion does
actually occur.
A rough sketch of check (a) is included below.
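This is a minimal, hedged sketch rather than the actual test harness; the op, input size, and warm-up count are arbitrary choices:
```
import torch

def f(x):
    return torch.sigmoid(x) * x

x = torch.randn(64, 64)
eager = f(x)                 # unfused reference

scripted = torch.jit.script(f)
for _ in range(3):           # the profiling executor needs warm-up runs before fusing
    fused = scripted(x)

# (a) the (potentially) fused result matches the unfused result.
torch.testing.assert_close(fused, eager)
# (b) the real tests additionally inspect the optimized graph for
#     prim::TensorExprGroup nodes to confirm that fusion actually happened.
```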
Test Plan: Imported from OSS
Reviewed By: wenleix
Differential Revision: D33595240
Pulled By: davidberard98
fbshipit-source-id: e2e17a921bc30c313e92e8e5bbc6c1b5fcd14bc1
(cherry picked from commit b1ba221acc)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72348
**Overview**
#43307 changed `_test_accumulate_gradients_no_sync()` to add a `num_iters` argument. However, I think the change misconstrued the test logic slightly.
61ab04e1db/torch/testing/_internal/distributed/distributed_test.py (L4369-L4397)
- `iteration % num_iters == 0` evaluates to `True` only for `iteration == 0` since `iteration` comes from `for iteration in range(num_iters)`.
- IIUC, the intention is to alternate between accumulating gradients (using `no_sync()`) and synchronizing gradients normally. In the existing implementation, any iterations following the second one are non-productive since gradients are already in sync, meaning it reduces to testing normal DDP.
- This PR changes the check back to `iteration % 2 == 0` to restore the alternating behavior (see the sketch below).
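A minimal sketch of the alternating pattern the test is meant to exercise (the model, optimizer, and batches are hypothetical placeholders; `no_sync()` is the documented `DistributedDataParallel` context manager):
```
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def train_alternating(ddp_model: DDP, optimizer, batches):
    for iteration, (inp, target) in enumerate(batches):
        if iteration % 2 == 0:
            # Even iterations: accumulate gradients locally, no allreduce.
            with ddp_model.no_sync():
                F.mse_loss(ddp_model(inp), target).backward()
        else:
            # Odd iterations: normal backward; DDP allreduces the accumulated grads.
            F.mse_loss(ddp_model(inp), target).backward()
            optimizer.step()
            optimizer.zero_grad()
```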
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34011559
Pulled By: awgu
fbshipit-source-id: 4ba771e45b28a343167a324462571e4b8e25ae72
(cherry picked from commit 8492a8b803)
Summary:
The rest of the tests in the CUDA test suite are skipped after GPU context corruption is encountered.
For tests decorated with `expectedFailure`, this creates the false impression that the entire test suite is passing.
Remedy this by suppressing the exception and printing a warning about the unexpected success if `should_stop_early` is true.
Also print a warning when this happens (to make attribution easier), as well as when the early-termination condition is first detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72016
Test Plan:
`python test_ops.py -v -k test_fn_fwgrad_bwgrad_gradient`
Before the change:
```
test_fn_fwgrad_bwgrad_gradient_cpu_complex128 (__main__.TestGradientsCPU) ... ok
test_fn_fwgrad_bwgrad_gradient_cpu_float64 (__main__.TestGradientsCPU) ... ok
test_fn_fwgrad_bwgrad_gradient_cuda_complex128 (__main__.TestGradientsCUDA) ... expected failure
----------------------------------------------------------------------
Ran 3 tests in 0.585s
OK (expected failures=1)
```
After the change:
```
test_fn_fwgrad_bwgrad_gradient_cpu_complex128 (__main__.TestGradientsCPU) ... ok
test_fn_fwgrad_bwgrad_gradient_cpu_float64 (__main__.TestGradientsCPU) ... ok
test_fn_fwgrad_bwgrad_gradient_cuda_complex128 (__main__.TestGradientsCUDA) ... /home/conda/miniconda3/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:1670: UserWarning: TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failed with CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
warn(f"TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failed with {rte}")
/home/conda/miniconda3/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py:382: UserWarning: Suppressed expected failure that resulted in fatal error
warn("Suppressed expected failure that resulted in fatal error")
unexpected success
----------------------------------------------------------------------
Ran 3 tests in 0.595s
FAILED (unexpected successes=1)
```
And `stderr` from XML file contains requested info:
```
/home/conda/miniconda3/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:1670: UserWarning: TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failed with CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
warn(f"TEST SUITE EARLY TERMINATION due to torch.cuda.synchronize() failed with {rte}")
/home/conda/miniconda3/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py:382: UserWarning: Suppressed expected failure that resulted in fatal error
warn("Suppressed expected failure that resulted in fatal error")
```
Fixes https://github.com/pytorch/pytorch/issues/71973
Reviewed By: janeyx99, ngimel
Differential Revision: D33854287
Pulled By: malfet
fbshipit-source-id: dd0f5a4d2fcd21ebb7ee50ce4ec4914405a812d0
(cherry picked from commit 0c0baf3931)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69735
We want to build a prototype of Megatron-LM so that we can apply PT-D ops to models like transformers and other Meta flagship models.
The basic idea of Megatron-LM is as follows:
1. Col-wise sharding of linear weight. Perform the linear op for the first layer.
2. Perform a math op (optional), such as ReLU or GeLU. We use GeLU in our example unit test. The input is from step 1.
3. Row-wise sharding of linear weight. Perform the linear op for the second layer. The input is from step 2.
This saves the communication otherwise needed to concatenate the col-wise sharding results and to spread the input to different ranks for row-wise sharding. A single-process sketch of this math is included below.
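Below is a hedged, single-process illustration of the sharding math using plain tensors (not the PT-D sharding APIs; the names, shapes, and two-way split are made up):
```
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 16, dtype=torch.double)    # input activations
w1 = torch.randn(32, 16, dtype=torch.double)  # first linear weight
w2 = torch.randn(16, 32, dtype=torch.double)  # second linear weight

# Step 1: col-wise shard w1 across two "ranks"; each shard computes half the outputs.
h_shards = [F.linear(x, w) for w in w1.chunk(2, dim=0)]

# Step 2: GeLU is applied independently per shard, so no communication is needed.
h_shards = [F.gelu(h) for h in h_shards]

# Step 3: row-wise shard w2; each rank produces a *partial* result that must be
# summed across ranks (this is what the PartialTensor abstraction represents).
partials = [F.linear(h, w) for h, w in zip(h_shards, w2.chunk(2, dim=1))]
out = sum(partials)                            # the merge/aggregate (reshard) step

# The sharded computation matches the unsharded reference.
ref = F.linear(F.gelu(F.linear(x, w1)), w2)
torch.testing.assert_close(out, ref)
```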
The changes are as follows:
1. Return a ShardedTensor for the col-wise sharding in the sharded_linear op.
2. Return a `PartialTensor` for the row-wise sharding in the sharded_linear op.
3. Leverage APIs already defined for `reshard` to merge/aggregate local results into a fully synced local result if needed.
4. Add helper function to create sharded tensor based on the local result.
5. Add a unit test to test the Megatron-LM idea mentioned above and compare with local ops, including the grad and optimizer so that we can ensure the correctness of the implementation.
6. Refactor the unit test of sharded linear to reflect the changes in the code.
ghstack-source-id: 148273049
Test Plan: Unit test + CI
Reviewed By: pritamdamania87
Differential Revision: D32978221
fbshipit-source-id: 565fc92e7807e19d53b0261f8ace3945bef69e3e
(cherry picked from commit 344abe7520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70079
We defined a new concept named `PartialTensor`, which is an abstraction to represent Tensors that need aggregation across multiple devices and multiple processes.
We also defined an API `reshard_output` to reshard a `PartialTensor` to a `Tensor`, or a `ShardedTensor` to a `ShardedTensor`/`Tensor`. This is done via the class `ModuleResharder`, which acts as a wrapper around the original module plus a reshard in the final step.
The `reshard` logic is defined in each class (`ShardedTensor` and `PartialTensor`).
ghstack-source-id: 148273050
Test Plan: Unit test is in the next PR.
Reviewed By: pritamdamania87
Differential Revision: D33121037
fbshipit-source-id: 5f56617ea526b857c5b73df6e069697d428ec359
(cherry picked from commit 58b1457cbc)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72084
Make the FSDP folder public.
ghstack-source-id: 148173447
Test Plan: unit tests
Reviewed By: mrshenli
Differential Revision: D33903417
fbshipit-source-id: 7852a2adc4af09af48a5ffa52ebf210489f834d5
(cherry picked from commit bd06513cfe)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72141
We have many sharding components currently:
torch.distributed._sharded_tensor, torch.distributed._sharding_spec,
torch.distributed._sharded_optimizer and more coming.
As a result, this change organizes all of this under the `torch.distributed._shard`
package. For BC reasons, I'm still keeping the old packages and having them just
reference the new package.
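A rough sketch of the BC shim pattern being described (the file path and warning text are illustrative, not the exact code):
```
# torch/distributed/_sharded_tensor/__init__.py  (old location, kept for BC)
import warnings

warnings.warn(
    "torch.distributed._sharded_tensor has moved to "
    "torch.distributed._shard.sharded_tensor; please update your imports.",
    DeprecationWarning,
)
# Re-export everything from the new package so old imports keep working.
from torch.distributed._shard.sharded_tensor import *  # noqa: F401,F403
```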
ghstack-source-id: 148150861
Test Plan: waitforbuildbot
Reviewed By: fduwjj
Differential Revision: D33904585
fbshipit-source-id: 057e847eb7521b536a3ee4e0f94871aacc752062
(cherry picked from commit 29a70dd7af)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71427
This commit adds a lowering path for the LinearReLU modules
in static quantization mode. This includes torch.nn.qat.Linear,
torch.nn.intrinsic.LinearReLU, and torch.nn.intrinsic.qat.LinearReLU.
Future commits will add support for dynamic quantization and functional
LinearReLU.
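For context, here is a hedged sketch of the same fused LinearReLU module in the stable eager-mode quantization workflow (the commit itself targets the FX graph mode lowering path; the model and shapes below are arbitrary):
```
import torch
import torch.nn as nn
import torch.quantization as tq

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.linear = nn.Linear(8, 8)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.linear(self.quant(x))))

m = M().eval()
m.qconfig = tq.get_default_qconfig("fbgemm")
fused = tq.fuse_modules(m, [["linear", "relu"]])  # Linear + ReLU -> nn.intrinsic.LinearReLU
prepared = tq.prepare(fused)
prepared(torch.randn(2, 8))                       # calibrate observers
quantized = tq.convert(prepared)                  # lowered to a fused quantized LinearReLU
```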
Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_linear_module
Imported from OSS
Reviewed By: george-qi
Differential Revision: D33694742
fbshipit-source-id: 19af11f82b1ad8ade0c307498971c29a3f776036
(cherry picked from commit b3f607de43)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71992
This reverts commit b7222e15b6.
We are conservatively reverting this because it broke a test in functorch.
The original PR added a `_max_pool1d_cpu` operator. I'm not sure if it
is actually safe to revert this due to the addition of the new operator
(someone may have serialized it between now and then) but because it has
only been two weeks this should be fine.
Test Plan: - wait for tests
Reviewed By: jbschlosser, VitalyFedyunin
Differential Revision: D33882918
Pulled By: zou3519
fbshipit-source-id: f146e82e6b46690376b3d8825dc7f7da62e2c7de
(cherry picked from commit 1606333e6c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71742
We have many sharding components currently:
torch.distributed._sharded_tensor, torch.distributed._sharding_spec,
torch.distributed._sharded_optimizer and more coming.
As a result, this change organizes all of this under the `torch.distributed.shard`
package. For BC reasons, I'm still keeping the old packages and having them just
reference the new package.
ghstack-source-id: 147899768
Test Plan: waitforbuildbot
Reviewed By: fduwjj, wanchaol
Differential Revision: D33755913
fbshipit-source-id: dc692b31e2607063d55dfcb3db33ec53961d5a5b
(cherry picked from commit 5b6885f358)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67996
This is necessary for most matrix decompositions in `linalg`.
cc mruberry
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D33774418
Pulled By: mruberry
fbshipit-source-id: 576f2dda9d484808b4acf0621514c0ffe26834e6
(cherry picked from commit fb07c50aa9)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68522
Some OpInfos were inadvertently generating samples with a `grad_fn`, for
example when using functions like `transpose()` or `conj()` on the
inputs to generate transposed or conjugated inputs. This PR corrects
this and deactivates the tracking of gradients in all the sampling
functions.
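A minimal illustration of the problem and of generating samples without gradient tracking (not the actual OpInfo code):
```
import torch

base = torch.randn(3, 4, requires_grad=True)

# The problematic pattern: the transpose is recorded by autograd, so the
# generated sample already carries a grad_fn and is not a leaf tensor.
bad = base.transpose(0, 1)
assert bad.grad_fn is not None

# Generating the sample with gradient tracking disabled yields a clean leaf.
with torch.no_grad():
    good = base.transpose(0, 1)
assert good.grad_fn is None and good.is_leaf
good.requires_grad_(True)   # the sample itself can still require grad afterwards
```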
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D33774420
Pulled By: mruberry
fbshipit-source-id: da0e6189a2d67a2cb0fd458054558d36dbad9b61
(cherry picked from commit 42b0870774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69909
This test detected a number of sampling methods that were not generating
the samples as expected, e.g. `index_put`, `cosine_embedding`, `stft`, but
perhaps most notably the generator for `BinOps`.
It also detected that `remainder` and `fmod` did not have the backward
formula for the second input implemented. I added this in the previous PR.
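As a rough illustration of what a backward formula for the second input enables (this is not the test from the PR; the values are chosen away from the discontinuities of `fmod` so that gradcheck's finite differences are well behaved):
```
import torch

x = torch.tensor([3.3, -2.7, 5.9], dtype=torch.double)
y = torch.tensor([1.7, 1.3, 2.1], dtype=torch.double, requires_grad=True)

# With the derivative w.r.t. the second input defined, gradcheck can verify it.
torch.autograd.gradcheck(lambda y: torch.fmod(x, y), (y,))
```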
Test Plan: Imported from OSS
Reviewed By: anjali411
Differential Revision: D33774422
Pulled By: mruberry
fbshipit-source-id: 76cfc75b1fdfd72ee64aa524665f83a75fe52509
(cherry picked from commit 13ea7b436b)