Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`
In particular it will not try to start the test (causing forks into subprocess) just to stop them (killing the subprocess) which is done in the test setup
Using `unittest.skip` decorators avoids the starting of the test in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU.
We could enable Intel GPU with following methods and try the best to keep the original code styles:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- enabled XPU for some test path
- Unify some common code under torch/testing/_internal for multiple backend, for example:
- requires_nccl_version
- _dynamo_dist_per_rank_init
- DynamoDistributedSingleProcTestCase
- DistTestCases
- FSDPTestMultiThread
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158533
Approved by: https://github.com/guangyey, https://github.com/d4l3k
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`
In particular it will not try to start the test (causing forks into subprocess) just to stop them (killing the subprocess) which is done in the test setup
Using `unittest.skip` decorators avoids the starting of the test in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
# Motivation
This pr is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types.
It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758(https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784)
This PR contains two types of changes, the first is to the common distributed folder where we have added a new class derived from MultiProcessTestCase which helps abstracts out the process group creation /deletion and other functionality for a given device.
The new generalized content can be added by deriving from this base class.
Also includes other misc changes for gaudi support
The second changed file is test_functional_api. a test file in common distributed. This file is a POC for how we can use this new class to write more device agnostic distributed test cases.
The following changes have been made to test_functional_api.py:
-Functionality has been added to test for non cuda devices using intel HPU as an example
-Multiple set up steps previously required by MultiProcessTestCase have been abstracted out
-Misc adaptations to allow for general call to accelerators while adding test skips instead explicitly skipping for multiple GPUs
-Skipifhpu flags have been added to enable skipping a few Multithreaded test cases which are as yet not supported on HPUs
NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator.
I will be raising a separate PR to improve usability pf said decorators in a device agnostic setting in the manner suggested by @kwen2501 in a comment on this PR.
This pr is a cleaned up version of a previous PR(#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123, #127124
This adds a templated version of the ring attention forwards function as well as tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR.
This templating is also a POC of how to support other attention ops such as Jagged/nested tensor and as well how to implement striped attention in a scalable way.
Misc changes:
* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)
Test plan:
```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
## Summary
After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.
## Test Changes
`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)
`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.
`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777
Approved by: https://github.com/wanchaol
ghstack dependencies: #123776
Fixes#123916
Due to MultiThreadedTestCase we're leaking is_fx_tracing_flag to other tests which causes any dynamo based tests to fail. The test execution order is arbitrary which caused this to not be caught in development.
Test plan:
```sh
pytest --random-order test/distributed/test_functional_api.py -k 'TestMakeFx or test_all_to_all_single_compile_True'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123958
Approved by: https://github.com/yifuwang
This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs.
This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions.
This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering.
To avoid cudaStream issues this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can be alleviated potentially using `compile`.
Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py
Test plan:
```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
The test has been failing sporadically rencetly in CI and the failures
are not reproducible locally, likely due to some nasty race conditional
related a combination of MultiThreadedTestCase, the use of global state
and finalizers, and the recently introduced test decorator for native
funcol migration.
Switching to the test to use MultiProcessTestCase to provide better
isolation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046
Approved by: https://github.com/weifengpy
Additional changes: tests in test_functional_api.py uses multi-threaded pg which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
This PR mimics what we have done to trace ProcessGroup. This allows use to use FakeProcessGroup with torch.compile. FakeProcessGroup allows us to use world_size > 1 without creating multiple processes thus enabling the usage of PDB to debug bucketing DDP allreduce in the Inductor. We can theoretically use GLOO with world_size==1 to achieve the same goal. However, the `wait()` seems to be optimized away when the world_size is 1.
Differential Revision: [D51136463](https://our.internmc.facebook.com/intern/diff/D51136463/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113314
Approved by: https://github.com/wanchaol
We cannot use inner tensors for finalizers as they are uncollective until waited.
This PR adds a bunch of tests for the observable behavior we want, including the
necessary scafold for us to test code for their waitiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab