This test (8340762211/test/distributed/test_multi_threaded_pg.py (L133)) is failing on the internal sandbox with the following error message:
```
File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
exiting thread 1
ERROR
```
Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0
We believe this is because we no longer perform a barrier after init (see https://github.com/pytorch/pytorch/pull/99937).
This PR temporarily turns `TORCH_DIST_INIT_BARRIER` back on to avoid the flaky test for the time being, but we should look into a proper way to handle this.
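The race described above can be reproduced with plain threads. This is a minimal sketch, not the code in `multi_threaded_pg.py`: `run_world`, `run_rank`, and the `registered` set are hypothetical stand-ins for the per-thread process group registration. Each rank registers itself and then checks that the whole world is present; waiting on a barrier between the two steps removes the race.

```python
import threading


def run_world(world_size, use_barrier):
    """Run one thread per rank and return the 'world not ready' errors seen."""
    registered = set()          # hypothetical stand-in for the PG registry
    lock = threading.Lock()
    barrier = threading.Barrier(world_size)
    errors = []

    def run_rank(rank):
        # Step 1: "init" registers this rank.
        with lock:
            registered.add(rank)
        # Step 2 (optional): wait until every rank has finished init.
        if use_barrier:
            barrier.wait()
        # Step 3: starting a collective requires the full world.
        with lock:
            count = len(registered)
        if count != world_size:
            errors.append(
                f"rank {rank}: world not ready, only {count} of {world_size} registered"
            )

    threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

With `use_barrier=True` the returned list is always empty. With `use_barrier=False` the result depends on thread scheduling, which is the kind of flakiness reported here.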
cc. @kumpera @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568
Approved by: https://github.com/H-Huang
Currently, the test
```
pytest test/distributed/test_multi_threaded_pg.py -vs
```
fails with the following error:
```
Traceback (most recent call last):
File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/private/home/howardhuang/pytorch-projects/pytorch/torch/testing/_internal/common_distributed.py", line 1029, in _run
self._tls.precision = TestCase._precision
AttributeError: 'TestCollectivesWithBaseClass' object has no attribute '_tls'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93883
Approved by: https://github.com/awgu, https://github.com/wanchaol
The time difference between using MultiThreadedTestCase and MultiProcessTestCase on the op db tests is amazing!
Using MultiThreadedTestCase on an AWS dev node:
```
time pytest test/distributed/_tensor/test_dtensor_ops.py
============= 175 passed, 42 skipped, 397 xfailed in 80.30s (0:01:20) =======
real 1m22.330s
user 1m38.782s
sys 0m18.762s
```
MultiProcessTestCase takes from 40 minutes to more than 1 hour, even when using pytest parallel testing tools.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92198
Approved by: https://github.com/XilunWu
This PR is a full rewrite of MultiThreadedTestCase that aligns it more
closely with MultiProcessTestCase. It also changes how threads are spawned
and how tests are run, so that we can embed thread-local state when running
tests.
This PR also enables device_type tests to work with MultiThreadedTestCase.
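As a rough illustration of the idea (not the actual MultiThreadedTestCase code), the sketch below runs one test body per rank on its own thread and stores the rank in thread-local state. The names `run_on_threads`, `current_rank`, and `_tls` are hypothetical.

```python
import threading

_tls = threading.local()  # per-thread state: each "rank" sees its own values


def current_rank():
    """Return the rank stored for the calling thread."""
    return _tls.rank


def run_on_threads(test_fn, world_size):
    """Run test_fn once per rank, each on its own thread, and collect results."""
    results = {}
    failures = {}

    def worker(rank):
        _tls.rank = rank  # embed thread-local state before running the test body
        try:
            results[rank] = test_fn()
        except Exception as exc:  # record the failure instead of losing it in the thread
            failures[rank] = exc

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    if failures:
        rank = min(failures)
        raise RuntimeError(f"rank {rank} failed") from failures[rank]
    return results
```

For example, `run_on_threads(lambda: current_rank() * 10, 4)` gives each of the four threads its own rank without any subprocess startup cost.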
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91650
Approved by: https://github.com/XilunWu
This PR includes:
Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding a multi-threaded FileSystemWriter for distributed checkpointing, which adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead. This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
Add parametrize tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py
Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
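The `@with_comms` change matters because parametrized tests receive extra arguments, which the wrapper has to forward. A minimal sketch of such a decorator follows; it is not the code in ShardedTensorTestBase, and `FakeTest`, `init_comms`, and `destroy_comms` are hypothetical placeholders for the real setup and teardown.

```python
import functools


def with_comms(func):
    """Set up comms, run the test with whatever arguments it was given, tear down."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self.init_comms()
        try:
            # Forward *args/**kwargs so parametrized tests keep working.
            return func(self, *args, **kwargs)
        finally:
            self.destroy_comms()

    return wrapper


class FakeTest:
    """Hypothetical test class that records the order of calls."""

    def __init__(self):
        self.events = []

    def init_comms(self):
        self.events.append("init")

    def destroy_comms(self):
        self.events.append("destroy")

    @with_comms
    def test_save(self, thread_count=1):
        self.events.append(f"run thread_count={thread_count}")
        return thread_count
```

A wrapper declared as `wrapper(self)` would raise a `TypeError` as soon as a parametrized value such as `thread_count` was passed in.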
Tests:
```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```
test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but times out on CI. We will use a thread-based PG and update this test in a following PR.
[T134844615]
## Add docstring and update comments in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
Add sequence number support for UCC, mostly following the format of ProcessGroupNCCL.
Pass new test: `test_all_gather_object_subgroup`
Add skips for gather tests: `test_gather_object` and `test_gather_object_subgroup`
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85047
Approved by: https://github.com/kwen2501
Move a bunch of globals to instance methods and replace all uses of them.
We move all PG-related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.
This almost gets DDP working; the PG is still missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and having the default _World wrap them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
Move a bunch of globals to instance methods and replace all uses of them.
We move all PG-related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how c10d
state behaves.
One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.
This almost gets DDP working; the PG is still missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
- [x] Direct dependency on UCX is completely removed, UCC active set API always enabled
- [x] Remove `TORCH_UCC_PROFILING_ENABLE`, always enable profiling
- [x] Fixes profiling of `recv` and `all_gather`
- [x] Use the NCCL TL of UCC on CUDA, as the UCP TL is not well supported on CUDA
Most tests are passing, but there are a few skipped tests:
- `scatter` and `gather` are not supported by the UCP TL of UCC on CPU tensors
- A few flaky tests in PyTorch's CI environment
- Profiler-related failures, some of them will be fixed by @Fuzzkatt in https://github.com/pytorch/pytorch/pull/84368
After this PR is merged, I will continue to work on these skipped failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83285
Approved by: https://github.com/vtlam, https://github.com/malfet, https://github.com/kwen2501
When running on clusters, the startup time for the subprocesses might be much higher, which leads to spurious failures.
So increase the timeout to 300s, similar to torch/testing/_internal/distributed/distributed_test.py.
Also introduce `DISTRIBUTED_TESTS_DEFAULT_TIMEOUT`, as suggested by @malfet in #55896.
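A configurable default like this is commonly read from the environment with a fallback. The sketch below assumes `DISTRIBUTED_TESTS_DEFAULT_TIMEOUT` is an environment variable holding a number of seconds; the helper name `default_timeout` is hypothetical and this is not the code from the PR.

```python
import os


def default_timeout(env=None):
    """Return the test timeout in seconds, defaulting to 300 when unset."""
    env = os.environ if env is None else env
    # Assumption: the variable holds an integer number of seconds.
    return int(env.get("DISTRIBUTED_TESTS_DEFAULT_TIMEOUT", "300"))
```

On a slow cluster the value could then be raised without touching the test code, for example by exporting a larger number before invoking the test runner.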
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80330
Approved by: https://github.com/malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75753
As per the design in https://github.com/pytorch/pytorch/issues/72138,
convert DDP parameters to ReplicatedTensor during its forward pass. Concretely,
this is done as follows:
1) Create a separate `_replicated_tensor_module` which is a copy of self.module
without creating copies of the Tensors themselves.
2) Use `_replicated_tensor_module` instead of `self.module` during the forward
pass.
3) Have a context manager `_ddp_replicated_tensor` to enable this, since
certain edge cases can fail where self.module is changed out of band resulting
in discrepancy between self.module and `_replicated_tensor_module`.
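The three steps can be sketched without PyTorch. The code below is an illustration under stated assumptions, not the DDP implementation: `TinyModule` and `TinyDDP` are hypothetical, plain lists stand in for tensors, the conversion to ReplicatedTensor is omitted, and the context manager is modelled as a method even though the real `_ddp_replicated_tensor` may be defined differently.

```python
import copy
from contextlib import contextmanager


class TinyModule:
    """Hypothetical module; lists stand in for parameter tensors."""

    def __init__(self):
        self.params = {"weight": [1.0, 2.0]}

    def forward(self, x):
        return [w * x for w in self.params["weight"]]


def copy_without_tensor_copies(module):
    """Step 1: new module object and new container, same parameter objects."""
    replica = copy.copy(module)
    replica.params = dict(module.params)
    return replica


class TinyDDP:
    def __init__(self, module):
        self.module = module
        self._replicated_tensor_module = copy_without_tensor_copies(module)
        self._use_replicated = False

    @contextmanager
    def _ddp_replicated_tensor(self, enabled=True):
        """Step 3: opt in to the replicated module for the duration of the block."""
        previous = self._use_replicated
        self._use_replicated = enabled
        try:
            yield
        finally:
            self._use_replicated = previous

    def forward(self, x):
        # Step 2: use the replicated module during forward when enabled.
        target = self._replicated_tensor_module if self._use_replicated else self.module
        return target.forward(x)
```

The sketch also shows the edge case mentioned above: if `self.module` is reassigned after construction, `_replicated_tensor_module` still points at the old module, so the two diverge unless the copy is rebuilt.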
Differential Revision: [D35533736](https://our.internmc.facebook.com/intern/diff/D35533736/)
Approved by: https://github.com/wanchaol, https://github.com/rohan-varma
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842
**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.
The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
- For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
- The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
- A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
- The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.
- There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
- This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
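The device mismatch behind the flaky tests, and the CPU fallback used for the Gloo-based test, can be captured in a small helper. This is a sketch only: `pick_device` is a hypothetical function that takes the CUDA facts as plain arguments instead of calling `torch.cuda`, and it is not the logic from the test file.

```python
def pick_device(rank, world_size, cuda_available, device_count):
    """Return a device string for `rank`, falling back to CPU when GPUs are scarce."""
    # Only use GPUs when every rank can get its own device; otherwise rank 1
    # on a single-GPU machine would ask for a device that does not exist.
    if cuda_available and device_count >= world_size:
        return f"cuda:{rank}"
    return "cpu"
```

The problematic case from the description is `world_size == 2` with `device_count == 1`: an unguarded per-rank device would fail on rank 1, whereas this helper returns `"cpu"` for both ranks.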
**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34675709
Pulled By: awgu
fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
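To show the shape of the three functions, here is a stand-alone sketch that mirrors their names on a small class. It is not the torch.distributed implementation: `DebugState` is hypothetical, the real functions are module-level, and the case-insensitive parsing is an assumption of this sketch. The level names OFF, INFO, and DETAIL are the documented values of `TORCH_DISTRIBUTED_DEBUG`.

```python
import os
from enum import IntEnum


class DebugLevel(IntEnum):
    OFF = 0
    INFO = 1
    DETAIL = 2


class DebugState:
    """Hypothetical holder for the process-wide debug level."""

    def __init__(self):
        self._level = DebugLevel.OFF

    def get_debug_level(self):
        return self._level

    def set_debug_level(self, level):
        self._level = DebugLevel(level)

    def set_debug_level_from_env(self, env=None):
        env = os.environ if env is None else env
        name = env.get("TORCH_DISTRIBUTED_DEBUG", "OFF").strip().upper()
        try:
            self._level = DebugLevel[name]
        except KeyError:
            raise ValueError(f"invalid TORCH_DISTRIBUTED_DEBUG value: {name!r}") from None
```

The point of the new APIs is the same as in the sketch: the level is no longer fixed at process start and can be read or changed later.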
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)