Commit Graph

115 Commits

Author SHA1 Message Date
Rodrigo Kumpera
3c841163ce [C10D] Implement new libuv backend for TCPStore. (#105870)
The new backend is currently gated behind a 'use_libuv' flag on the TCPStore constructor
to reduce the impact on existing users while we test it.
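
A minimal usage sketch, assuming the flag is exposed as a constructor keyword (host, port, and other arguments are illustrative):
```
from datetime import timedelta

import torch.distributed as dist

# Opt into the new libuv backend via the flag named above (assumed keyword).
store = dist.TCPStore(
    "127.0.0.1",
    29500,
    world_size=1,
    is_master=True,
    timeout=timedelta(seconds=30),
    use_libuv=True,
)
store.set("key", "value")
assert store.get("key") == b"value"
```
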
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870
Approved by: https://github.com/H-Huang
2023-08-17 20:40:32 +00:00
Rohan Varma
4137d6e499 [Composable FSDP] Enable HSDP (#105206)
We need to pass the sharding strategy into _init_process_group_state to enable HSDP
for composable FSDP.
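
A hedged sketch of what this enables, assuming the composable `fully_shard` API forwards a `strategy` keyword to `_init_process_group_state` (requires an initialized process group to actually run):
```
import torch.nn as nn
from torch.distributed._composable import fully_shard
from torch.distributed.fsdp import ShardingStrategy

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
# HYBRID_SHARD: shard within a node, replicate across nodes.
fully_shard(model, strategy=ShardingStrategy.HYBRID_SHARD)
```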

Differential Revision: [D47462394](https://our.internmc.facebook.com/intern/diff/D47462394/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105206
Approved by: https://github.com/awgu, https://github.com/fegin
2023-07-26 21:03:55 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.
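
An illustrative before/after of this fix:
```
d = {"a": 1, "b": 2}

# Before: the value is bound on every iteration but never used
for k, v in d.items():
    print(k)

# After: iterate over the keys only
for k in d:
    print(k)
```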

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow-up in the pyupgrade series, converting more strings to f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`
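
An illustrative before/after of the kind of rewrite flynt performs:
```
name, count = "world", 3

# Before: %-style and .format()-style strings
greeting = "hello, %s" % name
message = "count = {}".format(count)

# After flynt: equivalent f-strings
greeting = f"hello, {name}"
message = f"count = {count}"
```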

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
be03a56955 [BE] Enable ruff's UP rules and autoformat testing/ (#105425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105425
Approved by: https://github.com/malfet
2023-07-18 21:04:39 +00:00
Iris
15eed5b73e [Oncall][MTPG] Fix flaky test multi_threaded - test_broadcast_object_list (#103568)
This test (8340762211/test/distributed/test_multi_threaded_pg.py (L133)) is failing on an internal sandbox with the following error message:
```
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
    raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
 exiting thread 1
ERROR
```

Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0

We believe this is because we no longer perform a barrier after init (see https://github.com/pytorch/pytorch/pull/99937).
This PR temporarily turns ```TORCH_DIST_INIT_BARRIER``` back on to avoid the flaky test for the time being, but we should investigate a proper way to do this.
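
A hedged illustration of the mitigation (the value "1" is assumed to enable the barrier):
```
import os

# Temporarily force the post-init barrier back on.
os.environ["TORCH_DIST_INIT_BARRIER"] = "1"
```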

cc. @kumpera @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568
Approved by: https://github.com/H-Huang
2023-06-18 07:05:28 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow simple generator expressions, which lets us enable rules that replace unnecessary list comprehensions with generators in any/all calls. This was originally part of #99280, but I split it off into this PR so that it can be easily reverted should anything break.
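
An illustrative before/after of the C419 fix:
```
xs = [0, 1, 2]

# Before: the full list is built before any() can short-circuit
ok = any([x > 1 for x in xs])

# After: a generator expression lets any() stop at the first match
ok = any(x > 1 for x in xs)
```
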

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Edward Z. Yang
5a7aad9681 Convert logging f-strings to use % format, part four (#98705)
This part handles multi-line concatenated string literals.
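
An illustrative before/after of the codemod (toy logger, not from the diff):
```
import logging

logger = logging.getLogger(__name__)
n = 3

# Before: the f-string is formatted even when INFO logging is disabled
logger.info(f"processed {n} items")

# After: %-style arguments defer formatting to the logging framework
logger.info("processed %d items", n)
```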

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705
Approved by: https://github.com/voznesenskym
2023-04-11 13:17:59 +00:00
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This part handles multi-line logging strings.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Horace He
5bbec680d7 Fix usages of contextmanager without finally (#96170)
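A minimal sketch of the bug class being fixed (hypothetical helper): without try/finally, an exception in the body skips the cleanup.
```
from contextlib import contextmanager

@contextmanager
def set_flag(obj, value):
    old = obj.flag
    obj.flag = value
    try:
        yield
    finally:
        obj.flag = old  # restored even if the with-block raises
```
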
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96170
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-03-08 20:59:27 +00:00
Sergii Dymchenko
35bf5bac26 Fix "sandcastle_skip_if decorator name is confusing" (#95649)
Fixes https://github.com/pytorch/pytorch/issues/89473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95649
Approved by: https://github.com/atalman, https://github.com/malfet
2023-03-03 09:29:40 +00:00
fduwjj
a88bfc60c7 [2/N][ST deprecate][BE] Remove Replicate Tensor convert from DDP and PTD (#95450)
No remaining use was found for this ShardedTensor/ReplicatedTensor-based DDP. As part of the ShardedTensor migration, let's remove this logic, undoing everything in https://github.com/pytorch/pytorch/pull/75753.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95450
Approved by: https://github.com/wanchaol
2023-02-26 03:03:37 +00:00
Howard Huang
5c7f4534e9 [small] multithreaded-pg guard attr (#93883)
Currently, the test
```
pytest test/distributed/test_multi_threaded_pg.py -vs
```

fails with

```
Traceback (most recent call last):
  File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/private/home/howardhuang/pytorch-projects/pytorch/torch/testing/_internal/common_distributed.py", line 1029, in _run
    self._tls.precision = TestCase._precision
AttributeError: 'TestCollectivesWithBaseClass' object has no attribute '_tls'
```
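A hedged sketch of the guard the title describes (attribute names taken from the traceback above):
```
# Only touch thread-local state if the test class actually initialized it.
if hasattr(self, "_tls"):
    self._tls.precision = TestCase._precision
```
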
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93883
Approved by: https://github.com/awgu, https://github.com/wanchaol
2023-02-03 23:01:02 +00:00
Xilun Wu
966030f7c7 [DTensor][fix] MultiThreadedTestCase misses _tls object and it won't reflect in CI (#93832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93832
Approved by: https://github.com/wanchaol
2023-02-02 07:56:44 +00:00
Will Constable
ac791bddce Refactor dynamo distributed test helpers to be reusable (#93187)
The point is to let test helpers previously defined and used in `test_dynamo_distributed.py` be reused from a new file, `test_traceable_collectives.py`, later in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93187
Approved by: https://github.com/kumpera
2023-02-01 06:09:42 +00:00
Wanchao Liang
06d54b4061 [threaded_pg] fix the comments of MultiThreadTestCase (#92373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92373
Approved by: https://github.com/wz337
2023-01-19 03:42:54 +00:00
Wanchao Liang
801d831d7a [dtensor] enable op db tests by using multithreaded test case (#92198)
The time comparison between using MultiThreadedTestCase and MultiProcessTestCase on the op db tests is amazing!

using MultiThreadedTestCase on an AWS dev node:
```
time pytest test/distributed/_tensor/test_dtensor_ops.py

============= 175 passed, 42 skipped, 397 xfailed in 80.30s (0:01:20) =======

real    1m22.330s
user    1m38.782s
sys     0m18.762s
```
MultiProcessTestCase takes from 40 minutes to more than an hour, even when using pytest parallel testing tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92198
Approved by: https://github.com/XilunWu
2023-01-17 03:26:38 +00:00
Wanchao Liang
e16979c9a0 [threaded_pg] full rewrite of MultiThreadedTestCase to enable device_type tests (#91650)
This PR fully rewrites MultiThreadedTestCase to make it more
aligned with MultiProcessTestCase. It also changes how spawning
and testing are done, so that we can embed thread-local state when running
tests.

This PR enables device_type tests to work with MultiThreadedTestCase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91650
Approved by: https://github.com/XilunWu
2023-01-17 03:26:36 +00:00
Xilun Wu
a6dcebf997 [threaded pg] make exception handling consistent with MultiProcessTestCase (#90712)
Differential Revision: [D42153661](https://our.internmc.facebook.com/intern/diff/D42153661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90712
Approved by: https://github.com/wanchaol
2022-12-20 23:37:40 +00:00
Xilun Wu
34da446072 [threaded pg] add assertion util to MultiThreadedTestCase (#90595)
Differential Revision: [D42153662](https://our.internmc.facebook.com/intern/diff/D42153662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90595
Approved by: https://github.com/wanchaol
2022-12-20 23:37:40 +00:00
Shen Li
80542add73 [FSDP] Allow MixedPrecision to skip inputs (#90620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90620
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-12-11 06:39:38 +00:00
Shen Li
082450609c [FSDP] Allow nested FSDP wrapper to use different mixed precision (#90523)
The main change is to move the `args` and `kwargs` dtype conversion
from `_root_pre_forward` to `_pre_forward`, so that every
FSDP instance has a chance to apply its own precision.
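
A hedged sketch of what this permits (requires an initialized process group; dtypes illustrative):
```
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
# The inner wrapper casts inputs in its own _pre_forward, independent of the
# outer fp16 policy.
model[0] = FSDP(model[0], mixed_precision=MixedPrecision(param_dtype=torch.bfloat16))
model = FSDP(model, mixed_precision=MixedPrecision(param_dtype=torch.float16))
```
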
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90523
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2022-12-09 20:06:05 +00:00
Xilun Wu
3759777edc [threaded PG] fix long hang issue in testing (#90515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90515
Approved by: https://github.com/wanchaol
2022-12-09 05:24:08 +00:00
Iris
0cc0e5ef65 [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests (#87987)
This PR includes:

- Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adding a MultiThreaded FileSystemWriter for distributed checkpointing, which adds two knobs to FileSystemWriter: thread_count and per_thread_copy_ahead (see the sketch after this list). This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
- Add parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py
- Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
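
A hedged sketch of the two knobs (values illustrative; at the time of this PR the module lived under torch.distributed._shard.checkpoint):
```
from torch.distributed.checkpoint import FileSystemWriter

writer = FileSystemWriter(
    "/tmp/ckpt",
    thread_count=4,                    # number of parallel writer threads
    per_thread_copy_ahead=10_000_000,  # bytes each thread stages ahead
)
```
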
Tests:

```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```

test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but would time out on CI. We will use a thread-based PG and update this test in a following PR.

[T134844615]

Docstrings and comments will be added and updated in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
2022-11-30 08:19:41 +00:00
Kazuaki Ishizaki
1cd6ebe095 Fix typos in messages under torch (#89049)
This PR fixes typos in messages in `.py` files under the torch directory.
In `torch/onnx/symbolic_opset16.py` only, it also fixes a typo in a comment to make the operator name correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89049
Approved by: https://github.com/lezcano
2022-11-17 04:18:14 +00:00
Fuzzkatt
8ba62bdff5 add test_c10d_spawn_ucc.py (#86508)
Initial PR to create the UCC equivalent of https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_gloo.py and
https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_nccl.py. Currently only common ops are added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86508
Approved by: https://github.com/kwen2501
2022-11-16 22:50:11 +00:00
Charlie Yan
ee05f47bdd Rebase and re-land thread PG (#88795)
The previous PR (https://github.com/pytorch/pytorch/pull/88627) was reverted due to a failed check. After rebasing and rerunning, all checks passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88795
Approved by: https://github.com/huydhn, https://github.com/wanchaol
2022-11-15 21:58:58 +00:00
PyTorch MergeBot
c7fc710459 Revert "[3/n] Thread PG: add threaded PG implementation (#88627)"
This reverts commit 6dd081846e.

Reverted https://github.com/pytorch/pytorch/pull/88627 on behalf of https://github.com/huydhn due to This breaks one macos m1 test 6dd081846e in trunk. PR also fails with the same issue so I think trymerge code has a bug here letting this one merged
2022-11-09 22:38:41 +00:00
Charlie Yan
6dd081846e [3/n] Thread PG: add threaded PG implementation (#88627)
Summary: After the previous two diffs, we can finally add the threaded ProcessGroup implementation.

Test Plan: TBD

Reviewed By: XilunWu

Differential Revision: D40992593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88627
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-11-09 20:51:11 +00:00
Will Constable
70b00b1383 Add hf_bert + DDP multigpu test (#88435)
Spot-checks an end-to-end model working with DDP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88435
Approved by: https://github.com/davidberard98
2022-11-04 03:17:48 +00:00
Fuzzkatt
d13f1e6ab4 Add sequence number support for UCC (#85047)
Add sequence number support for UCC, mostly following the format of ProcessGroupNCCL.
Passes the new test: `test_all_gather_object_subgroup`
Adds skips for the gather tests: `test_gather_object` and `test_gather_object_subgroup`

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85047
Approved by: https://github.com/kwen2501
2022-10-31 03:56:55 +00:00
PyTorch MergeBot
f451e824f3 Revert " C10D extension to enable per-thread PG (#86348)"
This reverts commit 97abc21f2b.

Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests 97abc21f2b
2022-10-14 01:26:46 +00:00
Rodrigo Kumpera
97abc21f2b C10D extension to enable per-thread PG (#86348)
Move a bunch of globals to instance methods and replace all uses of them.

We move all PG-related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.

It almost gets DDP working; the PG is only missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and having the default _World wrap them.
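
A hedged sketch of that hack (the subclass is hypothetical; `_World` and `_world` are per the description above):
```
import threading

import torch.distributed.distributed_c10d as c10d

class _ThreadLocalWorld(c10d._World):
    _tls = threading.local()

    @property
    def default_pg(self):
        return getattr(self._tls, "default_pg", None)

    @default_pg.setter
    def default_pg(self, value):
        self._tls.default_pg = value

# Each thread now sees its own default process group.
c10d._world = _ThreadLocalWorld()
```
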
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
2022-10-13 22:23:28 +00:00
PyTorch MergeBot
6fae62b35f Revert "C10D extension to enable per-thread PG (#84153)"
This reverts commit 5cbffbbac9.

Reverted https://github.com/pytorch/pytorch/pull/84153 on behalf of https://github.com/kumpera due to broke internal stuff
2022-09-29 13:51:05 +00:00
Rodrigo Kumpera
5cbffbbac9 C10D extension to enable per-thread PG (#84153)
Move a bunch of globals to instance methods and replace all uses of them.

We move all PG-related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.

It almost gets DDP working; the PG is only missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
2022-09-27 21:42:31 +00:00
Xiang Gao
08c4f8c7a7 ProcessGroupUCC tests (#83285)
- [x] Direct dependency on UCX is completely removed, UCC active set API always enabled
- [x] Remove `TORCH_UCC_PROFILING_ENABLE`, always enable profiling
- [x] Fixes profiling of `recv` and `all_gather`
- [x] Use the NCCL TL of UCC on CUDA, as the UCP TL is not well supported on CUDA

Most tests are passing, but there are a few skipped tests:
- `scatter` and `gather` are not supported by the UCP TL of UCC on CPU tensors
- A few flaky tests in PyTorch's CI environment
- Profiler-related failures, some of which will be fixed by @Fuzzkatt in https://github.com/pytorch/pytorch/pull/84368

After this PR is merged, I will continue to work on these skipped failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83285
Approved by: https://github.com/vtlam, https://github.com/malfet, https://github.com/kwen2501
2022-09-10 10:56:05 +00:00
Alexander Grund
acb11da556 Increase default test timeout for distributed tests (#80330)
When running on clusters, the startup time for the subprocesses might be much higher, which leads to spurious failures.
So increase the timeout to 300s, similar to torch/testing/_internal/distributed/distributed_test.py.

Also introduces `DISTRIBUTED_TESTS_DEFAULT_TIMEOUT` as suggested by @malfet in #55896
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80330
Approved by: https://github.com/malfet
2022-09-05 21:23:50 +00:00
Rodrigo Kumpera
dac3fba274 Add testing workaround for EFA and TensorPipe (#77363)
This is a workaround for EFA when using TensorPipe.

This allows RPC-enabled tests to be run on AWS clusters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77363
Approved by: https://github.com/wanchaol
2022-05-18 22:54:15 +00:00
Arindam Roy
24372eb5ac ROCM: Increase timeout for flaky test_with_kwargs (#76706)
Fixes #75665.

Very rarely, the test fails on ROCm at the boundary
of the current timeout, which is set to 100 seconds.
Set the default timeout to 200 seconds on ROCm,
for only this test, to be on the safe side.

Signed-off-by: Arindam Roy <rarindam@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76706
Approved by: https://github.com/janeyx99, https://github.com/pruthvistony
2022-05-10 00:10:28 +00:00
Jagadish Krishnamoorthy
6ca8272d46 [Distributed tests] Add skip for odd world_size condition
As per https://github.com/pytorch/pytorch/issues/74995, the tests
need to be skipped for an odd WORLD_SIZE.
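
A hedged sketch of such a guard (decorator illustrative, not the PR's exact helper):
```
import os
import unittest

class DistributedTest(unittest.TestCase):
    @unittest.skipIf(
        int(os.environ.get("WORLD_SIZE", "2")) % 2 != 0,
        "test requires an even WORLD_SIZE",
    )
    def test_even_world_only(self):
        self.assertTrue(True)
```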

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Fixes #74995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76136
Approved by: https://github.com/kumpera, https://github.com/wayi1
2022-04-24 22:04:30 +00:00
pritam
3a38f175dd Convert DDP parameters to ReplicatedTensor during forward pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75753

As per the design in https://github.com/pytorch/pytorch/issues/72138,
convert DDP parameters to ReplicatedTensor during its forward pass. Concretely,
this is done as follows:

1) Create a separate `_replicated_tensor_module` which is a copy of self.module
without creating copies of the Tensors themselves.
2) Use `_replicated_tensor_module` instead of `self.module` during the forward
pass.
3) Have a context manager `_ddp_replicated_tensor` to enable this, since
certain edge cases can fail where self.module is changed out of band, resulting
in a discrepancy between self.module and `_replicated_tensor_module`.

Differential Revision: [D35533736](https://our.internmc.facebook.com/intern/diff/D35533736/)

Approved by: https://github.com/wanchaol, https://github.com/rohan-varma
2022-04-18 03:27:23 +00:00
Andrew Gu
9012e8d65a [ZeRO][BE] Clean up ZeRO tests (#73842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73842

**Overview**
This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for the heavy formatting changes mixed in with the actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.

The main non-formatting changes include:
- Using `parametrize` instead of manually including `for` loops over possible argument values (see the sketch after this list)
- Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed`
- Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness
- Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed`
    - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`.
    - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
    - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.
- Renaming `test_multiple_groups()` to `test_nondefault_process_group()`
    - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.
    - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:
1d497114e7/test/distributed/optim/test_zero_redundancy_optimizer.py (L658-L684)
- Changing `_test_zero_model_parallel()` to not use CPU
    - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.
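
A hedged sketch of the `parametrize` pattern mentioned in the first bullet (toy test, not from this PR):
```
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)

class TestExample(TestCase):
    @parametrize("use_bucket_view", [False, True])
    def test_something(self, use_bucket_view: bool):
        # One test instance is generated per parameter value.
        self.assertIn(use_bucket_view, (False, True))

instantiate_parametrized_tests(TestExample)

if __name__ == "__main__":
    run_tests()
```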

**Questions**
- How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34675709

Pulled By: awgu

fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb
(cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)
2022-03-08 13:15:20 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
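
A hedged usage sketch of the three new APIs (the `DebugLevel` enum name is assumed):
```
import torch.distributed as dist

dist.set_debug_level(dist.DebugLevel.DETAIL)  # raise verbosity at runtime
level = dist.get_debug_level()                # query the current level
dist.set_debug_level_from_env()               # re-read TORCH_DISTRIBUTED_DEBUG
```
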
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Wanchao Liang
6feba4bc7e Implement scatter primitive for ProcessGroupNCCL (#70029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70029

This PR implements NCCL scatter and adds scatter to ProcessGroupNCCL.

NCCL doesn’t directly provide a scatter primitive, so it needs to be implemented on top of NCCL’s send/recv API.

1. In ProcessGroupNCCL.cpp, the inputTensors are first flattened; then the outputTensors and inputFlattened are passed by the collective class to the scatter() function in nccl.cpp.
2. In nccl.cpp, scatter is implemented using ncclSend/ncclRecv: the root rank uses a for loop to send (distribute) the inputTensors to each rank, and every other rank receives its inputTensor from the root rank.
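
A hedged Python analogue of the algorithm (the real implementation is C++ in nccl.cpp; names here are illustrative):
```
import torch.distributed as dist

def scatter_via_send_recv(output, inputs, root, rank, world_size):
    if rank == root:
        for dst in range(world_size):
            if dst == root:
                output.copy_(inputs[dst])    # local shard: plain copy
            else:
                dist.send(inputs[dst], dst)  # root distributes each shard
    else:
        dist.recv(output, src=root)          # non-root ranks receive theirs
```
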
ghstack-source-id: 147754837

Test Plan:
test_scatter_ops
test_scatter_stress
test_scatter_checks

Reviewed By: pritamdamania87

Differential Revision: D33154823

fbshipit-source-id: 4513e7eaf7d47a60eb67da99dc6c2e9a2882f3fd
(cherry picked from commit 93201f9d4a)
2022-01-27 19:37:55 +00:00
Wanchao Liang
9b53d3194c Implement gather primitive for ProcessGroupNCCL (#66745)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66745

This PR implements NCCL gather and adds gather to ProcessGroupNCCL using the NCCL send/recv API.

NCCL doesn’t directly provide a gather primitive, so it needs to be implemented on top of NCCL’s send/recv API.
1. In ProcessGroupNCCL.cpp, the outputTensors are first flattened; then the inputTensors and outputFlattened are passed by the collective class to the gather() function in nccl.cpp.
2. In nccl.cpp, gather is implemented using ncclSend/ncclRecv: all ranks send their inputTensor to the root rank, and the root rank uses a for loop to receive these inputTensors.
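
A hedged Python analogue of the algorithm (illustrative names; the real implementation is C++ in nccl.cpp):
```
import torch.distributed as dist

def gather_via_send_recv(outputs, input_tensor, root, rank, world_size):
    if rank == root:
        for src in range(world_size):
            if src == root:
                outputs[src].copy_(input_tensor)  # root's own contribution
            else:
                dist.recv(outputs[src], src=src)  # collect from each rank
    else:
        dist.send(input_tensor, root)             # everyone sends to root
```
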
ghstack-source-id: 147754838

Test Plan:
test_gather_ops
test_gather_checks
test_gather_stress

Reviewed By: pritamdamania87

Differential Revision: D29616361

fbshipit-source-id: b500d9b8e67113194c5cc6575fb0e5d806dc7782
(cherry picked from commit d560ee732e)
2022-01-27 19:37:55 +00:00
Rohan Varma
2bed616e0f [Dist tests] Make event_listener work for all dist tests (#70628)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70628

The event_listener thread is used to log process tracebacks when a timed-out
process sends it a request for its traceback. However, this thread is
created in the `_run` function, which is overridden by some classes such as
`TestDistBackend`, so those tests did not have this feature. Move the
event_listener setup logic to `run_test`, which is called by all distributed
test classes, enabling it for all distributed tests. Also modify the logger
setup to ensure that logging.info calls are printed in the subprocess.
ghstack-source-id: 146714642

Test Plan: CI

Reviewed By: jaceyca, fduwjj

Differential Revision: D33410613

fbshipit-source-id: aa616d69d251bc9d04e45781c501d2244f011843
2022-01-09 14:54:09 -08:00
Bryan Reese
51b6981c36 [PyTorch Tests] Split out skip logic, make changes for plugins (#67256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67256

To change what tests can be run in various cases, the check logic should be moved to functions and variables that can be changed.

One challenge here is that decorators don't have dynamic functionality. If something is read in when imported and then changed afterwards, it will not actually change. This means we need to separate out the variables that need to be changed for our use case.

Those are put into common_distributed.py and can be changed before importing the distributed_test.py code.

The use case is to add new backends to the tests and split them into tests that can be run on demand as a separate instance. To do so, you would change DistTestSkipCases after importing it into a launcher or a setup script and then load distributed_test.

Test Plan: Check the signals

Reviewed By: mrshenli

Differential Revision: D31906947

fbshipit-source-id: 45e3258c55f4dc34e12a468bed65280f4c25748f
2021-12-08 12:23:15 -08:00
Rohan Varma
d44d59aa70 [BE] Enable C++ stacktraces for MultiProcessTestCase (#69175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69175

Shows C++ stacktraces for Python distributed tests that inherit from
MultiProcessTestCase. Closes https://github.com/pytorch/pytorch/issues/69168.
ghstack-source-id: 145085858

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D32736872

fbshipit-source-id: 743e870eefa7a9e77c5791d0936e2ebd5c9b1016
2021-12-08 11:57:51 -08:00
Rohan Varma
cb14a258a2 [c10d] Fix object-based collectives for debug mode (#68223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223

DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.

Instead, check whether it is a wrapped PG, and then check the wrapped PG to see if it is NCCL.
ghstack-source-id: 143242023

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32366840

fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
2021-11-13 04:18:31 -08:00