Commit Graph

21 Commits

Author SHA1 Message Date
cyy
b0dfd242fa Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
Python 3.9 has `spawn`.
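For context, a hedged sketch of the kind of guard being removed; the exact definition in the shared test utilities may differ, and the env-var name here is taken from the PR title:

```python
import os

# illustrative: gate spawn-based tests behind an env var, unnecessary once
# every supported Python (3.9+) ships the `spawn` start method
NO_MULTIPROCESSING_SPAWN = os.environ.get("NO_MULTIPROCESSING_SPAWN", "0") == "1"
```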

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 05:53:19 +00:00
PyTorch MergeBot
926b7b5027 Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705)"
This reverts commit 40ad5e01df.

Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint? I guess a land race with the ruff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))
2025-02-28 03:04:38 +00:00
cyyever
40ad5e01df Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
Python 3.9 has `spawn`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 00:15:32 +00:00
Xuehai Pan
db3290846e [BE][Easy][10/19] enforce style for empty lines in import segments in test/d*/ (#129761)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761
Approved by: https://github.com/fegin
2024-07-17 16:57:39 +00:00
Chien-Chin Huang
7420bad74c [BE] Do not assert if the barrier is not created (#129497)
The folder will be created as long as TEMP_DIR is set and the program has
write permission. This ensures that certain test environments can run the
spawn tests.
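A minimal sketch of the pattern this change moves to, replacing an assertion with best-effort creation; the helper name and layout are illustrative, not the actual test-suite code:

```python
import os
from typing import Optional

def init_barrier_dir(temp_dir: Optional[str]) -> Optional[str]:
    """Create the barrier directory instead of asserting that it exists."""
    if not temp_dir:
        return None  # TEMP_DIR unset: nothing to do
    barrier_dir = os.path.join(temp_dir, "barrier")
    # exist_ok avoids racing with other ranks; this raises only if the
    # program lacks write permission on temp_dir
    os.makedirs(barrier_dir, exist_ok=True)
    return barrier_dir
```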

Differential Revision: [D59020736](https://our.internmc.facebook.com/intern/diff/D59020736/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129497
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-06-26 05:51:36 +00:00
Yuanhao Ji
e3effa5855 Enable UFMT on all of test/distributed (#123539)
Partially addresses #123062

Ran lintrunner on:

- `test/distributed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539
Approved by: https://github.com/ezyang
2024-04-17 06:46:02 +00:00
PyTorch MergeBot
52be63eb2c Revert "Enable UFMT on all of test/distributed (#123539)"
This reverts commit 89ac37fe91.

Reverted https://github.com/pytorch/pytorch/pull/123539 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123539#issuecomment-2058329471))
2024-04-16 06:33:21 +00:00
Yuanhao Ji
89ac37fe91 Enable UFMT on all of test/distributed (#123539)
Partially addresses #123062

Ran lintrunner on:

- `test/distributed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539
Approved by: https://github.com/ezyang
2024-04-16 03:23:56 +00:00
Will Constable
418c5206ec Make test_distributed_spawn.py tell you how to run it correctly (#112924)
Sample output if incorrect/missing args are specified:

```
RuntimeError: Missing expected env vars for `test_distributed_spawn.py`.  Please
ensure to specify the following:
'BACKEND' = one of ('gloo', 'nccl', 'ucc')
'WORLD_SIZE' = int >= 2
'TEMP_DIR' specifying a directory containing a barrier file named
'barrier'.

e.g.
touch /tmp/barrier && TEMP_DIR=/tmp BACKEND='nccl' WORLD_SIZE=2 python
/data/users/whc/pytorch/test/distributed/test_distributed_spawn.py
```
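A hedged sketch of how such an up-front check might look; the helper below is illustrative, not the exact code in `test_distributed_spawn.py`:

```python
import os

def check_required_env() -> None:
    """Fail fast with a readable message if required env vars are missing."""
    backend = os.environ.get("BACKEND")
    world_size = os.environ.get("WORLD_SIZE", "")
    temp_dir = os.environ.get("TEMP_DIR", "")
    ok = (
        backend in ("gloo", "nccl", "ucc")
        and world_size.isdigit() and int(world_size) >= 2
        and bool(temp_dir)
        and os.path.exists(os.path.join(temp_dir, "barrier"))
    )
    if not ok:
        raise RuntimeError(
            "Missing expected env vars for `test_distributed_spawn.py`; "
            "set BACKEND, WORLD_SIZE, and TEMP_DIR as described above."
        )
```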
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112924
Approved by: https://github.com/wanchaol
2023-11-04 02:43:43 +00:00
Rohan Varma
f044613f78 Back out "Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)" (#103938)
Differential Revision: [D46883396](https://our.internmc.facebook.com/intern/diff/D46883396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103938
Approved by: https://github.com/awgu, https://github.com/fegin
2023-06-22 21:55:58 +00:00
Huy Do
b1ddd5a293 Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)
Per the discussion in https://github.com/pytorch/pytorch/pull/103629#issuecomment-1598001313, I preemptively created this revert PR to revert all commits in the stack. This seems like a safer option than using the bot, as the commit has already been in trunk since last week.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103873
Approved by: https://github.com/rohan-varma
2023-06-20 16:25:00 +00:00
Rohan Varma
80139fc2db [DDP] multiple forward support for static graph (#103487)
Adds support for multiple forward passes before the backward call for
static_graph=True.

There are 2 changes:
1) Track when to populate the static-graph-related maps based on backward
calls rather than forward iterations.
2) In DDP Python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce; use a flag instead (see the usage sketch below).
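A minimal usage sketch of what this enables, assuming a default process group has already been initialized (module and shapes are illustrative):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed.init_process_group(...) has already run
model = DDP(torch.nn.Linear(10, 10), static_graph=True)

x1, x2 = torch.randn(4, 10), torch.randn(4, 10)
# two forward passes before a single backward: supported by this change
loss = model(x1).sum() + model(x2).sum()
loss.backward()
```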

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103487
Approved by: https://github.com/awgu
2023-06-14 16:14:52 +00:00
Jeff Daily
72502b94f3 correct use of torch.backends.cudnn.flags() (#93182)
Fixes #77467.
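For reference, a hedged sketch of using the context manager correctly; the keyword arguments shown are the commonly documented ones and should be checked against the installed version:

```python
import torch

# flags() is a context manager: the settings apply only inside the block
# and are restored on exit, instead of leaking into later tests
with torch.backends.cudnn.flags(enabled=True, benchmark=False, deterministic=True):
    y = torch.nn.functional.conv2d(
        torch.randn(1, 3, 8, 8), torch.randn(4, 3, 3, 3)
    )
```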

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93182
Approved by: https://github.com/ngimel
2023-01-28 06:50:06 +00:00
Xiang Gao
08c4f8c7a7 ProcessGroupUCC tests (#83285)
- [x] Direct dependency on UCX is completely removed; the UCC active set API is always enabled
- [x] Remove `TORCH_UCC_PROFILING_ENABLE`; always enable profiling
- [x] Fixes profiling of `recv` and `all_gather`
- [x] Use the NCCL TL of UCC on CUDA, as the UCP TL is not well supported on CUDA

Most tests are passing, but there are a few skipped tests:
- `scatter` and `gather` are not supported by the UCP TL of UCC on CPU tensors
- A few flaky tests in PyTorch's CI environment
- Profiler-related failures, some of them will be fixed by @Fuzzkatt in https://github.com/pytorch/pytorch/pull/84368

After this PR is merged, I will continue to work on these skipped failures.
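A minimal sketch of selecting the UCC backend, assuming a PyTorch build with UCC support (the env-based rendezvous is for brevity):

```python
import os

import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# "ucc" routes collectives through ProcessGroupUCC, which picks a UCC TL
# (e.g. the NCCL TL on CUDA) under the hood
dist.init_process_group(backend="ucc", rank=0, world_size=1)
print(dist.get_backend())  # ucc
dist.destroy_process_group()
```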
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83285
Approved by: https://github.com/vtlam, https://github.com/malfet, https://github.com/kwen2501
2022-09-10 10:56:05 +00:00
Jane Xu
34051d74da Add test owner to distributed files starting with test_ (#66797)
Summary:
Action based on https://github.com/pytorch/pytorch/issues/66232

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797

Reviewed By: gchanan

Differential Revision: D31761389

Pulled By: janeyx99

fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c
2021-10-19 10:55:20 -07:00
Pritam Damania
f7611b31aa [4/N] Enable opt-asan for distributed unit tests. (#62051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051

The goal here is to enable opt-asan for "spawn"-based unit tests, since
opt-asan works with "spawn" unlike "dev-asan". As a result, we can run ASAN
for "spawn" unit tests as well.

This means we can completely remove the fork unit tests from the codebase,
since their only purpose was to run ASAN.
ghstack-source-id: 135523770

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29854514

fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
2021-08-10 22:38:31 -07:00
Pritam Damania
82d81455ae [2/N] Remove unittest.skip across all of torch.distributed. (#61887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887

1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests are reported as passing on Sandcastle (see the sketch below).
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.

The overall goal is to avoid skips, since Sandcastle tags these tests as
continuously skipping.
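A hypothetical sketch of the decorator's idea; the real implementation lives in the shared test utilities, and the environment check below is an assumption:

```python
import os
import unittest
from functools import wraps

def sandcastle_skip_if(condition, reason):
    """Report a would-be-skipped test as passing on Sandcastle (sketch)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                if os.environ.get("SANDCASTLE"):  # env-var name is an assumption
                    return  # pass silently instead of registering a skip
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```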
ghstack-source-id: 134382237

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29784152

fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
2021-07-27 10:53:23 -07:00
Xiang Gao
dfb5f029da Disable TF32 on DDP tests (#52941)
Summary:
When a system has both an Ampere and a non-Ampere card, lots of tests fail because the results on the two cards are different.
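A minimal sketch of disabling TF32 globally; these are standard PyTorch switches, though the tests may apply them through a decorator instead:

```python
import torch

# force full-precision FP32 matmuls and convolutions so Ampere and
# non-Ampere cards produce matching results
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```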

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52941

Reviewed By: albanD

Differential Revision: D26994287

Pulled By: mrshenli

fbshipit-source-id: 287537495fc13361104a4460f5bcd79a208b5d8d
2021-03-11 18:31:28 -08:00
Hong Xu
1b35b1a0c4 Properly skip distributed tests when distributed module is not built (#52945)
Summary:
Currently there is some code that intends to skip distributed tests if
the distributed module is not built. However, these checks are missing from
some test files, and in other test files they run only after the distributed
module is imported, which leads to failures. This causes a lot of headaches
when testing minimal builds locally.
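A sketch of the guard pattern in question, placed before any `torch.distributed` submodule import; the exact skip mechanism varies per file:

```python
import sys

import torch

# check availability first; importing distributed submodules on a build
# without the distributed module is what caused the failures
if not torch.distributed.is_available():
    print("torch.distributed is not available, skipping tests", file=sys.stderr)
    sys.exit(0)

import torch.distributed as dist  # safe only after the guard above
```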

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52945

Reviewed By: anjali411

Differential Revision: D26848241

Pulled By: ezyang

fbshipit-source-id: 983a848844add40869a86f3c9413503a3659b115
2021-03-05 10:28:47 -08:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735
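For reference, the kind of Python-2 compatibility boilerplate such a cleanup removes (illustrative):

```python
# unnecessary on Python 3, where these behaviors are already the default
from __future__ import absolute_import, division, print_function, unicode_literals
```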

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Rohan Varma
b22abbe381 Enable test_distributed to work with spawn mode (#41769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41769

Currently the tests in `test_distributed` only work with `fork`-mode multiprocessing; this PR introduces support for `spawn`-mode multiprocessing as well (while keeping `fork` mode intact).

Motivations for the change:
1) Spawn multiprocessing is the default on macOS, so it better emulates how macOS users would use distributed.
2) On Python 3.8+, spawn is the default start method on more platforms, so we should have test coverage for it.
3) PyTorch multiprocessing suggests using spawn/forkserver over fork for sharing CUDA tensors: https://pytorch.org/docs/stable/multiprocessing.html
4) Spawn is better supported by certain sanitizers such as TSAN, so adding this coverage may help us uncover issues.

How it is done:
1) Move the `test_distributed` tests in the `_DistTestBase` class to a shared file `distributed_test` (similar to how the RPC tests are structured).
2) For `Barrier`, refactor the setup of temp directories, as the current version did not work with spawn: each process would get a different randomly generated directory and thus write to a different barrier (see the sketch after the run command below).
3) Add all the relevant builds to run internally and in OSS.
Running test_distributed with spawn mode in OSS can be done with:
`python test/run_test.py -i distributed/test_distributed_spawn -v`
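A minimal sketch of the temp-directory fix in point 2: the parent creates the directory before spawning, so every rank shares it (names are illustrative):

```python
import os
import tempfile

import torch.multiprocessing as mp

def worker(rank: int) -> None:
    # every rank sees the same path because it was fixed before spawning
    barrier_dir = os.path.join(os.environ["TEMP_DIR"], "barrier")
    print(rank, barrier_dir)

if __name__ == "__main__":
    # with spawn, each child would otherwise generate its own random temp
    # dir; create it once in the parent and pass it through the environment
    os.environ.setdefault("TEMP_DIR", tempfile.mkdtemp())
    os.makedirs(os.path.join(os.environ["TEMP_DIR"], "barrier"), exist_ok=True)
    mp.spawn(worker, nprocs=2)
```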

Reviewed By: izdeby

Differential Revision: D22408023

fbshipit-source-id: e206be16961fd80438f995e221f18139d7e6d2a9
2020-09-08 23:11:12 -07:00