Commit Graph

31 Commits

Hari Krishna Sai Kodali
9d184bda2f add device generalization support for distributed tests (#156796)
MOTIVATION
To generalize Distributed test cases for non-CUDA devices

CHANGES

- test/distributed/checkpoint/test_fsspec.py
- test/distributed/checkpoint/test_state_dict.py
- test/distributed/test_multi_threaded_pg.py

Replaced hard-coded device names with `torch.accelerator.current_accelerator()` (see the sketch below)

- torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py

Added support for the hccl backend
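
For illustration, a minimal sketch of the device-generalization pattern (hedged: assumes `torch.accelerator.current_accelerator()` is available in the installed build; variable names are illustrative):

```python
import torch

# Derive the device type from the current accelerator (cuda, xpu, hpu, ...)
# instead of hard-coding "cuda"; fall back to CPU if no accelerator is present.
acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"

tensor = torch.ones(2, 2, device=device_type)
```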

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156796
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-07-16 09:37:03 +00:00
Xuehai Pan
db3290846e [BE][Easy][10/19] enforce style for empty lines in import segments in test/d*/ (#129761)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761
Approved by: https://github.com/fegin
2024-07-17 16:57:39 +00:00
Yuanhao Ji
e3effa5855 Enable UFMT on all of test/distributed (#123539)
Partially addresses #123062

Ran lintrunner on:

- `test/distributed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539
Approved by: https://github.com/ezyang
2024-04-17 06:46:02 +00:00
PyTorch MergeBot
52be63eb2c Revert "Enable UFMT on all of test/distributed (#123539)"
This reverts commit 89ac37fe91.

Reverted https://github.com/pytorch/pytorch/pull/123539 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123539#issuecomment-2058329471))
2024-04-16 06:33:21 +00:00
Yuanhao Ji
89ac37fe91 Enable UFMT on all of test/distributed (#123539)
Partially addresses #123062

Ran lintrunner on:

- `test/distributed`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123539
Approved by: https://github.com/ezyang
2024-04-16 03:23:56 +00:00
Tristan Rice
358ace1a1b functional_collectives: add first differentiable collective -- all_to_all_single_grad (#123599)
This adds the first differentiable collective, all_to_all_single_grad. This is the initial proof-of-concept PR; I will be adding the remaining collectives in follow-up PRs.

This adds a new function called `all_to_all_single_autograd`, the autograd variant of `all_to_all_single`. For backwards compatibility and initial testing, we keep the autograd variant separate to avoid regressions.

This uses `autograd::Function` to register an autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor, unlike the previous Python implementation, which had issues. Since this reuses the existing `_c10d_functional` ops, we don't need to register any meta functions or lowerings.

To avoid cudaStream issues, this explicitly calls `wait_tensor` in the backward method to ensure it runs on the same stream as the async operation. This hurts performance but can potentially be alleviated using `compile`.

Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py

Test plan:

```
pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile
pytest test/distributed/test_functional_api.py
```
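
For illustration, a rough usage sketch of the new autograd variant, assuming `all_to_all_single_autograd` lives in `torch.distributed._functional_collectives` and takes the same arguments as `all_to_all_single` (tensor, output/input split sizes, group); shapes and names are illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import all_to_all_single_autograd

# Sketch only: assumes a process group has already been initialized,
# e.g. dist.init_process_group("nccl") under torchrun.
world_size = dist.get_world_size()

# Each rank contributes one row per destination rank (even split).
inp = torch.randn(world_size, 8, device="cuda", requires_grad=True)
splits = [1] * world_size

out = all_to_all_single_autograd(inp, splits, splits, dist.group.WORLD)

# wait_tensor is called internally in backward; gradients flow back
# through the collective to inp.
out.sum().backward()
```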

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599
Approved by: https://github.com/yifuwang
2024-04-12 01:48:49 +00:00
Andrew Gu
4cf6d1172b [FSDP2] Used ReduceOp.AVG if fp32 reduce-scatter (#120919)
This PR uses the `ncclAvg` op (via `ReduceOp.AVG`) when doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save an extra memory read/write for the division. This yields a ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅).
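
For illustration, a minimal sketch of folding the world-size division into the collective (assumes an initialized NCCL process group; shapes are illustrative):

```python
import torch
import torch.distributed as dist

# Sketch: assumes dist.init_process_group("nccl") has already been called.
world_size = dist.get_world_size()
local_grads = torch.randn(world_size * 1024, device="cuda")
shard = torch.empty(1024, device="cuda")

# ReduceOp.AVG divides by world size inside the NCCL reduce-scatter kernel,
# avoiding a separate elementwise divide over the reduced output.
dist.reduce_scatter_tensor(shard, local_grads, op=dist.ReduceOp.AVG)
```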

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
ghstack dependencies: #120238, #120910
2024-03-02 00:39:16 +00:00
Rodrigo Kumpera
02cd971e95 [C10D] Improve MTPG autograd test. Fixes #105106 (#105356)
Explicitly asserts that the backward pass runs on the same thread as the forward pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105356
Approved by: https://github.com/rohan-varma, https://github.com/wanchaol, https://github.com/fduwjj
2023-07-20 13:51:21 +00:00
Rodrigo Kumpera
246dc0d9f2 [MTPG] Use TLS propagation to enable MTPG from bwd. (#104735)
We use PyTorch's built-in TLS propagation in ThreadLocalState to forward the world object from the forward thread to the backward thread.

This further closes the gap on enabling FSDP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104735
Approved by: https://github.com/rohan-varma
2023-07-12 18:47:02 +00:00
Iris
15eed5b73e [Oncall][MTPG] Fix flaky test multi_threaded - test_broadcast_object_list (#103568)
This test (8340762211/test/distributed/test_multi_threaded_pg.py, L133) is failing on an internal sandbox with the following error message:
```
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
    raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
 exiting thread 1
ERROR
```

Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0

We believe this is because we no longer perform a barrier after init (see https://github.com/pytorch/pytorch/pull/99937).
This PR temporarily turns `TORCH_DIST_INIT_BARRIER` back on to avoid the flaky test for the time being, but we should look into a proper fix.
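
For reference, a hedged sketch of forcing the post-init barrier via the environment variable (assuming the flag is read when the process group is initialized):

```python
import os

# Must be set before torch.distributed.init_process_group is called so the
# barrier-after-init behavior is re-enabled for the test.
os.environ["TORCH_DIST_INIT_BARRIER"] = "1"
```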

cc. @kumpera @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568
Approved by: https://github.com/H-Huang
2023-06-18 07:05:28 +00:00
Rodrigo Kumpera
5b4a523583 Add all_reduce_coalesced to functional collectives (#98640)
This adds all_reduce_coalesced to MTPG to ease testing.
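
For illustration, a hedged usage sketch, assuming the functional variant takes a list of tensors, a reduce-op string, and a group (names and shapes are illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import all_reduce_coalesced

# Sketch only: assumes an initialized process group. All tensors are
# reduced in one coalesced collective instead of one all_reduce per tensor.
tensors = [torch.ones(4), torch.ones(8)]
reduced = all_reduce_coalesced(tensors, "sum", dist.group.WORLD)
```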

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98640
Approved by: https://github.com/wanchaol
2023-04-26 17:05:54 +00:00
Xilun Wu
89894115ab [MTPG] add all_to_all collective to MTPG (#98791)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98791
Approved by: https://github.com/kumpera
2023-04-11 21:35:45 +00:00
Rodrigo Kumpera
8177081848 Add gather to MTPG (#97555)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97555
Approved by: https://github.com/H-Huang
2023-03-27 19:37:02 +00:00
Rodrigo Kumpera
1d3c394d5e [MTPG] Improve all_reduce and handle bwd thread support (#95524)
This implements all reduce ops in all_reduce and supports using a PG from a thread other than the one that created it.

We should be this >< close to getting complex training tests working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95524
Approved by: https://github.com/H-Huang
2023-03-03 18:53:36 +00:00
Wanchao Liang
e16979c9a0 [threaded_pg] full rewrite of MultiThreadedTestCase to enable device_type tests (#91650)
This PR does a full rewrite of MultiThreadedTestCase to align it more closely with
MultiProcessTestCase, and changes how it does spawning and testing so that we can
embed thread-local state when running tests.

This PR enables device_type tests to work with MultiThreadedTestCase
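
For illustration, a rough sketch of a test written against the rewritten harness, assuming the `world_size` property and `_spawn_threads()` pattern used in test/distributed/test_multi_threaded_pg.py:

```python
import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiThreadedTestCase
from torch.testing._internal.common_utils import run_tests

class ThreadedAllReduceTest(MultiThreadedTestCase):
    @property
    def world_size(self):
        return 4

    def setUp(self):
        super().setUp()
        self._spawn_threads()  # one thread per rank, each with its own PG state

    def test_all_reduce_sum(self):
        t = torch.ones(2) * (self.rank + 1)
        dist.all_reduce(t)  # runs against the in-process threaded PG
        expected = torch.ones(2) * sum(range(1, self.world_size + 1))
        self.assertEqual(t, expected)

if __name__ == "__main__":
    run_tests()
```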
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91650
Approved by: https://github.com/XilunWu
2023-01-17 03:26:36 +00:00
Wanchao Liang
9942ddd5b3 [threaded_pg] enable subpg creation and concurrent collective (#91649)
This PR refactors the threaded PG logic to enable creating multiple sub-PGs
under the world threaded PG, and to allow collectives to be called
concurrently on different sub-PGs (see the sketch below).
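
For illustration, a minimal sketch of the enabled pattern: every rank creates both sub-PGs (new_group is collective), then issues a concurrent collective on its own group (assumes a 4-rank world already initialized inside a threaded test body):

```python
import torch
import torch.distributed as dist

# new_group must be called by all ranks for every subgroup.
group_a = dist.new_group(ranks=[0, 1])
group_b = dist.new_group(ranks=[2, 3])
my_group = group_a if dist.get_rank() < 2 else group_b

# Both sub-PGs can now run collectives concurrently.
t = torch.ones(1) * (dist.get_rank() + 1)
dist.all_reduce(t, group=my_group)
```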
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91649
Approved by: https://github.com/XilunWu
2023-01-17 03:26:34 +00:00
Xilun Wu
a6dcebf997 [threaded pg] make exception handling consistent with MultiProcessTestCase (#90712)
Differential Revision: [D42153661](https://our.internmc.facebook.com/intern/diff/D42153661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90712
Approved by: https://github.com/wanchaol
2022-12-20 23:37:40 +00:00
Xilun Wu
34da446072 [threaded pg] add assertion util to MultiThreadedTestCase (#90595)
Differential Revision: [D42153662](https://our.internmc.facebook.com/intern/diff/D42153662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90595
Approved by: https://github.com/wanchaol
2022-12-20 23:37:40 +00:00
Xilun Wu
3759777edc [threaded PG] fix long hang issue in testing (#90515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90515
Approved by: https://github.com/wanchaol
2022-12-09 05:24:08 +00:00
Xilun Wu
91dcef41ae Thread PG: add allreduce to threaded pg (#89043)
Summary:
Goal
Add `all_reduce` collective to the multi-threaded ProcessGroup added in D40236769 (6663ae5537).

Code Motion
Added `allreduce` collective to ProcessLocalGroup (a subclass of c10d ProcessGroup).

What's Next
Add a DDP test utilizing the new allreduce op.
Generalize `allreduce` to allow other `ReduceOp`s besides `SUM`.

Test Plan:

```bash
cd fbcode/caffe2
buck2 test mode/dev //caffe2/test/distributed:multi_threaded
```

Differential Revision: D41046606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89043
Approved by: https://github.com/wanchaol
2022-11-23 19:43:30 +00:00
Wanchao Liang
821ba6b51b [4/n] Thread PG: add reduce_scatter to threaded pg (#89442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89442
Approved by: https://github.com/yhcharles, https://github.com/fduwjj
2022-11-21 22:36:44 +00:00
Wanchao Liang
3e99d4db76 [3/n] Thread PG: add scatter to threaded pg (#89441)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89441
Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj
2022-11-21 22:36:44 +00:00
Wanchao Liang
3876f94c3d [2/n] Thread PG: add test for broadcast (#89440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89440
Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj
2022-11-21 22:36:42 +00:00
Wanchao Liang
deae450899 [1/n] Thread PG: add test for allgather (#89439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89439
Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj
2022-11-21 22:36:41 +00:00
Charlie Yan
ee05f47bdd Rebase and re-land thread PG (#88795)
The previous PR (https://github.com/pytorch/pytorch/pull/88627) was reverted due to a failed check. After rebasing and rerunning, all checks passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88795
Approved by: https://github.com/huydhn, https://github.com/wanchaol
2022-11-15 21:58:58 +00:00
PyTorch MergeBot
c7fc710459 Revert "[3/n] Thread PG: add threaded PG implementation (#88627)"
This reverts commit 6dd081846e.

Reverted https://github.com/pytorch/pytorch/pull/88627 on behalf of https://github.com/huydhn because it breaks one macOS M1 test (6dd081846e) in trunk. The PR also fails with the same issue, so the trymerge code likely has a bug that let this one merge.
2022-11-09 22:38:41 +00:00
Charlie Yan
6dd081846e [3/n] Thread PG: add threaded PG implementation (#88627)
Summary: After the previous two diffs, we can finally add the threaded ProcessGroup implementation.

Test Plan: TBD

Reviewed By: XilunWu

Differential Revision: D40992593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88627
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-11-09 20:51:11 +00:00
PyTorch MergeBot
f451e824f3 Revert "C10D extension to enable per-thread PG (#86348)"
This reverts commit 97abc21f2b.

Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn: sorry for reverting your PR, but it breaks macOS tests (97abc21f2b)
2022-10-14 01:26:46 +00:00
Rodrigo Kumpera
97abc21f2b C10D extension to enable per-thread PG (#86348)
Move a bunch of globals to instance methods and replace all uses of them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point for injecting full control over how c10d state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal, enabling per-thread PGs (see the sketch below).
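
For illustration, a heavily hedged sketch of that threadlocal hack; `_ThreadLocalWorld` is an illustrative name, not a PyTorch API, and it assumes `torch.distributed.distributed_c10d` exposes the `_World` class and module-level `_world` object this PR introduces:

```python
import threading
import torch.distributed.distributed_c10d as c10d

class _ThreadLocalWorld:
    """Illustrative only: gives each thread its own c10d world state."""

    def __init__(self):
        # Bypass our own __setattr__ so _tls lands on this instance.
        object.__setattr__(self, "_tls", threading.local())

    def _state(self):
        if not hasattr(self._tls, "world"):
            self._tls.world = c10d._World()  # fresh PG state for this thread
        return self._tls.world

    # Forward every attribute c10d reads/writes (default_pg, pg_map, ...)
    # to the per-thread _World instance.
    def __getattr__(self, name):
        return getattr(self._state(), name)

    def __setattr__(self, name, value):
        setattr(self._state(), name, value)

# Swap in the thread-local implementation (undocumented, test-only hack).
c10d._world = _ThreadLocalWorld()
```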

It almost gets DDP working; the PG is only missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and having the default _World wrap them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
2022-10-13 22:23:28 +00:00
PyTorch MergeBot
6fae62b35f Revert "C10D extension to enable per-thread PG (#84153)"
This reverts commit 5cbffbbac9.

Reverted https://github.com/pytorch/pytorch/pull/84153 on behalf of https://github.com/kumpera due to broke internal stuff
2022-09-29 13:51:05 +00:00
Rodrigo Kumpera
5cbffbbac9 C10D extension to enable per-thread PG (#84153)
Move a bunch of globals to instance methods and replace all uses of them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point for injecting full control over how c10d state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal, enabling per-thread PGs.

It almost gets DDP working; the PG is only missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
2022-09-27 21:42:31 +00:00