Commit Graph

90 Commits

Howard Huang
600d0d0284 Add "cuda" to MPI backend capabilities (#109614)
Summary: Fixes https://github.com/pytorch/pytorch/issues/109543

Test Plan: We need to run CUDA-aware MPI in PyTorch to actually test this change; we currently have no MPI tests.

Differential Revision: D49420438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109614
Approved by: https://github.com/XilunWu
2023-09-21 13:34:58 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy error-handling code like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've added the following error types (a usage sketch follows the list):

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
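For illustration, error handling can then key off exception types instead of string matching. A minimal sketch, assuming the new classes are exposed under `torch.distributed` (the handler bodies are placeholders):

```python
import torch.distributed as dist

try:
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=2)
except dist.DistStoreError:
    ...  # errors originating from the store (e.g. store-based barrier timeout)
except dist.DistNetworkError:
    ...  # socket-level errors (e.g. broken pipe, connection reset by peer)
except dist.DistBackendError:
    ...  # process group backend errors (e.g. NCCL failures)
except dist.DistError:
    raise  # base type of all distributed errors
```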

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing an inductor test in trunk for one of its models, moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy error-handling code like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've added the following error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
Justin Chu
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
Rodrigo Kumpera
9e1b07e692 [C10d] Handle bool tensors in gloo. Fixes #103585. (#105354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105354
Approved by: https://github.com/wanchaol
2023-07-18 20:42:58 +00:00
Rohan Varma
f044613f78 Back out "Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)" (#103938)
Differential Revision: [D46883396](https://our.internmc.facebook.com/intern/diff/D46883396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103938
Approved by: https://github.com/awgu, https://github.com/fegin
2023-06-22 21:55:58 +00:00
Ashwin Hari
cf0aa38005 Allow ORT backend for DTensor (#101914)
fixes #101911

Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend.

* The `Backend.NAME` attribute now has the value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl (see the sketch after this list).
* remove unused `_check_for_nccl_backend` function
* add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models
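A hypothetical sketch of the attribute behavior; the `dummy` backend name and its creator function are made up for illustration:

```python
import torch.distributed as dist

def _dummy_backend(store, rank, size, timeout):
    # Hypothetical process-group creator for a custom backend.
    raise NotImplementedError

dist.Backend.register_backend("dummy", _dummy_backend)

# The generated attribute now holds the lowercase name, matching
# built-in backends such as dist.Backend.NCCL == "nccl".
assert dist.Backend.DUMMY == "dummy"
```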
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914
Approved by: https://github.com/wanchaol
2023-06-01 22:37:09 +00:00
shaoyf42
8d7e082300 [c10d] Add is_backend_available for c10d backend. (#101945)
Add `is_backend_available` for c10d backends, covering both the built-in backends and third-party backends registered through `Backend.register_backend`.

There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553
> For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101 to also add their own is_available property

It is natural for users to add their own `is_available` check when they create a backend. One possible way is to let users call `is_X_available` in the same way as for the native backends, for example by dynamically adding a `torch.distributed.is_dummy_available()` function. This is why we want to dynamically add `is_X_available` to `torch.distributed` in `register_backend`.

> Or we could add an `is_available(backend)` function that checks for the backend.

Providing a public function is indeed another good approach. We have implemented `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945, which supports both built-in backends and third-party backends.
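For example (a sketch; the availability of each backend depends on how PyTorch was built, and `dummy` is a hypothetical third-party backend name):

```python
import torch.distributed as dist

# Built-in backends:
print(dist.is_backend_available("gloo"))   # True in typical builds
print(dist.is_backend_available("nccl"))   # True only in CUDA builds with NCCL

# Third-party backends are reported as available once registered via
# dist.Backend.register_backend("dummy", ...).
print(dist.is_backend_available("dummy"))
```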

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945
Approved by: https://github.com/H-Huang
2023-05-31 22:51:51 +00:00
Ke Wen
0848ed21b8 [c10d] Figure out device to use for object collectives (#100954)
Fixes https://github.com/pytorch/pytorch/issues/97938

This PR is a clone of https://github.com/pytorch/pytorch/pull/100238, which is important to me, but
@kwen2501 has not resolved the conflict there, so this PR is submitted to resolve it.
The only conflict is in `distributed_c10d.py:2653`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954
Approved by: https://github.com/kwen2501
2023-05-11 01:49:09 +00:00
Xiaodong Wang
c29ab84115 Fix bug in process_group_name when there is duplicate pgs (#100518)
Summary: With the new c10d API, we don't need all ranks to call new_group. Integrate with the new API so that every rank calls new_group just 3 times, with a local barrier among the members within the group.

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518
Approved by: https://github.com/kumpera
2023-05-04 02:12:28 +00:00
Justin Chu
01abbfbaae [BE] Fix all B022 useless-contextlib-suppress (#100335)
When no arguments are passed to `contextlib.suppress`, no exceptions will be suppressed, so the context manager is redundant.
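For illustration, the pattern flagged by B022 versus the intended usage:

```python
import contextlib
import os

# Flagged by B022: suppress() with no arguments suppresses nothing,
# so this context manager is redundant.
with contextlib.suppress():
    print("runs exactly as if the with-block were not there")

# Intended usage names the exception(s) to swallow:
with contextlib.suppress(FileNotFoundError):
    os.remove("maybe-missing.txt")
```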

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100335
Approved by: https://github.com/Skylion007
2023-04-30 18:47:40 +00:00
Rodrigo Kumpera
ad21890f8f [c10d] Scalable PG initiation. (#99931)
Add use_local_synchronization argument to new_group.

When this argument is True, new_group performs a store barrier only on the ranks that are part of the group, not on the whole cluster.

This addresses both the scalability and the composability problems associated with new_group.
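A minimal sketch of the new argument (assumes the default process group is already initialized, e.g. under torchrun):

```python
import torch.distributed as dist

world_size = dist.get_world_size()
even_ranks = list(range(0, world_size, 2))

# With use_local_synchronization=True, only the member ranks run the store
# barrier (and only they need to call new_group); the rest of the cluster
# is not involved in creating this group.
if dist.get_rank() in even_ranks:
    pg = dist.new_group(ranks=even_ranks, use_local_synchronization=True)
```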

Fixes #81291.

This is relanding #84224
As part of the original PR, I did a quick benchmark of creating 3 PGs per rank with both settings; performance is as follows:

new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sub-linear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.

Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.

Setup:

1 AWS host, backend gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
2023-04-27 13:44:02 +00:00
Howard Huang
760967a284 Update _store_based_barrier implementation to reduce load on rank 0 (#98000)
Summary:

Update from using `add()`, which overloads rank 0 with requests, to a single request every 10 seconds to handle the last joined worker.
Added an optional `logging_interval` arg to `_store_based_barrier`.
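A generic sketch of the idea (not the exact PyTorch internals): rather than every rank repeatedly polling rank 0's store with `add()`, the last worker to arrive releases the others through a key they block on:

```python
from datetime import timedelta

def store_barrier(store, world_size, timeout=timedelta(seconds=300)):
    # `store` could be e.g. a torch.distributed.TCPStore shared by all ranks.
    arrived = store.add("barrier/arrived", 1)  # single atomic increment per rank
    if arrived == world_size:
        store.set("barrier/done", "1")         # last worker releases everyone
    else:
        store.wait(["barrier/done"], timeout)  # block until the key exists
```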

Test Plan:
```
pytest test/distributed/test_c10d_common.py -vsk test_store_based_barrier
```

Reviewed By: rohan-varma

Differential Revision: D44430531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98000
Approved by: https://github.com/kumpera
2023-04-11 14:25:29 +00:00
Howard Huang
61c74ab0f8 Fix MPI rank and world size pg initialization (#98545)
Fixes https://github.com/pytorch/pytorch/issues/97507

Test command
`pytest test/distributed/test_c10d_common.py -vsk def test_init_process_group_for_all_backends`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98545
Approved by: https://github.com/malfet
2023-04-07 21:57:31 +00:00
Aaron Gokaslan
5471621497 [BE] Remove unnecessary dict comprehensions (#97116)
Removes unnecessary dict comprehensions, optimizing the creation of dicts from iterables.
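For illustration, the kind of rewrite this covers:

```python
pairs = [("a", 1), ("b", 2)]

# Before: an unnecessary dict comprehension over an iterable of pairs
d = {k: v for k, v in pairs}

# After: construct the dict directly from the iterable
d = dict(pairs)
```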

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97116
Approved by: https://github.com/kit1980
2023-03-20 00:56:57 +00:00
Howard Huang
02fa2291f7 Add support for custom backend (#95072)
Fixes https://github.com/pytorch/pytorch/issues/92344

A custom backend can be specified by passing in a string with format `"<device_type1>:<backend_name>,<device_type2>:<backend_name>"`, e.g. `"cpu:gloo,cuda:custom_backend"`.
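A sketch of such a call, assuming a backend named `custom_backend` was previously registered via `Backend.register_backend(..., devices=["cuda"])`:

```python
import torch.distributed as dist

dist.init_process_group(
    backend="cpu:gloo,cuda:custom_backend",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)
```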

Differential Revision: [D43630050](https://our.internmc.facebook.com/intern/diff/D43630050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95072
Approved by: https://github.com/kwen2501
2023-03-02 21:41:49 +00:00
Howard Huang
8b3e3f937d Update documentation init_process_group optional backend (#94543)
Update documentation for `init_process_group()` to mention the `backend` argument is optional.
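For example (a sketch; when `backend` is omitted, PyTorch creates the default backend(s) for the available devices: gloo for CPU, and nccl for CUDA when built with it):

```python
import torch.distributed as dist

# No explicit backend argument:
dist.init_process_group(init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)
```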

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94543
Approved by: https://github.com/kwen2501
2023-02-13 21:45:38 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592
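The typical rewrite of the `super()` call itself looks like this (illustrative):

```python
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        # Before the rewrite (Python 2 style):
        #     super(MyModule, self).__init__()
        # After the rewrite (Python 3 style):
        super().__init__()
```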

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Cases where the rewrite would change the semantics are kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
Aaron Gokaslan
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call.
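For illustration:

```python
b = [3, 1, 3, 2]

# Before: a useless generator expression inside the set() call
s = set(a for a in b)

# After: the constructor call alone is enough
s = set(b)
```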

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.
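For illustration, the two patterns being removed:

```python
# Before:
#     from __future__ import division, print_function
#     class Foo(object):
#         pass

# After: no future imports, no explicit inheritance from object.
class Foo:
    pass
```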

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
Howard Huang
2503a4a7c6 Fix MPI backend PG initialization (#92847)
Fixes #92573

Add test to check that all default backends can be initialized to prevent the above from regressing in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92847
Approved by: https://github.com/rohan-varma
2023-01-24 23:24:41 +00:00
Shen Li
0035340488 Allow DDP to handle custom dataclass forward outputs (#92334)
Differential Revision: [D42554973](https://our.internmc.facebook.com/intern/diff/D42554973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92334
Approved by: https://github.com/zhaojuanmao
2023-01-18 14:51:37 +00:00
Wanchao Liang
f30694c700 Add allgather_into_tensor to CommTensor (#90565)
This PR adds _all_gather_base_ to CommTensor to support allgather_base
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90565
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
Wanchao Liang
b782927ed4 Add reduce_scatter_tensor to CommTensor (#90564)
This PR adds reduce_scatter_base to the CommTensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90564
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
Wanchao Liang
3ba9e4cd55 Add alltoall_ to CommTensor (#90512)
This PR adds alltoall_ to the CommTensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90512
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
Howard Huang
80150788bc [21/N] Add alltoall_base custom op with CPU/CUDA implementations (#89813)
Differential Revision: [D41812670](https://our.internmc.facebook.com/intern/diff/D41812670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89813
Approved by: https://github.com/kwen2501
2022-12-08 23:39:26 +00:00
Masaki Kozuki
508916128d [ReduceOp] ameliorate custom __eq__ (#90088)
Improve the completeness of `ReduceOp.__eq__`.

A follow-up should support the equality operator with a `RedOpType` as the first argument and a `ReduceOp` as the second.

Fixes #90072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90088
Approved by: https://github.com/kwen2501
2022-12-06 05:13:50 +00:00
Masaki Kozuki
63e16216d8 [c10d] Implement __instancecheck__ for c10d::ReduceOp (#88275)
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests
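A sketch of what this enables (based on the summary above; exact behavior may vary across versions):

```python
import copy
import pickle
import torch.distributed as dist

op = dist.ReduceOp.SUM

# The custom metaclass makes isinstance checks behave as expected.
assert isinstance(op, dist.ReduceOp)

# copy / deepcopy / pickle round-trips are supported per this change.
assert copy.copy(op) == op
assert copy.deepcopy(op) == op
assert pickle.loads(pickle.dumps(op)) == op
```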

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- https://github.com/pybind/pybind11/issues/2696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88275
Approved by: https://github.com/wanchaol
2022-11-15 13:21:41 +00:00
Iris
68fd8f3706 [BE] [c10d][send] Improve error message on dist.send() with destination rank as itself (#89004)
This improves the error message on dist.send() and adds a corresponding test in test_c10d_common.py (https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_common.py).
Context in issue#83912: https://github.com/pytorch/pytorch/issues/83912
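A sketch of the failing call whose message this improves (assumes an initialized process group; the exact exception type and wording depend on the PyTorch version):

```python
import torch
import torch.distributed as dist

# Sending to one's own rank is invalid; the raised error now explains that
# the destination rank must differ from the current process's rank.
try:
    dist.send(torch.ones(4), dst=dist.get_rank())
except (ValueError, RuntimeError) as exc:
    print(exc)
```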

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89004
Approved by: https://github.com/H-Huang
2022-11-15 06:13:17 +00:00
Howard Huang
6e5f736d86 [15/N] Add allreduce_coalesced custom op with CPU/CUDA implementations (#88846)
Differential Revision: [D41227740](https://our.internmc.facebook.com/intern/diff/D41227740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88846
Approved by: https://github.com/kwen2501
2022-11-12 14:23:45 +00:00
Howard Huang
3a3500fa08 [13/N] Update gather with CPU/CUDA implementations (#86409)
Differential Revision: [D40181612](https://our.internmc.facebook.com/intern/diff/D40181612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86409
Approved by: https://github.com/kwen2501
2022-11-09 22:11:40 +00:00
Howard Huang
55df18e3da [12/N] Update scatter with CPU/CUDA implementations (#86408)
Differential Revision: [D40181613](https://our.internmc.facebook.com/intern/diff/D40181613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86408
Approved by: https://github.com/kwen2501
2022-11-09 18:40:25 +00:00
Howard Huang
81f74eed75 [11/N] Update all_to_all with CPU/CUDA implementations (#86407)
* #83916 [7/N] [Dispatchable Collectives] Update reduce with CPU / CUDA implementations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86407
Approved by: https://github.com/kwen2501
2022-11-01 17:54:13 +00:00
Howard Huang
bed8102741 [10/N] Update barrier with CPU/CUDA implementations (#86368)
### Changes
- Updates for the barrier collective
- NOTE: current change will not achieve dispatching of barrier since there is no tensor to read from

### Context
https://github.com/pytorch/pytorch/issues/86225

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86368
Approved by: https://github.com/kwen2501
2022-11-01 17:41:01 +00:00
Howard Huang
20d849b982 [9/N] [Dispatchable Collectives] Update reduce_scatter with CPU / CUDA implementations (#86166)
### Changes
- Updates for the reduce_scatter collective

### Context
https://github.com/pytorch/pytorch/issues/86225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86166
Approved by: https://github.com/kwen2501
2022-11-01 15:23:41 +00:00
Masaki Kozuki
aa8248cc9a Reenable isinstance with torch.distributed.ReduceOp (#87303)
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: https://github.com/pytorch/pytorch/issues/87191

cc @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87303
Approved by: https://github.com/wanchaol
2022-10-21 15:05:36 +00:00
Howard Huang
ad449b338f [8/N] [Dispatchable Collectives] Update allgather with CPU / CUDA implementations  (#84423)
### Changes
- Updates for the allgather collective

### Context
https://github.com/pytorch/pytorch/issues/86225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84423
Approved by: https://github.com/kwen2501
2022-10-10 17:18:48 +00:00
Howard Huang
8a1fc5d2f8 [7/N] [Dispatchable Collectives] Update reduce with CPU / CUDA implementations (#83916)
### Changes
- Updates for the reduce collective

### Context
https://github.com/pytorch/pytorch/issues/86225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83916
Approved by: https://github.com/kwen2501
2022-10-10 15:58:37 +00:00
Saliya Ekanayake
941d7a31f6 Pass group ranks and options to third party distributed backends (#73164)
Fixes #73163

PyTorch's [_new_process_group_helper()](9f541aa3ac/torch/distributed/distributed_c10d.py (L633)) does not pass the group's participating ranks to the backend.

This PR adds the above capability and also refactors some variables for better clarity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73164
Approved by: https://github.com/kumpera
2022-09-29 17:28:58 +00:00
Howard Huang
06e0583fb0 [4/N] [Dispatchable Collectives] Update all_reduce_ with CPU / CUDA implementations (#83810)
### About this PR
* Update the all_reduce op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
* Update test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have the process group select its CPU and CUDA backends respectively.

Differential Revision: [D39506979](https://our.internmc.facebook.com/intern/diff/D39506979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83810
Approved by: https://github.com/kwen2501
2022-09-28 08:48:32 +00:00
Howard Huang
ccac8d13d5 [3/N] [Dispatchable Collectives] Update broadcast_ with CPU and CUDA implementations (#83735)
### About this PR
* Update the broadcast op to dispatch to cpu and cuda implementations. Right now they both perform the same logic so this is essentially a no-op.
* Add test to validate that a separate device implementation is not supported.

### About this stack
In the future we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. The CPU and CUDA implementations will be updated to have the process group select its CPU and CUDA backends respectively.

Differential Revision: [D38876771](https://our.internmc.facebook.com/intern/diff/D38876771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83735
Approved by: https://github.com/kwen2501
2022-09-28 03:24:06 +00:00
Rodrigo Kumpera
7dcc723d35 [c10d] Ensure collectives are called with the same dtype for all tensor params. (#84664)
While passing tensors with different dtypes doesn't crash, it doesn't produce sensible results.

We see data tearing instead of casting.

It's not clear we want to support transparent casting so, for now, we fail when such input is presented.
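A sketch of the kind of mismatch that now fails fast (assumes an initialized process group):

```python
import torch
import torch.distributed as dist

world_size = dist.get_world_size()

inp = torch.ones(4, dtype=torch.float32)
out = [torch.empty(4, dtype=torch.float16) for _ in range(world_size)]

# Mixing a float32 input with float16 outputs previously produced garbage
# (data tearing); after this change the call raises a dtype-mismatch error.
dist.all_gather(out, inp)
```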

Fixes #84525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84664
Approved by: https://github.com/rohan-varma
2022-09-15 22:32:51 +00:00
Shen Li
1a81ab3ba5 Test tracing consecutive comms on the same input tensor (#84980)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84980
Approved by: https://github.com/wanchaol
2022-09-14 17:23:23 +00:00
Shen Li
8cbbd3a25f Avoid nested CommTensor wrapping (#84963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84963
Approved by: https://github.com/wanchaol
2022-09-14 01:22:45 +00:00
Shen Li
2211949513 Moving CommTensor from tests to private _spmd folder (#84719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84719
Approved by: https://github.com/wanchaol
2022-09-09 06:25:42 +00:00
PyTorch MergeBot
a6e6276c8b Revert "Moving CommTensor from tests to private _spmd folder (#84655)"
This reverts commit 07dad15583.

Reverted https://github.com/pytorch/pytorch/pull/84655 on behalf of https://github.com/kit1980 due to Several test failures on trunk 07dad15583, PR also had failures
2022-09-08 19:28:38 +00:00
Shen Li
07dad15583 Moving CommTensor from tests to private _spmd folder (#84655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84655
Approved by: https://github.com/wanchaol
2022-09-08 17:25:38 +00:00
Shen Li
89c4654ba9 Add scatter_ to CommTensor (#84606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84606
Approved by: https://github.com/wanchaol
2022-09-07 14:00:20 +00:00
Shen Li
f43c38bdc8 Add broadcast_ to CommTensor (#84604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84604
Approved by: https://github.com/wanchaol
2022-09-07 14:00:20 +00:00