Commit Graph

259 Commits

PyTorch MergeBot
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
Will Constable
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b038.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the
PG creation code. This is fine since it happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.

Queue notification is, somewhat oddly, done using a pipe; this is needed so Python can abort the thread on shutdown
and run it as a background thread, which is not possible with more conventional choices like a condvar.
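
A hypothetical usage sketch of the module described above (the callback signatures are not specified in this message, so the single event argument and its contents are assumptions):

```python
import torch.distributed.hooks as dist_hooks  # module added by this PR

def on_collective_start(event):
    # Assumed: event carries metadata about the collective (op, process group, ...).
    print(f"collective started: {event}")

def on_collective_end(event):
    print(f"collective finished: {event}")

def on_pg_created(event):
    # Fired inline during process group creation on the member ranks.
    print(f"process group created: {event}")

dist_hooks.register_collective_start_hook(on_collective_start)
dist_hooks.register_collective_end_hook(on_collective_end)
dist_hooks.register_process_group_hook(on_pg_created)
```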

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
Edward Z. Yang
de3ae93e9b Include rank of default PG in C++ log messages (#110623)
I tested by adding some warning logs in C++, running a distributed program, and checking that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging, so I couldn't easily add a unit test.

The implementation strategy is to set up a global variable in C++ and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.

This PR only works for non-glog logging. We probably need to come up with some other strategy for glog, e.g. a custom prefix, while making sure it doesn't conflict with fbcode. I can't easily test this from OSS, so I'll leave it as follow-up work.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-10 00:26:52 +00:00
Kazuaki Ishizaki
b5f9696d81 Fix typo under torch directory (#110824)
This PR fixes the typo `the the` in comments and exception messages in files under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824
Approved by: https://github.com/H-Huang
2023-10-09 19:16:43 +00:00
PyTorch MergeBot
ff0358b038 Revert "[C10] PG observability hooks. (#108815)"
This reverts commit 0c7a877745.

Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Adds a new torch.distributed.hooks namespace but does not document it; a test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))
2023-10-06 19:49:49 +00:00
Rodrigo Kumpera
0c7a877745 [C10] PG observability hooks. (#108815)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and is called inline from the
PG creation code. This is fine since it happens during initialization and only a limited number of times.

The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.

Queue notification is, somewhat oddly, done using a pipe; this is needed so Python can abort the thread on shutdown
and run it as a background thread, which is not possible with more conventional choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-10-06 18:52:46 +00:00
Howard Huang
0949d97c16 fix batch_isend_irecv example incorrect usage (#110408)
Mismatched dtypes silently lead to wrong outputs in NCCL:

```
1:recv_tensor=tensor([0., 0.], device='cuda:1')
0:recv_tensor=tensor([2.8026e-45, 0.0000e+00], device='cuda:0')
```
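
For reference, a minimal ring exchange along the lines of the documented `batch_isend_irecv` example, with the receive buffer created with an explicit, matching dtype (this sketch assumes one GPU per rank on a single node):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank)

# Both buffers use float32: NCCL does not check dtypes, so a mismatch
# would silently produce garbage as shown above.
send_tensor = torch.arange(2, dtype=torch.float32, device="cuda") + 2 * rank
recv_tensor = torch.zeros(2, dtype=torch.float32, device="cuda")

send_op = dist.P2POp(dist.isend, send_tensor, (rank + 1) % world_size)
recv_op = dist.P2POp(dist.irecv, recv_tensor, (rank - 1) % world_size)
for req in dist.batch_isend_irecv([send_op, recv_op]):
    req.wait()
print(f"{rank}: recv_tensor={recv_tensor}")
```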

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110408
Approved by: https://github.com/awgu, https://github.com/Neilblaze
2023-10-04 22:57:03 +00:00
Rohan Varma
40be6b72e1 [ez] Type function in distributed_c10d (#110435)
This function returns a `torch.device`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110435
Approved by: https://github.com/awgu
2023-10-03 17:54:04 +00:00
Rodrigo Kumpera
c26270c733 [C10D] Even more store scalability work. (#109218)
Fix a bug in socket.cpp's timeout detection that only shows up with 10k ranks.

Make the minimum wait time in _store_based_barrier adaptive based on the number of ranks.

Longer timeouts give more room for the store to do productive work when swamped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217
2023-09-22 21:27:09 +00:00
Howard Huang
600d0d0284 Add "cuda" to MPI backend capabilities (#109614)
Summary: Fixes https://github.com/pytorch/pytorch/issues/109543

Test Plan: We need to run CUDA-aware MPI in PyTorch to actually test this change; we currently have no MPI tests.

Differential Revision: D49420438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109614
Approved by: https://github.com/XilunWu
2023-09-21 13:34:58 +00:00
Rodrigo Kumpera
881bfbf21d [c10d] Add tests for using libuv through init_process_group. (#108661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108661
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-09-20 16:02:20 +00:00
Rodrigo Kumpera
2bca5f2af7 [C10D] Track pg name in c++. (#108813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108813
Approved by: https://github.com/wconstab
2023-09-15 01:10:29 +00:00
Brian Vaughan
bb14805bcd fix an incorrect indent in documentation (#108273)
The doc for `torch.distributed.send(tensor, dst, group=None, tag=0)` was rendering incorrectly at https://pytorch.org/docs/stable/distributed.html due to a missing indent (the continuation line was interpreted as a new argument).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108273
Approved by: https://github.com/awgu, https://github.com/kit1980
2023-09-11 21:27:52 +00:00
Pritam Damania
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've added these error types (a usage sketch follows the list):

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
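
A sketch of the error handling these types enable, replacing the string matching above (assuming a process group initialization as the operation being guarded):

```python
import torch.distributed as dist

try:
    dist.init_process_group(backend="nccl")
except dist.DistStoreError:
    ...  # store-side failure, e.g. store-based barrier timeout
except dist.DistNetworkError:
    ...  # socket-level failure: broken pipe, connection reset, client timeout
except dist.DistBackendError:
    ...  # backend (e.g. NCCL) failure
except dist.DistError:
    ...  # any other distributed error
```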

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
PyTorch MergeBot
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing an inductor test in trunk for one of its models, moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
Pritam Damania
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
wz337
264df88a2d [C10D][Logger]Add more info to c10d logger (#107331)
This PR adds pg_name and world_size to c10d logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107331
Approved by: https://github.com/kumpera
2023-08-28 15:10:56 +00:00
Codle
42738c56a0 Skip the extra copy operation in broadcast_object_list if tensor_list has only one element (#107509)
The `broadcast_object_list` function makes it easy to broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` temporarily consumes an additional, equal amount of memory, which means only objects occupying at most half of the device memory can be broadcast. This PR improves usability by skipping the `torch.cat` operation when the object list has only a single element.

Before (30G tensor):
<img width="607" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/c0c67931-0851-4f27-81c1-0119c6cd2944">

After (46G tensor):
<img width="600" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/90cd1536-be7c-43f4-82ef-257234afcfa5">

Test Code:
```python
import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4)

    if dist.get_rank() == 0:
        state_dict = {"fake_tensor": fake_tensor}
    else:
        state_dict = {}
    object_list = [state_dict]
    dist.broadcast_object_list(object_list, src=0)
    print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys())
    dist.barrier()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509
Approved by: https://github.com/awgu
2023-08-23 17:19:10 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so I'm enabling it to keep it that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was probably accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, there seem to be no instances of it in our codebase, so I'm enabling it to keep it that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
Rodrigo Kumpera
bbf03561a9 [functional collectives] Move back to registering finalizers on wrappers. (#107250)
We cannot use inner tensors for finalizers as they are not collected until waited on.

This PR adds a bunch of tests for the observable behavior we want, including the
necessary scaffolding for us to test whether code waits on them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
2023-08-17 21:08:28 +00:00
Shen Li
45128ab67c [Reland] Add OnCompletion Hook to ProcessGroup (#106988) (#107233)
This allows infra/trainers to get detailed stats about communication
efficiency without knowing anything about what model or distributed
training paradigm is being used. This is helpful as the infra/trainer
package usually prefers to be as model/algorithm-agnostic as possible.
Therefore, we cannot assume that infra/trainers have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233
Approved by: https://github.com/kumpera
2023-08-15 17:35:14 +00:00
PyTorch MergeBot
fd214aa8be Revert "Add OnCompletion Hook to ProcessGroup (#106988)"
This reverts commit ba1da47e8f.

Reverted https://github.com/pytorch/pytorch/pull/106988 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing the Windows build with some linker error. The Windows failures on the PR look legit ([comment](https://github.com/pytorch/pytorch/pull/106988#issuecomment-1678580899))
2023-08-15 08:24:33 +00:00
Shen Li
ba1da47e8f Add OnCompletion Hook to ProcessGroup (#106988)
This allows infra/trainers to get detailed stats about communication
efficiency without knowing anything about what model or distributed
training paradigm is being used. This is helpful as the infra/trainer
package usually prefers to be as model/algorithm-agnostic as possible.
Therefore, we cannot assume that infra/trainers have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988
Approved by: https://github.com/kumpera, https://github.com/H-Huang
ghstack dependencies: #107140, #107141, #107160
2023-08-15 04:32:23 +00:00
Bruce Jiang
2624da638d Support third-party devices to use the init_process_group method without specifying the Backend (#107113)

When init_process_group has not been called beforehand, DeviceMesh automatically calls init_process_group without specifying the backend. Thus, when a third-party device wants to use DeviceMesh without calling init_process_group first, a problem arises. This PR adds a default_device_backend_map: third-party device users add their backend to this map when they first register their backend with PyTorch. When init_process_group is called without the backend parameter, it initializes the backends in this map, so a third-party user can call init_process_group without specifying the Backend.
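
A hedged registration sketch under the scheme described above; the backend name "dummy", the creator function, and the "privateuse1" device type are placeholders for a real third-party extension, and this assumes the `Backend.register_backend` overload that accepts a `devices` argument:

```python
import torch.distributed as dist

def create_dummy_pg(store, rank, world_size, timeout):
    # Provided by the third-party extension in practice.
    raise NotImplementedError

# Registering the backend together with its device type populates the
# default device/backend map, so a later init_process_group() call with
# no explicit backend can also initialize this backend.
dist.Backend.register_backend("dummy", create_dummy_pg, devices=["privateuse1"])
```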

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113
Approved by: https://github.com/wanchaol
2023-08-15 03:46:07 +00:00
Jirka
858b465d74 fix str splits in single line (#106005)
Simple formatting improvement and two spelling fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106005
Approved by: https://github.com/H-Huang
2023-08-14 23:07:38 +00:00
Michael Voznesensky
42660015b4 [Dynamo x FSDP][2/x] Small changes to distributed to make it dynamo friendly (#106886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106886
Approved by: https://github.com/awgu, https://github.com/wconstab
ghstack dependencies: #106884
2023-08-11 22:35:50 +00:00
Louis Feng
3a01c056f5 [PyTorch][ET] Collect Process Groups Mapping Info (#104373)
Summary: Add the logic and interface to log ProcessGroup comms configuration (unique ID, type, and ranks info).

Test Plan:
Testing in HPC:
```
TORCH_LOGS=all ../buck-out/v2/gen/fbcode/c8344b52091f4f7f/hpc/models/ads/__ads_10x_launcher__/ads_10x_launcher.par  +launcher=local launcher.num_trainers=4 +data_loader=random data_loader.num_batches=2000
```
Example output in ET:
```
    {
      "name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": ["[{'pg_id': 140538064364672, 'backend_id': 140538060772480, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}, {'pg_id': 140538064363904, 'backend_id': 140538042628864, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}]"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Differential Revision: D46321690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104373
Approved by: https://github.com/kwen2501
2023-07-25 03:34:53 +00:00
Howard Huang
0ab74044c2 [BE] remove deprecated attributes from distributed_c10d (#105753)
Removing these attributes as they were introduced 5 years ago, before PyTorch 1.0. `Backend` is the only supported usage now.

Differential Revision: [D47683717](https://our.internmc.facebook.com/intern/diff/D47683717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105753
Approved by: https://github.com/rohan-varma
2023-07-24 16:35:08 +00:00
Justin Chu
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
Ke Wen
22e8a61d9b Implement coalesced reduce_scatter_tensor (#103561)
Counterpart of #101157 (coalesced `all_gather_into_tensor`), for `reduce_scatter_tensor`.

This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax:

Sync communication style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
```

Async communication style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the reduce-scatters' results
```
Each `reduce_scatter_tensor` call can be independent in terms of its data and buffer locations, but the calls can be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561
Approved by: https://github.com/fegin
2023-06-15 20:11:12 +00:00
zhuhong61
50c972bfd2 [c10d] Add xpu to the default device supported by user specified backend (#103410)
**Motivation:**
For collective dispatching, we want to provide a more user-friendly mapping between the xpu device and the CCL backend (a user-specified backend).

**Solution:**
We add xpu to the default device list so that the mapping between xpu and the user-specified backend is constructed directly.
Usage:
When using an xpu device, the user can specify the backend name only:
`dist.init_process_group(backend='ccl')`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103410
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-06-12 19:46:33 +00:00
Ke Wen
07104ca99c [c10d] Make it default that PG do not perform barrier after init (#103033)
Both internal and OSS users trying https://github.com/pytorch/pytorch/pull/99937 report that their workloads perform normally even with the barrier removed and see a scalability win. Thus, in this PR, we make it the default that PGs do not perform a barrier after init.

In the discussion of #99937, people pointed out that such a barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC's or the RPC user's responsibility to deal with; it can arise with other functions/libraries too, so the need for c10d to do so big a favor is not justified. It is also good to remove the barrier before users become reliant on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103033
Approved by: https://github.com/XilunWu
2023-06-07 06:11:14 +00:00
Ashwin Hari
cf0aa38005 Allow ORT backend for DTensor (#101914)
fixes #101911

Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend.

* `Backend.NAME`  attribute now has value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl.
* remove unused `_check_for_nccl_backend` function
* add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914
Approved by: https://github.com/wanchaol
2023-06-01 22:37:09 +00:00
shaoyf42
8d7e082300 [c10d] Add is_backend_available for c10d backend. (#101945)
Add is_backend_available for c10d backends, covering both the built-in backends and third-party backends registered through the function ``Backend.register_backend``.

There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553
> For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101
to also add their own is_available property

It is a natural choice for users to add their own `is_available` when they create a backend. We think it might be possible for users to use `is_X_available` in the same way as the native checks, for example by dynamically adding a `torch.distributed.is_dummy_available()` function. This is why we want to dynamically add `is_X_available` to `torch.distributed` in `register_backend`.

> Or we could add an Is_available(backend) function, that checks for the backend.

Providing a public function is indeed another good approach. We have implemented an `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945  that supports both built-in backends and third-party backends.
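
A short usage sketch of the new helper (the "dummy" backend name is hypothetical):

```python
import torch.distributed as dist

print(dist.is_backend_available("gloo"))  # True if gloo was compiled in
print(dist.is_backend_available("nccl"))  # True only on builds with NCCL
# Also works for a third-party backend registered via Backend.register_backend:
# dist.is_backend_available("dummy")
```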

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945
Approved by: https://github.com/H-Huang
2023-05-31 22:51:51 +00:00
Wanchao Liang
3ef4d697df [c10d] default backend need to check for nccl availability (#102470)
As titled, we can only initialize the nccl backend when NCCL is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102470
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2023-05-30 19:22:37 +00:00
Wanchao Liang
7b47cd0a6c [c10d] add fake pg necessary collectives (#102238)
This PR adds the collectives necessary for the fake PG to enable an e2e FSDP run without multiprocessing or multithreading.
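
A minimal single-process sketch of using the fake PG; `FakeStore` lives in an internal testing module, so the import path below follows the test suite and may change between releases:

```python
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

# One process pretends to be rank 0 of a 2-rank job; collectives complete
# immediately without any real communication.
dist.init_process_group(backend="fake", rank=0, world_size=2, store=FakeStore())
```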
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102238
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
Wanchao Liang
9a19262556 [c10d] consolidate barrier after init logic (#102237)
This PR consolidates the barrier-after-init logic to allow a custom
backend to set the env var when creating the PG, so that
`init_process_group` skips the barrier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102237
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
Edward Z. Yang
c903b12cb8 Add fake process group (#102180)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102180
Approved by: https://github.com/wanchaol
2023-05-24 23:27:40 +00:00
Iris
ee95e37a69 [c10d] Record time spent for init_process_group, new_group, _store_based_barrier (#101912)
1. Record time spent for init_process_group, new_group, _store_based_barrier
2. Rename c10d_error_logger to c10d_logger for generalization.
3. Refactor to move the logger wrappers from distributed_c10d.py to c10d_logger.py.
4. Rename the logger wrappers (BC-breaking): exception_handler is renamed to exception_logger to avoid confusion with logging handlers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912
Approved by: https://github.com/fduwjj
2023-05-24 09:36:34 +00:00
Aaron Gokaslan
3e2ea32dab [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
Removes useless try statements and unreachable code.
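
For context, the kind of pattern TRY302 flags (an illustrative example, not code from this repo):

```python
def load(path):
    try:
        return open(path).read()
    except OSError:
        raise  # handler only re-raises -> flagged by TRY302

# The fix is simply to drop the useless try/except:
def load_fixed(path):
    return open(path).read()
```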
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874
Approved by: https://github.com/malfet
2023-05-19 17:30:52 +00:00
shaoyf42
97180aca5e Enables barrier to support the specified device (#99589)
Enables barrier to support a specified device, e.g. cuda or a custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919

Today, there are two limitations of barrier:
One is that barrier does not support a custom device:
fbdb86c174/torch/csrc/distributed/c10d/ProcessGroup.hpp (L512-L522)

The second is that there is a special validation for nccl when device_id is not None, which assumes cuda/nccl bindings and also hinders custom devices.
789070986c/torch/distributed/distributed_c10d.py (L3504-L3508)
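
For the existing CUDA/NCCL path, the device can already be hinted from Python via device_ids; a hedged sketch (how a custom device is passed after this change may differ):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Hint the device the barrier should run on.
dist.barrier(device_ids=[local_rank])
```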

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589
Approved by: https://github.com/kwen2501
2023-05-17 05:26:04 +00:00
Ke Wen
daed3bf8f9 Implement coalesced all_gather_into_tensor (#101157)
This PR adds support for the following use cases:
- Sync style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the all-gathers' results
```
Each `all_gather_into_tensor` call can be independent in terms of its data and buffer location, but the calls can be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
2023-05-11 20:58:47 +00:00
Ke Wen
0848ed21b8 [c10d] Figure out device to use for object collectives (#100954)
Fixes https://github.com/pytorch/pytorch/issues/97938

This PR is cloned from https://github.com/pytorch/pytorch/pull/100238, which is important to me, but @kwen2501 has not resolved the conflict, so this PR is submitted to resolve it. The only conflict is at `distributed_c10d.py:2653`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954
Approved by: https://github.com/kwen2501
2023-05-11 01:49:09 +00:00
Rodrigo Kumpera
a204f7f518 [c10d] Fix subprocess group handling in scatter_object_list. (#100552)
scatter_object_list assumed src was a group rank while all collectives use global ranks.
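
A hedged sketch of the corrected convention, assuming a 4-rank gloo job with a subgroup of global ranks 2 and 3:

```python
import torch.distributed as dist

dist.init_process_group("gloo")
subgroup = dist.new_group(ranks=[2, 3])

if dist.get_rank() in (2, 3):
    output = [None]
    # Only the source rank needs to provide the input objects.
    objects = ["for rank 2", "for rank 3"] if dist.get_rank() == 2 else None
    # src=2 is the *global* rank of the source, consistent with other collectives.
    dist.scatter_object_list(output, objects, src=2, group=subgroup)
```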
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100552
Approved by: https://github.com/fduwjj
2023-05-04 10:04:21 +00:00
Xiaodong Wang
c29ab84115 Fix bug in process_group_name when there is duplicate pgs (#100518)
Summary: With the new c10d API, we don't need all ranks to call new_group. Integrate with the new API so that every rank calls new_group just 3 times, with a local barrier among the members of the group.

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518
Approved by: https://github.com/kumpera
2023-05-04 02:12:28 +00:00
Animesh Jain
5fbb40669f [dynamo][moco] Disallow_in_graph distributed APIs (#100071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100071
Approved by: https://github.com/jansel, https://github.com/H-Huang
2023-05-02 20:09:25 +00:00
Ke Wen
ae0eb2342d [Experimental] Remove store barrier after PG init (#99937)
Store-based barrier is not scalable.
Experimenting to see if removing it breaks any CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99937
Approved by: https://github.com/kumpera, https://github.com/H-Huang
2023-04-27 17:23:10 +00:00
Rodrigo Kumpera
ad21890f8f [c10d] Scalable PG initiation. (#99931)
Add use_local_synchronization argument to new_group.

When this argument is True, new_group does a store barrier only on the ranks that are part of the group, not the whole cluster.

This addresses both the scalability and composability problems associated with new_group.

Fixes #81291.

This is relanding #84224
As part of the original PR, I did a quick benchmark of creating 3 PGs per rank with both settings; perf is as follows:

new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sub-linear because the number of process groups created, as a multiple of world_size, decreases as we go up: it's 6 with world_size 4 and 192 with world_size 128.

Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.

Setup:

1 AWS host, backend gloo.
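
A usage sketch of the new argument; only the member ranks call new_group, and only they synchronize during creation:

```python
import torch.distributed as dist

dist.init_process_group("gloo")

# Non-member ranks do not participate in (or wait on) the group creation.
if dist.get_rank() in (0, 1):
    pg = dist.new_group(ranks=[0, 1], use_local_synchronization=True)
```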

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
2023-04-27 13:44:02 +00:00