Commit Graph

430 Commits

Author SHA1 Message Date
Howard Huang
0ab74044c2 [BE] remove deprecated attributes from distributed_c10d (#105753)
Removing these attributes, as they were introduced 5 years ago, before PyTorch 1.0. `Backend` is the only supported usage now.

Differential Revision: [D47683717](https://our.internmc.facebook.com/intern/diff/D47683717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105753
Approved by: https://github.com/rohan-varma
2023-07-24 16:35:08 +00:00
Justin Chu
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
Ke Wen
22e8a61d9b Implement coalesced reduce_scatter_tensor (#103561)
Counterpart of #101157.

This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax:

Sync communication style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
```

Async communication style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the reduce-scatters' results
```
Each `reduce_scatter_tensor` call is independent in terms of its data and buffer locations, but the calls can be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561
Approved by: https://github.com/fegin
2023-06-15 20:11:12 +00:00
zhuhong61
50c972bfd2 [c10d] Add xpu to the default device supported by user specified backend (#103410)
**Motivation:**
For collective dispatching, we want to provide a more user-friendly way to map the xpu device to the CCL backend (a user-specified backend).

**Solution:**
We add xpu to the default device list, so the mapping between xpu and the user-specified backend can be constructed directly.
Usage:
When using an xpu device, the user can specify only the backend name:
`dist.init_process_group(backend='ccl')`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103410
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-06-12 19:46:33 +00:00
Ke Wen
07104ca99c [c10d] Make it default that PG do not perform barrier after init (#103033)
Both internal and OSS users trying https://github.com/pytorch/pytorch/pull/99937 report that their workloads perform normally even with the barrier removed, and they see a scalability win. Thus, in this PR, we make it the default that PGs do not perform a barrier after init.

In the discussion of #99937, people pointed out that such a barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be the responsibility of RPC or the RPC user to deal with; the same situation can arise with other functions/libraries, so the need for c10d to do so big a favor is not justified IMO. It is also good to remove the barrier before users become reliant on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103033
Approved by: https://github.com/XilunWu
2023-06-07 06:11:14 +00:00
Ashwin Hari
cf0aa38005 Allow ORT backend for DTensor (#101914)
fixes #101911

Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend.

* `Backend.NAME`  attribute now has value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl.
* remove unused `_check_for_nccl_backend` function
* add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914
Approved by: https://github.com/wanchaol
2023-06-01 22:37:09 +00:00
shaoyf42
8d7e082300 [c10d] Add is_backend_available for c10d backend. (#101945)
Add `is_backend_available` for c10d backends, covering both the built-in backends and third-party backends registered through ``Backend.register_backend``.

There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553
> For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101
to also add their own is_available property

It is a natural choice for users to add their own `is_available` when they create a backend. One possible approach is to let users call `is_X_available` in the same way as the native ones, for example by dynamically adding a `torch.distributed.is_dummy_available()` function. This is why we want to dynamically add `is_X_available` to `torch.distributed` in `register_backend`.

> Or we could add an Is_available(backend) function, that checks for the backend.

Providing a public function is indeed another good approach. We have implemented an `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945  that supports both built-in backends and third-party backends.
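
A minimal usage sketch of the new API (assuming it is exposed as `torch.distributed.is_backend_available`, as in this PR):
```python
import torch.distributed as dist

# Works for built-in backends as well as third-party backends registered
# via Backend.register_backend.
print(dist.is_backend_available("gloo"))  # True on typical builds
print(dist.is_backend_available("nccl"))  # depends on whether NCCL was built in
```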

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945
Approved by: https://github.com/H-Huang
2023-05-31 22:51:51 +00:00
Wanchao Liang
3ef4d697df [c10d] default backend need to check for nccl availability (#102470)
As titled: we can only initialize the nccl backend when NCCL is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102470
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2023-05-30 19:22:37 +00:00
Wanchao Liang
7b47cd0a6c [c10d] add fake pg necessary collectives (#102238)
This PR adds the collectives necessary for the fake process group, enabling an end-to-end FSDP run
without multiprocessing or multithreading.
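
A hedged sketch of what this enables; the `FakeStore` helper and module path below follow PyTorch's internal test utilities and may change:
```python
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore  # internal test helper

# A single process pretends to be rank 0 of a 2-rank job; the fake backend's
# collectives return locally, which is enough to exercise FSDP wrapping logic.
store = FakeStore()
dist.init_process_group(backend="fake", rank=0, world_size=2, store=store)
```
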
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102238
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
Wanchao Liang
9a19262556 [c10d] consolidate barrier after init logic (#102237)
This PR consolidates the barrier-after-init logic to allow a custom
backend to set the env var when creating the PG, so that
`init_process_group` skips the barrier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102237
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
Edward Z. Yang
c903b12cb8 Add fake process group (#102180)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102180
Approved by: https://github.com/wanchaol
2023-05-24 23:27:40 +00:00
Iris
ee95e37a69 [c10d] Record time spent for init_process_group, new_group, _store_based_barrier (#101912)
1. Record time spent for init_process_group, new_group, _store_based_barrier
2. Rename c10d_error_logger to c10d_logger for generalization.
3. Refactor to move the logger wrappers in distributed_c10d.py to c10d_logger.py.
4. Rename the logger wrappers (BC breaking): exception_handler is renamed to exception_logger to avoid confusion with the logging handler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912
Approved by: https://github.com/fduwjj
2023-05-24 09:36:34 +00:00
Aaron Gokaslan
3e2ea32dab [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
Removes useless try statements and unreachable code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874
Approved by: https://github.com/malfet
2023-05-19 17:30:52 +00:00
shaoyf42
97180aca5e Enables barrier to support the specified device (#99589)
Enables barrier to support a specified device, e.g. cuda or a custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919

Today, there are two limitations of barrier:
One is that barrier does not support a custom device:
fbdb86c174/torch/csrc/distributed/c10d/ProcessGroup.hpp (L512-L522)

The second is that there is a special check for nccl when device_id is not None, which assumes cuda and nccl bindings and also hinders custom devices.
789070986c/torch/distributed/distributed_c10d.py (L3504-L3508)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589
Approved by: https://github.com/kwen2501
2023-05-17 05:26:04 +00:00
Ke Wen
daed3bf8f9 Implement coalesced all_gather_into_tensor (#101157)
This PR adds support for the following use cases:
- Sync style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the all-gathers' results
```
Each `all_gather_into_tensor` call is independent in terms of its data and buffer location, but the calls can be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
2023-05-11 20:58:47 +00:00
Ke Wen
0848ed21b8 [c10d] Figure out device to use for object collectives (#100954)
Fixes https://github.com/pytorch/pytorch/issues/97938

This PR is cloned from https://github.com/pytorch/pytorch/pull/100238, which is important to me, but
@kwen2501 has not resolved the conflict. So, this PR is submitted to resolve the conflict.
The only conflict is in `distributed_c10d.py:2653`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954
Approved by: https://github.com/kwen2501
2023-05-11 01:49:09 +00:00
Rodrigo Kumpera
a204f7f518 [c10d] Fix subprocess group handling in scatter_object_list. (#100552)
scatter_object_list assumed src was a group rank while all collectives use global ranks.
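
A hedged sketch of the corrected semantics (the subgroup ranks here are illustrative):
```python
import torch.distributed as dist

subgroup = dist.new_group(ranks=[2, 3])  # hypothetical subgroup
out = [None]
objs = [{"a": 1}, {"b": 2}] if dist.get_rank() == 2 else None

if dist.get_rank() in (2, 3):
    # src is the *global* rank (2), not the rank within the subgroup (0)
    dist.scatter_object_list(out, objs, src=2, group=subgroup)
```
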
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100552
Approved by: https://github.com/fduwjj
2023-05-04 10:04:21 +00:00
Xiaodong Wang
c29ab84115 Fix bug in process_group_name when there is duplicate pgs (#100518)
Summary: With the new c10d API, we don't need all ranks to call new_group. Integrate with the new API so that every rank just calls new_group 3 times, with a local barrier among the members within the group.

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518
Approved by: https://github.com/kumpera
2023-05-04 02:12:28 +00:00
Animesh Jain
5fbb40669f [dynamo][moco] Disallow_in_graph distributed APIs (#100071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100071
Approved by: https://github.com/jansel, https://github.com/H-Huang
2023-05-02 20:09:25 +00:00
Ke Wen
ae0eb2342d [Experimental] Remove store barrier after PG init (#99937)
The store-based barrier is not scalable.
Experimenting to see if removing it breaks any CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99937
Approved by: https://github.com/kumpera, https://github.com/H-Huang
2023-04-27 17:23:10 +00:00
Rodrigo Kumpera
ad21890f8f [c10d] Scalable PG initiation. (#99931)
Add use_local_synchronization argument to new_group.

When this argument is True, new_group performs a store barrier only on the ranks that are part of the group, not on the whole cluster.

This addresses both the scalability and composability problems associated with new_group.
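
A hedged usage sketch of the new argument, assuming only the member ranks participate in the call:
```python
import torch.distributed as dist

# Only the member ranks synchronize through the store; with
# use_local_synchronization=True, non-member ranks do not take part
# in this new_group call.
if dist.get_rank() in (0, 1):
    pg = dist.new_group(ranks=[0, 1], use_local_synchronization=True)
```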

Fixes #81291.

This is relanding #84224
As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both modes, and perf is the following:

new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sub-linear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.

Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.

Setup:

1 AWS host, backend gloo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
2023-04-27 13:44:02 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.jit allow simple generator expressions, which lets us enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280, but I split it off into this PR so that it can be easily reverted should anything break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Ke Wen
3a09aa5977 [c10d] Faster coalescing (#98793)
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.

By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
cm.wait()
```
In the previous implementation, each collective in the `num_coll` loop actually called into the C++ backend, accumulating pybind overhead.

In the new implementation, we capture the collectives at the Python level and only fire into C++ at the exit of the coalescing manager.

### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us

Hence a 4x reduction in CPU overhead (dependent on `num_coll`).

Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
2023-04-24 21:27:26 +00:00
medivh-xp
39590d06c5 Make new_subgroups available for non-CUDA-dependent backends (#99706)
`new_subgroups` allows easy creation of sub-communication groups, but it currently requires CUDA availability. For communication that does not rely on CUDA, such as CPU-based gloo or custom communication backends, I would still like to be able to use it, for example with CPU-based gloo (the same applies when using a custom backend):
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def gloo_process(rank_id, world_size, group_size, mp_lock):
    assert not torch.cuda.is_available()
    def lock_print(*args, **kwargs):
        with mp_lock:
            print(*args, **kwargs, flush=True)

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank_id, world_size=world_size)

    subgroup, _ = dist.new_subgroups(group_size)
    subgroup_ranks = list(range(subgroup.rank() * group_size, (subgroup.rank() + 1) * group_size))
    lock_print(f"Rank {rank_id} initialized in subgroup_{subgroup.rank()}: {subgroup_ranks}")

    tensor = torch.Tensor([rank_id + 1])
    subgroup.broadcast(tensor, root=0)

    lock_print(f"After broadcast, rank {rank_id} in subgroup_{subgroup.rank()}:{subgroup_ranks} got {tensor}")

if __name__ == "__main__":
    world_size = 4
    group_size = 2
    processes = []
    mp.set_start_method("spawn")
    mp_lock = mp.Lock()
    for rank in range(world_size):
        p = mp.Process(target=gloo_process, args=(rank, world_size, group_size, mp_lock))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
```

```bash
Rank 0 assigned to subgroup_0: [0, 1]
Rank 1 assigned to subgroup_1: [2, 3]
Rank 2 assigned to subgroup_0: [0, 1]
Rank 3 assigned to subgroup_1: [2, 3]
After broadcast, rank 2 in subgroup_0:[0, 1] got tensor([3.])
After broadcast, rank 3 in subgroup_1:[2, 3] got tensor([3.])
After broadcast, rank 1 in subgroup_1:[2, 3] got tensor([1.])
After broadcast, rank 0 in subgroup_0:[0, 1] got tensor([1.])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99706
Approved by: https://github.com/kumpera
2023-04-24 18:22:59 +00:00
PyTorch MergeBot
9861ec9785 Revert "[c10d] Faster coalescing (#98793)"
This reverts commit db456ab83d.

Reverted https://github.com/pytorch/pytorch/pull/98793 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-21 09:15:04 +00:00
Ke Wen
db456ab83d [c10d] Faster coalescing (#98793)
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.

By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
cm.wait()
```
In the previous implementation, each collective in the `num_coll` loop actually called into the C++ backend, accumulating pybind overhead.

In the new implementation, we capture the collectives at the Python level and only fire into C++ at the exit of the coalescing manager.

### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us

Hence a 4x reduction in CPU overhead (dependent on `num_coll`).

Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
2023-04-19 20:17:58 +00:00
Howard Huang
760967a284 Update _store_based_barrier implementation to reduce load on rank 0 (#98000)
Summary:

Update from using add(), which overloads rank 0 with requests, to a single request every 10 seconds to handle the last-joined worker.
Added an optional logging_interval arg to _store_based_barrier.

Test Plan:
```
pytest test/distributed/test_c10d_common.py -vsk test_store_based_barrier
```

Reviewed By: rohan-varma

Differential Revision: D44430531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98000
Approved by: https://github.com/kumpera
2023-04-11 14:25:29 +00:00
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Howard Huang
61c74ab0f8 Fix MPI rank and world size pg initialization (#98545)
Fixes https://github.com/pytorch/pytorch/issues/97507

Test command
`pytest test/distributed/test_c10d_common.py -vsk def test_init_process_group_for_all_backends`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98545
Approved by: https://github.com/malfet
2023-04-07 21:57:31 +00:00
Rohan Varma
8a29afe98a [RFC] Add warning about object-based collectives for GPU tensors to docs. (#97702)
Using GPU tensors in these collectives has caused SEVs, user
confusion, and slowness in the past. These APIs were only designed to
communicate arbitrary Python objects; GPU tensors should either be copied
to CPU first or use the regular collectives. Add a warning indicating so.
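
A hedged sketch of the recommended pattern (keep object-collective payloads on CPU, and use regular tensor collectives for GPU data):
```python
import torch
import torch.distributed as dist

# Object collectives pickle their inputs; keep the payload a plain CPU-side object.
obj = {"step": 10, "loss": 0.25}
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, obj)

# For GPU tensors, prefer the regular tensor collectives instead.
gpu_tensor = torch.ones(4, device="cuda")
dist.all_reduce(gpu_tensor)
```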

Differential Revision: [D44435849](https://our.internmc.facebook.com/intern/diff/D44435849/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97702
Approved by: https://github.com/kumpera
2023-04-06 23:47:35 +00:00
Howard Huang
3b6e94cb8c [small] replace with .format() with f-strings (#98514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98514
Approved by: https://github.com/awgu
2023-04-06 18:58:56 +00:00
Kazuaki Ishizaki
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under the `torch/distributed` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
Edward Z. Yang
5df59f957f Fix G001,G002,G003 in logs to % syntax (#97812)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97812
Approved by: https://github.com/Skylion007, https://github.com/kiukchung, https://github.com/malfet, https://github.com/mlazos
2023-04-01 01:43:33 +00:00
Kazuaki Ishizaki
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
Howard Huang
ac7329b323 Add exceptionhandler to more distributed_c10d APIs (#96770)
Summary: Adding exception handler to a few more APIs so that internal errors are logged to the c10d errors scuba table

Test Plan: sandcastle

Differential Revision: D44068557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96770
Approved by: https://github.com/wz337
2023-03-15 20:31:46 +00:00
Howard Huang
02fa2291f7 Add support for custom backend (#95072)
Fixes https://github.com/pytorch/pytorch/issues/92344

A custom backend can be specified by passing in a string with format `"<device_type1>:<backend_name>,<device_type2>:<backend_name>"`, e.g. `"cpu:gloo,cuda:custom_backend"`.
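
A minimal sketch of the syntax, using built-in backends in place of a custom one:
```python
import torch.distributed as dist

# gloo handles CPU tensors, nccl handles CUDA tensors; a third-party backend
# registered via Backend.register_backend could replace either entry.
dist.init_process_group(backend="cpu:gloo,cuda:nccl")
```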

Differential Revision: [D43630050](https://our.internmc.facebook.com/intern/diff/D43630050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95072
Approved by: https://github.com/kwen2501
2023-03-02 21:41:49 +00:00
Howard Huang
c0fa0669f6 Update isend/irecv warning messages for nccl (#95236)
Summary: The nccl backend does not support `tag`, as mentioned in https://github.com/pytorch/pytorch/issues/94819. Adding a note to the documentation about it.

Example:

<img width="888" alt="image" src="https://user-images.githubusercontent.com/14858254/220464900-094c8063-797a-4bdc-8e25-657f17593fe9.png">

Differential Revision: D43475756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95236
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-02-22 22:00:13 +00:00
Rodrigo Kumpera
641cb4243c Fix c10d regression during cleanup. (#94988)
This fixes a regression introduced earlier today with a change to c10d global state.

It must be cleaned up in destroy_process_group, or the root PG and its Store will stay alive.

Fixes regression in test_c10d_nccl.py :: RendezvousEnvTest.test_common_errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94988
Approved by: https://github.com/H-Huang, https://github.com/wanchaol, https://github.com/malfet
2023-02-16 19:12:00 +00:00
Rodrigo Kumpera
e22d791287 [PTD] Introduce tracing friendly collectives. (#93990)
This change adds torch.distributed.traceable_collectives.

This experimental API enables collectives to be fully traced by dynamo and FX.

See #93173 for the RFC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93990
Approved by: https://github.com/wconstab, https://github.com/wanchaol, https://github.com/H-Huang
2023-02-16 15:35:01 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility libraries [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future), as well as `torch._six`. We only support Python 3.8+ now; it's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Howard Huang
8b3e3f937d Update documentation init_process_group optional backend (#94543)
Update documentation for `init_process_group()` to mention the `backend` argument is optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94543
Approved by: https://github.com/kwen2501
2023-02-13 21:45:38 +00:00
Howard Huang
f45c196653 Update backend config to be under _World (#94191)
All the c10d process group state is under `_World`, so this is BE work to include a missing map
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94191
Approved by: https://github.com/kumpera
2023-02-09 20:48:42 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
Iris
f54fd6fb28 [c10d] Update get_backend() in exception_handler (#94063)
Currently, get_backend() and get_world_size() would always return the default value if no process group argument is passed. This fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94063
Approved by: https://github.com/H-Huang
2023-02-04 19:39:36 +00:00
Ching-Hsiang Chu
1fa68d40b8 [pytorch] fix backend_type for backend/PG plugin (#93129)
Summary: For the backend/PG plugin, use `ProcessGroup.BackendType.CUSTOM` to avoid an uninitialized variable during the later `pg._register_backend` call.

Test Plan: CI/CD and internal tests

Differential Revision: D42793222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93129
Approved by: https://github.com/H-Huang
2023-01-30 23:16:08 +00:00
Howard Huang
2503a4a7c6 Fix MPI backend PG initialization (#92847)
Fixes #92573

Add test to check that all default backends can be initialized to prevent the above from regressing in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92847
Approved by: https://github.com/rohan-varma
2023-01-24 23:24:41 +00:00
Andrew Gu
cb67d9460b [PT-D] Fix send, recv return type (#92152)
- `send` returns `None`.
- `recv` returns the sender rank if valid or -1 otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92152
Approved by: https://github.com/wz337
2023-01-14 01:09:49 +00:00
joncrall
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
Howard Huang
99aec69f58 [BE] remove Backend.TCP (#91314)
Remove Backend.TCP which is unused. Fixes a task in https://github.com/pytorch/pytorch/issues/90544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91314
Approved by: https://github.com/awgu
2022-12-23 15:48:29 +00:00
Sergii Dymchenko
365071c73c Fix non-existing parameters in docstrings in torch/distributed (#91116)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91116
Approved by: https://github.com/huydhn
2022-12-22 02:37:31 +00:00
Sadra Barikbin
97f514f38e Fix two typos in torch.distributed.distributed_c10d.py::broadcast_object_list (#91237)
Fixes #91236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91237
Approved by: https://github.com/malfet, https://github.com/H-Huang
2022-12-21 19:45:08 +00:00
Howard Huang
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue (#85122).

#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
Rohan Varma
793a999ce0 Hybrid Sharded Data Parallel (#89915)
Adds 2 new hybrid sharding strategy to FSDP:
1. HYBRID_SHARD: applies zero-3 style sharding within a node, and data parallel across
2. HYBRID_SHARD_ZERO2: applies zero-2 style sharding within a node, and data parallel across

These are useful for medium sized models and aim to decrease communication volume, tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy.

Hybrid sharding in general works by sharding the model using a process group within a single node, and creating intra-node process groups for replication / data parallelism. The user either needs to pass in a tuple of these process groups, or None, and we generate the process groups appropriately.
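
A hedged sketch of opting into the strategy (process groups are left at their default so they are derived automatically; `my_module` is a placeholder):
```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Shard parameters within each node (zero-3 style) and replicate across nodes.
model = FSDP(
    my_module,  # placeholder for the wrapped nn.Module
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    device_id=torch.cuda.current_device(),
)
```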

** Acknowledgements **
- @awgu 's excellent prototype: 5ad3a16d48
- @liangluofb For ideation, feedback, and initial implementation and experimentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89915
Approved by: https://github.com/awgu
2022-12-08 16:18:03 +00:00
Howard Huang
ee907375fa [small] Update error message (#89294)
Summary:
`RuntimeError: Invalid function argument. Expected parameter "tensor_list" to be of type List[torch.Tensor].`

to

`RuntimeError: Invalid function argument. Expected parameter "input_tensor_list" to be of type List[torch.Tensor].`

Test Plan: sandcastle

Differential Revision: D41405238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89294
Approved by: https://github.com/awgu
2022-11-19 00:21:14 +00:00
Masaki Kozuki
63e16216d8 [c10d] Implement __instancecheck__ for c10d::ReduceOp (#88275)
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- https://github.com/pybind/pybind11/issues/2696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88275
Approved by: https://github.com/wanchaol
2022-11-15 13:21:41 +00:00
Iris
68fd8f3706 [BE] [c10d][send] Improve error message on dist.send() with destination rank as itself (#89004)
This improves the error msg on dist.send() and adds a corresponding test in test_c10d_common.py (https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_common.py).
Context in issue #83912: https://github.com/pytorch/pytorch/issues/83912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89004
Approved by: https://github.com/H-Huang
2022-11-15 06:13:17 +00:00
Howard Huang
1d54ce9d5d [14/N] Refactor _new_process_group_helper() to remove repeated code (#88351)
Changes:
- refactor parts of `_new_process_group_helper()` to remove repeated code

Differential Revision: [D41188274](https://our.internmc.facebook.com/intern/diff/D41188274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88351
Approved by: https://github.com/kwen2501
2022-11-10 19:27:17 +00:00
Kurt Mohler
ee28b865ee Deprecate TypedStorage, its derived classes, and all of their public methods (#85303)
Part of #85302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85303
Approved by: https://github.com/ezyang
2022-11-08 18:11:01 +00:00
Rodrigo Kumpera
6663ae5537 [2/n] Thread PG: add class _World to distributed_c10d.py (#781) (#88471)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/781

Move a bunch of globals to instance methods and replace all uses of them.

We move all PG-related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.

It almost gets DDP working; the PG is just missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and having the default _World wrap them.

I have relinked this diff to a new github PR, so that I can update it. The original PR is
> Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348

Differential Revision: D40236769

Pulled By: yhcharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88471
Approved by: https://github.com/gnadathur, https://github.com/rohan-varma
2022-11-07 17:56:40 +00:00
Kazuaki Ishizaki
2ddefbdc3c Fix typos used in documents under torch directory (#88300)
This PR fixes typos in comments of Python files that were found via the search box at https://pytorch.org/docs/master/search.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300
Approved by: https://github.com/lezcano
2022-11-02 09:38:13 +00:00
Iris Zhang
0cf572ff6c [C10D][BE] Add exception handlers to c10d collectives function (#87643) (#87988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643

1. Add a decorator function exception_handlers to  c10d collectives.
2. Update test(torch/distributed/distributed_c10d.py) to include mp tests for exception_handler.

```
python3 test/distributed/test_c10d_error_logger.py
```

Test Plan: Test in OSS.

Reviewed By: H-Huang

Differential Revision: D40281632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988
Approved by: https://github.com/H-Huang
2022-10-29 04:38:34 +00:00
Masaki Kozuki
aa8248cc9a Reenable isinstance with torch.distributed.ReduceOp (#87303)
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: https://github.com/pytorch/pytorch/issues/87191

cc @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87303
Approved by: https://github.com/wanchaol
2022-10-21 15:05:36 +00:00
PyTorch MergeBot
f451e824f3 Revert " C10D extension to enable per-thread PG (#86348)"
This reverts commit 97abc21f2b.

Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests 97abc21f2b
2022-10-14 01:26:46 +00:00
Rodrigo Kumpera
97abc21f2b C10D extension to enable per-thread PG (#86348)
Move a bunch of globals to instance methods and replace all uses of them.

We move all PG-related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.

It almost gets DDP working; the PG is just missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and having the default _World wrap them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
2022-10-13 22:23:28 +00:00
Louis Feng
55479fe80e Enable capturing of comm collective parameters (#98) (#85368)
Summary:
X-link: https://github.com/facebookresearch/torch_ucc/pull/98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan: P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85368
Approved by: https://github.com/H-Huang
2022-10-11 04:38:26 +00:00
Jesus Magana
c670bad72f Update dist.scatter() documentation (#86069)
Update documentation for dist.scatter()

Fixes #84566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86069
Approved by: https://github.com/rohan-varma, https://github.com/H-Huang
2022-10-03 17:22:08 +00:00
Ke Wen
05d1128106 [c10d] Start deprecating *_multigpu APIs (#85961)
### Deprecation reasons:
- For most users, training is on one GPU per process, so these APIs are rarely used
- They added one more API dimension
- They can be expressed in a composed manner
- They are not abstracted; they are specific to GPU
- They caused backend APIs and implementations to have nested `std::vector<std::vector<Tensor>>`, which is hard to read or maintain

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85961
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-10-01 00:59:39 +00:00
Ke Wen
463283e016 [c10d] Start deprecating *_coalesced APIs (#85959)
- We consider that general users need not use the `*_coalesced` APIs unless there is an extreme concern about performance.

- We are investigating a context manager named `coalescing_manager` which wraps multiple individual collectives to compose the coalescing hint, rather than giving each collective a `*_coalesced` variant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85959
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-10-01 00:55:27 +00:00
Ke Wen
ade1c19612 Add reduce_scatter_tensor in place of _reduce_scatter_base (#85867)
This is a twin PR similar to the one for `all_gather_into_tensor` (#85686).
The philosophy for renaming `_reduce_scatter_base` instead of merging it is described in #85686.
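
For reference, a hedged single-call sketch of the renamed API (assuming an initialized NCCL group):
```python
import torch
import torch.distributed as dist

world = dist.get_world_size()
full_input = torch.ones(world * 4, device="cuda")  # same length on every rank
my_shard = torch.empty(4, device="cuda")           # this rank's reduced shard

# Replaces the deprecated _reduce_scatter_base call.
dist.reduce_scatter_tensor(my_shard, full_input)
```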

Cc @rohan-varma @H-Huang @crcrpar @ptrblck @mrshenli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85867
Approved by: https://github.com/crcrpar, https://github.com/H-Huang
2022-09-30 05:48:16 +00:00
Saliya Ekanayake
941d7a31f6 Pass group ranks and options to third party distributed backends (#73164)
Fixes #73163

PyTorch's [_new_process_group_helper()](9f541aa3ac/torch/distributed/distributed_c10d.py (L633)) does not pass the group's participating ranks to the backend.

This PR adds that capability and also refactors some variables for better clarity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73164
Approved by: https://github.com/kumpera
2022-09-29 17:28:58 +00:00
PyTorch MergeBot
6fae62b35f Revert "C10D extension to enable per-thread PG (#84153)"
This reverts commit 5cbffbbac9.

Reverted https://github.com/pytorch/pytorch/pull/84153 on behalf of https://github.com/kumpera due to broke internal stuff
2022-09-29 13:51:05 +00:00
Ke Wen
775a22c7c6 Add all_gather_into_tensor in place of _all_gather_base (#85686)
### Description
- This PR renames `_all_gather_base` to `all_gather_into_tensor` so that it is clearer in meaning.
- The `all_gather_into_tensor` API differs from the `all_gather` API in the output it accepts -- a single, large tensor instead of a list of tensors.
- This PR also adds deprecation warning to `_all_gather_base`.

### Issue
`_all_gather_base` was implemented in https://github.com/pytorch/pytorch/pull/33924 to avoid unnecessary flattening. There was a previous effort (#82639) to merge `_all_gather_base` with the existing `all_gather` API by detecting the parameter type passed in for the output.

There are, however, two "blockers" that make the merge difficult:
(i) The merge leads to a backward-compatibility break: we would need to change the parameter name `tensor_list` in `all_gather` to a general name `output` that can cover both a tensor and a tensor list.
(ii) Recently, the `all_gather` API has added uneven tensor support, utilizing the tensor boundaries implied by the list. We are, however, not sure about adding such support to the `_all_gather_base` function, because that would require users to pass in additional tensor boundary information.

In view of the above, we decided to productize `_all_gather_base` as a separate function, but with a clearer name.

### Testing
Added tests:
- `test_all_gather_into_cat_tensor_cuda` -- output form as with `torch.cat`. For example:
```
        >>> tensor_in
        tensor([1, 2], device='cuda:0') # Rank 0
        tensor([3, 4], device='cuda:1') # Rank 1
        >>> tensor_out
        tensor([1, 2, 3, 4], device='cuda:0') # Rank 0
        tensor([1, 2, 3, 4], device='cuda:1') # Rank 1
```
- `test_all_gather_into_stack_tensor_cuda` -- output form as with `torch.stack`. For example:
```
        >>> tensor_out2
        tensor([[1, 2],
                [3, 4]], device='cuda:0') # Rank 0
        tensor([[1, 2],
                [3, 4]], device='cuda:1') # Rank 1
```
The output form is determined by the shape of the output tensor passed by the user, no flag used.

Cc @rohan-varma @mrshenli @crcrpar @ptrblck @H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85686
Approved by: https://github.com/rohan-varma, https://github.com/crcrpar
2022-09-27 22:50:22 +00:00
Rodrigo Kumpera
5cbffbbac9 C10D extension to enable per-thread PG (#84153)
Move a bunch of globals to instance methods and replace all uses of them.

We move all PG-related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.

It almost gets DDP working; the PG is just missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84153
Approved by: https://github.com/rohan-varma
2022-09-27 21:42:31 +00:00
Rodrigo Kumpera
7dcc723d35 [c10d] Ensure collectives are called with the same dtype for all tensor params. (#84664)
While passing tensors with different dtypes doesn't crash, it doesn't produce sensible results.

We see data tearing instead of casting.

It's not clear we want to support transparent casting so, for now, we fail when such input is presented.

Fixes #84525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84664
Approved by: https://github.com/rohan-varma
2022-09-15 22:32:51 +00:00
Salahuddin
6bd7d0f856 doc string fixed in torch.distributed.reduce_scatter (#84983)
Fixes #84865

Previous `torch.distributed.reduce_scatter`:

```
def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False):
    """
    Reduces, then scatters a list of tensors to all processes in a group.

    Args:
        output (Tensor): Output tensor.
        input_list (list[Tensor]): List of tensors to reduce and scatter.
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.
```

Fixed:

```
def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False):
    """
    Reduces, then scatters a list of tensors to all processes in a group.

    Args:
        output (Tensor): Output tensor.
        input_list (list[Tensor]): List of tensors to reduce and scatter.
        op (optional): One of the values from
            ``torch.distributed.ReduceOp``
            enum.  Specifies an operation used for element-wise reductions
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84983
Approved by: https://github.com/H-Huang
2022-09-15 18:17:10 +00:00
Rodrigo Kumpera
38192f63cd Add __all__ for a few distributed modules plus a little typing (reland) (#84872)
This handles distributed_c10d, which is massive, and ddp_comm_hooks.

This relands #84119 with the required fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84872
Approved by: https://github.com/rohan-varma
2022-09-13 21:57:49 +00:00
PyTorch MergeBot
219ff26172 Revert "Add __all__ for a few distributed modules plus a little typing (#84119)"
This reverts commit 6f21680563.

Reverted https://github.com/pytorch/pytorch/pull/84119 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D39386448
2022-09-09 20:01:07 +00:00
Rodrigo Kumpera
6f21680563 Add __all__ for a few distributed modules plus a little typing (#84119)
This handles distributed_c10d, which is massive, and ddp_comm_hooks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84119
Approved by: https://github.com/rohan-varma
2022-09-08 23:28:31 +00:00
Rodrigo Kumpera
e96fb5d58c [c10d] Fix docstring of scatter_object_list (#84596)
The docstring for scatter_object_list mentions it doesn't work with NCCL, but this was fixed in #79034.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84596
Approved by: https://github.com/H-Huang
2022-09-07 14:49:45 +00:00
Masaki Kozuki
ab6c57217a Add NCCL PreMul Sum to c10d reduce ops (#84243)
This is based on #81272, but this version conforms to the TorchScript compiler.

## TODO
- [ ] Update abaf8112e6/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L64-L73) to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of `ReduceOp::RedOpType` operator. That being said, I want to make it more explicit.

cc @ptrblck @kwen2501 @aazzolini
cc @zasdfgbnm for visibility to the TODO above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243
Approved by: https://github.com/kwen2501
2022-09-02 21:57:45 +00:00
Rodrigo Kumpera
7a348a1d4a Fix internal breakage caused by #82134 (#84363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84363
Approved by: https://github.com/rohan-varma, https://github.com/mehtanirav
2022-09-01 17:54:10 +00:00
Rodrigo Kumpera
65dc5dd3f3 [c10d] Introduce dist.get_local_rank, dist.get_global_rank and dist.get_global_ranks (#82134)
Those functions enable membership introspection into a ProcessGroup. A common scenario
that needs this is library code that consumes a PG but doesn't create it, which means
it likely doesn't know the global ranks used to create it.

Translating from local to global is necessary when using c10d collectives like broadcast,
so if your library code adopts the convention of using local rank 0, it needs
to do the following:

```python
import torch.distributed as dist

my_pg: dist.ProcessGroup = ...

def my_library_bcast(tensor):
    dist.broadcast(tensor, src=dist.get_global_rank(my_pg, 0), group=my_pg)

```

This implements some of the helpers needed to implement the `clone` API from: https://github.com/pytorch/pytorch/issues/81291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82134
Approved by: https://github.com/rohan-varma
2022-08-30 17:45:00 +00:00
PyTorch MergeBot
1f61c39ac4 Revert "Support NCCL Premul Sum (#81272)"
This reverts commit 432c508e71.

Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-08-25 05:01:37 +00:00
Masaki Kozuki
432c508e71 Support NCCL Premul Sum (#81272)
This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is cb99ad6744 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501
2022-08-24 04:53:25 +00:00
joncrall
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
Xiang Gao
cda210e23b UCC PG build in CI (#81583)
- Modifies the current cmake build definitions to use `find_package` to find UCX and UCC installed in the system
- Install UCX and UCC in CUDA dockers
- Build PyTorch with `USE_UCC=1` in pipelines
- Currently, we are not running unit tests with the UCC PG. Those tests will be added in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81583
Approved by: https://github.com/vtlam, https://github.com/malfet
2022-08-10 00:23:47 +00:00
Aashaka Shah
24a084eda6 [c10d] Fix async error in batch_isend_irecv (#82450)
Summary:
`batch_isend_irecv` previously required the use of `torch.cuda.synchronize` to avoid data races. This was because the ncclStreams were recorded in the returned ncclWork object _before_ the `_batch_p2p_manager` issued its ncclGroupEnd. Thus, `req.wait()` was effectively waiting on nothing, and later operators could work on incorrect intermediate data.
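
A hedged sketch of the call pattern this affects (a ring exchange; with the fix, `req.wait()` properly synchronizes the NCCL streams without an extra `torch.cuda.synchronize`):
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
world = dist.get_world_size()
send_t = torch.ones(4, device=f"cuda:{rank}")
recv_t = torch.empty(4, device=f"cuda:{rank}")

# Send to the next rank, receive from the previous one.
ops = [
    dist.P2POp(dist.isend, send_t, (rank + 1) % world),
    dist.P2POp(dist.irecv, recv_t, (rank - 1) % world),
]
reqs = dist.batch_isend_irecv(ops)
for req in reqs:
    req.wait()
```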

This fix:
- keeps track of ncclStreams to wait on, and records them in the work objects after the batch manager issues a ncclGroupEnd
- renames the `_batch_p2p_manager` to `_coalescing_manager` for generality
- removes the explicit check for the NCCL backend inside `_batch_p2p_manager` in distributed_c10d.py and moves the manager start/end to ProcessGroup.hpp, in order to transparently work with all process groups

Test Plan: Modified the unittest for `batch_isend_irecv` to check that received tensors are the same as expected tensors. Verified that the test fails before the change, and passes after the change.

Differential Revision: D38100789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82450
Approved by: https://github.com/kwen2501
2022-08-08 17:50:22 +00:00
ProGamerGov
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function.

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
Terry Lam
54bdaf76d6 [PFC] Native UCC process group for Pytorch (#79918)
Summary:
This diff integrates the UCC process group as a native component of PyTorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for the UCC collective communication library.
The environment and cmake variables are named to mirror the existing process groups such as NCCL and Gloo. Specifically,
- USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME.

Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., requiring users to specify external libraries for UCX and UCC. In subsequent diffs, we will add UCX and UCC repos as third-party dependencies in pytorch/third-party.

Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:

$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded

Differential Revision: D36973688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc
2022-07-12 14:45:44 +00:00
Nikita Shulga
09df27fe45 Revert "Revert "[distributed] Handle object collectives and NCCL. (#79034)""
This reverts commit 279634f384.
2022-06-15 10:04:37 -07:00
PyTorch MergeBot
279634f384 Revert "[distributed] Handle object collectives and NCCL. (#79034)"
This reverts commit 4ebb326b75.

Reverted https://github.com/pytorch/pytorch/pull/79034 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-06-15 16:16:21 +00:00
Rodrigo Kumpera
4ebb326b75 [distributed] Handle object collectives and NCCL. (#79034)
This fixes all object collectives under NCCL and adds some automated tests for them.

This PR *does not* fix sending tensors using object collectives.

It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79034
Approved by: https://github.com/rohan-varma
2022-06-13 19:23:39 +00:00
Michael Suo
fb0f285638 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 20:51:34 +00:00
PyTorch MergeBot
3d7428d9ac Revert "[lint] upgrade mypy to latest version"
This reverts commit 9bf18aab94.

Reverted https://github.com/pytorch/pytorch/pull/76753 on behalf of https://github.com/suo
2022-05-03 20:01:18 +00:00
Michael Suo
9bf18aab94 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 19:43:28 +00:00
Ke Wen
d6f22abbcc [PyTorch Distributed] Fix batch_isend_irecv
Summary:
`batch_isend_irecv` previously only worked for two-rank cases and would
otherwise hang, e.g. pytorch/pytorch#73960. This diff extends
`batch_isend_irecv` to support more than two ranks. The fix treats the
operation more like a collective than a two-rank P2P call when selecting the
communicator, since more ranks can participate in the batch call than just "my" rank and "my" peer. See the sketch after the rules below.

Rules:
- If `batch_isend_irecv` is the first collective call (including collectives and
  all-to-all) in the `group` given as the argument, then all ranks of the
  `group` are expected to participate in this call.
- Otherwise, if it is not the first collective call in the `group` (i.e. the
  communicator has been initialized), then batched P2P communication involving
  only a subset of the `group`'s processes is allowed.
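A minimal sketch of a ring-style neighbor exchange using the batched P2P API, assuming the default process group has already been initialized (illustration only, not part of this diff):
```
import torch.distributed as dist

def ring_exchange(send_tensor, recv_tensor):
    # Every rank sends to its right neighbor and receives from its left one.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    ops = [
        dist.P2POp(dist.isend, send_tensor, (rank + 1) % world_size),
        dist.P2POp(dist.irecv, recv_tensor, (rank - 1) % world_size),
    ]
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
```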

Test Plan:
Added p2p_tests.py testing the following patterns:
+    sendrecv_neighbor(input, output)       # Ring like neighbor exchange
+    sendrecv_ripple(input, output)         # Exchange with growing distance (pytorch/pytorch#73960)
+    sendrecv_P2P(input, output)            # Single P2P operation
+    isendrecv_P2P(input, output)           # Single non-blocking P2P operation
+    isendrecv_P2P_batch(input, output, 0)  # batched P2P between only two ranks
+    isendrecv_P2P_batch(input, output, 1)  # batched P2P within a new group created for two ranks

Differential Revision: D35122664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74701
Approved by: https://github.com/mingzhe09088, https://github.com/osalpekar
2022-04-13 05:55:00 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Yuxin Wu
1ed4653e89 Stop writing logs to root logger (#72649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649

Reviewed By: soulitzer

Differential Revision: D34172113

Pulled By: mrshenli

fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf
(cherry picked from commit c14297cee6)
2022-02-11 21:30:53 +00:00
Rohan Varma
678c08bb55 [PG Wrapper] Small fix (#72657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72657

The _ProcessGroupWrapper check needs to be gated on Gloo availability;
this fails when Gloo is not available.
ghstack-source-id: 148837056

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D34144848

fbshipit-source-id: 42a04918b968247f3259cd2cde5438e1265b04fe
(cherry picked from commit ba5de98939)
2022-02-11 15:59:13 +00:00
Wanchao Liang
8551989bff [c10d] Enable gather_object on nccl (#71623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71623

Enable gather_object on the NCCL backend, since we already support `dist.gather` on NCCL. This requires the user to set the current device properly; see the sketch below.
ghstack-source-id: 147754836
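A minimal usage sketch, assuming an already-initialized NCCL group and one GPU per rank (the rank-to-GPU mapping here is an illustrative assumption); the explicit `torch.cuda.set_device` call is the part this change requires:
```
import torch
import torch.distributed as dist

torch.cuda.set_device(dist.get_rank())  # assumption: rank maps to local GPU index
obj = {"rank": dist.get_rank()}
gather_list = [None] * dist.get_world_size() if dist.get_rank() == 0 else None
dist.gather_object(obj, gather_list, dst=0)
# on dst=0, gather_list now holds one object per rank
```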

Test Plan: distributed_nccl_spawn -r test_gather_object

Reviewed By: zou3519

Differential Revision: D33701042

fbshipit-source-id: 39cff22947a7cac69d0c923b956dc10f25353a6f
(cherry picked from commit 6e6eff497f)
2022-01-27 14:59:55 -08:00
Shen Li
7bc220e060 Update distributed.rst for ProcessGroup Extensions (#71482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71482

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D33745986

Pulled By: mrshenli

fbshipit-source-id: fe2d0491901bf00be09deb5c556bc1e1d359b725
(cherry picked from commit be5104bfd7)
2022-01-25 00:30:08 +00:00
Stephan Uphoff
e1e43c4e71 Prevent sum overflow in broadcast_object_list (#70605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70605

broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.

Test Plan:
Add a Tensor  with >2GB storage requirement (in distributed_test.py) to object broadcast.

This Tensor is only added if tests are running at Meta, as GitHub tests will OOM.

Without fix the length will overflow and the program will request a negative sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With fix it will pass the test.

Test used on server with GPUs:

buck test  mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
buck test  mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --local -- broadcast_object

Reviewed By: r-barnes

Differential Revision: D33405741

fbshipit-source-id: 972165f8297b3f5d475636e6127ed4a49adacab1
2022-01-05 09:07:39 -08:00
Michael Suo
b7b32b56f1 Revert D33281300: Prevent sum overflow in broadcast_object_list
Test Plan: revert-hammer

Differential Revision:
D33281300 (807f9a828c)

Original commit changeset: 1bc83e8624ed

Original Phabricator Diff: D33281300 (807f9a828c)

fbshipit-source-id: beb81a9cbfba405a61b11dfaa8e39c9601f45643
2021-12-27 19:01:53 -08:00
Stephan Uphoff
807f9a828c Prevent sum overflow in broadcast_object_list (#70336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70336

broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.

Test Plan:
Increased size of Tensor used in object transfers to have  >2GB storage requirement (in distributed_test.py)

Without fix the length will overflow and the program will request a negative sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With fix it will pass the test.

Test used on server with GPUs:

buck test  mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object

Differential Revision: D33281300

fbshipit-source-id: 1bc83e8624edc14e747eeced7bc8a7a10e443ee4
2021-12-27 16:17:53 -08:00
s-kumano
ff53ed24d2 fix NameError of docstring in broadcast_object_list (#69810)
Summary:
This PR fixes NameError of docstring in broadcast_object_list.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69810

Reviewed By: kimishpatel

Differential Revision: D33143167

Pulled By: jbschlosser

fbshipit-source-id: 99c076466ae4b4a332763b7546028c5097b417d7
2021-12-16 10:50:45 -08:00
Bryan Reese
4670f0f2c5 Set non-default backend names to lower case (#69400)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69400

Hopefully this makes naming more consistent. Without this change, some tests will fail for plugins since values can be set to upper case in some cases. This should prevent that and make lookup and comparison consistent.

Test Plan: Check the signals. There is no specific test for this, but all tests should pass.

Reviewed By: mrshenli

Differential Revision: D32836529

fbshipit-source-id: 1b7d2b64e04fe0391b710aa6ed6d1e47df9027a3
2021-12-07 07:58:46 -08:00
Rohan Varma
cb14a258a2 [c10d] Fix object-based collectives for debug mode (#68223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223

DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.

Instead, check whether it is a wrapped PG, and then check the wrapped pg to see if it is NCCL.
ghstack-source-id: 143242023

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32366840

fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
2021-11-13 04:18:31 -08:00
Shen Li
18955d3564 Raise warning when calling collectives on non-member group objects (#67639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67639

Due to BC considerations, we cannot directly error out, as that
might break existing applications. Raise warnings first to improve
debuggability.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D32075151

Pulled By: mrshenli

fbshipit-source-id: 5680d420f5f6cd3f74a36616c03350e8a976b363
2021-11-02 20:04:07 -07:00
Shen Li
ce6f4b3a02 Setup c10d extension Backend class attr the same way as builtin ones (#66991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66991

Currently, c10d extensions uses Backend.NAME to store the creator
function. However, builtin ones use that same field to store the
name. This commit makes c10d extensions comply with builtin ones,
and uses a dedicated `_plugins` field to store creator functions.

Thanks bryanmr for pointing this out.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31820307

Pulled By: mrshenli

fbshipit-source-id: 259769ebfc80c0c9fc44d25498c8d19a3a09d1bc
2021-10-21 12:35:07 -07:00
Yi Wang
12137db5e3 Fix the slowdown of _object_to_tensor since 1.9 (#65721)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65721

#Closes: https://github.com/pytorch/pytorch/issues/65696

The bug is introduced in https://github.com/pytorch/pytorch/pull/55861, and it causes 100X slowdown since 1.9.
ghstack-source-id: 139128267

Test Plan:
Performance test:
```
import time

from torch.distributed.distributed_c10d import _object_to_tensor

start = time.time()
_object_to_tensor("x" * 50_000_000)
print("Time:", time.time() - start)
```

Reviewed By: rohan-varma

Differential Revision: D31219794

fbshipit-source-id: 1abec38f9d51361c1eab6ad5efd87b589322e208
2021-09-27 19:22:10 -07:00
Shen Li
2a81e8b8f1 Let all_reduce_coalesced and all_gather_coalesced return Future objects (#64722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64722

`all_reduce_coalesced` and `all_gather_coalesced` are never publicly
released in our API docs. So, I would assume the blast radius to be small.

The motivation for this change is to allow implementing
`all_reduce_coalesced` and `all_gather_coalesced` by re-using the `allreduce`
and `allgather` C++ cores and performing flatten and copy only on the Python
side. With that, we can then remove `all_reduce_coalesced` and
`all_gather_coalesced` from the C++ ProcessGroup APIs. For the async mode,
the copy-back logic after the communication will need to be chained
as a callback on the returned Future, with the chained child Future
used as the return value (otherwise, we would need to wrap the child Future
into another work handle). This PR tests whether we can directly
return a Future without breaking tests and internal use cases. If so,
it will make the consolidation a lot easier.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D30830994

Pulled By: mrshenli

fbshipit-source-id: dcde0ed9245e9e8fee357b3588b07d540a4b6318
2021-09-10 07:45:25 -07:00
mrshenli
101a626330 Improve distributed.get_rank() API docstring (#63296)
Summary:
See discussion in https://pytorch.slack.com/archives/CBHSWPNM7/p1628792389008600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63296

Reviewed By: cbalioglu

Differential Revision: D30332042

Pulled By: mrshenli

fbshipit-source-id: 3a642fda2e106fd35b67709ed2adb60e408854c2
2021-08-27 11:34:55 -07:00
Kiuk Chung
9d95d48567 (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port.

For details see: https://github.com/pytorch/pytorch/issues/63874.

This change does a couple of things:

1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths

NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is compatible as well, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue. A minimal usage sketch of the new check follows.
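A minimal sketch of using the new check in a training script (the host/port in the tcp:// branch are hypothetical and only relevant for direct launches):
```
import torch.distributed as dist

if dist.is_torchelastic_launched():
    # torch.distributed.run already set MASTER_ADDR/MASTER_PORT etc. in the env.
    dist.init_process_group(backend="gloo", init_method="env://")
else:
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:29500",  # hypothetical address
        rank=0,
        world_size=1,
    )
```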

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
2021-08-25 22:57:43 -07:00
Gao, Xiang
2d103025a5 Adding warning on isend about modifying after send (#61875)
Summary:
This is a standard limitation of collective communication libraries. For example:

https://www.open-mpi.org/doc/v4.0/man3/MPI_Isend.3.php
```
A nonblocking send call indicates that the system may start copying data out of the send buffer. The sender should not modify any part of the send buffer after a nonblocking send operation is called, until the send completes.
```

http://openucx.github.io/ucx/api/latest/html/group___u_c_p___c_o_m_m.html#ga8323878b60f426c630d4ff8996ede3cc
```
The user should not modify any part of the buffer after this operation is called, until the operation completes.
```
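A minimal sketch of the safe pattern the warning asks for, assuming a two-rank group is already initialized: the send buffer is only modified after the request completes.
```
import torch
import torch.distributed as dist

rank = dist.get_rank()
buf = torch.full((4,), float(rank))
if rank == 0:
    req = dist.isend(buf, dst=1)
    req.wait()      # do not touch buf before this returns
    buf += 1.0      # safe to modify now
else:
    req = dist.irecv(buf, src=0)
    req.wait()
```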

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61875

Reviewed By: suo

Differential Revision: D29783720

Pulled By: mrshenli

fbshipit-source-id: 78fd047c74449f77b906f3766a6c2bc29499847d
2021-07-29 07:37:18 -07:00
Marjan Fariborz
994434ad16 Adding complex number support for all_to_all/scatter (#61299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61299

Modifying all_to_all and scatter to support complex numbers as well as float numbers.

Test Plan: buck run //caffe2/test/distributed:distributed_gloo_fork -- test_name --print-passing-details --run-disabled

Reviewed By: wanchaol

Differential Revision: D29563938

fbshipit-source-id: 59e436b3fa1aee3d5195cbcffd39587e642c76b9
2021-07-20 15:45:34 -07:00
Yu Guo
a50a389ca6 Revert D29701479: [pytorch][PR] Remove _broadcast_object() from ZeroRedundancyOptimizer
Test Plan: revert-hammer

Differential Revision:
D29701479 (9b5d9b4049)

Original commit changeset: c8d5f9057b32

fbshipit-source-id: 35ab1f399513fb9d1c4e73b1fa906e559d2a6994
2021-07-15 10:03:08 -07:00
Andrew Gu
9b5d9b4049 Remove _broadcast_object() from ZeroRedundancyOptimizer (#61539)
Summary:
Revised version of https://github.com/pytorch/pytorch/issues/60573.

**Overview:**
This makes two changes:
- It introduces a `map_location` argument to `broadcast_object_list()`. The argument specifies the device onto which tensors contained in objects received from the broadcast are loaded. This change requires modifying the implementation of `_object_to_tensor()` and `_tensor_to_object()` to use `torch.save()` and `torch.load()` respectively.
- It removes all calls to `_broadcast_object()` in `ZeroRedundancyOptimizer` and the corresponding test file in favor of `broadcast_object_list()`.

The default value of `map_location` is `None`, in which case `_object_to_tensor()` and hence `broadcast_object_list()` preserve their original behavior. Namely, contained tensors are loaded to their original device.

In `consolidate_state_dict()`, I specify `map_location=torch.device("cpu")` instead of `self._default_device`. This slightly changes the behavior from before when using `_broadcast_object()`. The reason I do so is that it saves one GPU to CPU data transfer since the action immediately after receiving the broadcasted `local_state_dict` is to copy it to CPU.

Explicitly, if `map_location=self._default_device`, then the data transfer path assuming NCCL backend is as follows:
`source GPU --[before serialize]--> source CPU --[before broadcast]--> source GPU --[broadcast]--> destination GPU --[before deserialize]--> destination CPU --[deserialize]--> destination GPU --[copy]--> destination CPU`
Hence, by setting `map_location=torch.device("cpu")` instead, the suffix becomes:
`destination CPU --[deserialize]--> destination CPU --[copy]--> destination CPU`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/61539

Test Plan:
I added a test `test_broadcast_object_list_map_location()` that checks for both `map_location` as CPU and GPU that (1) tensors contained in broadcasted objects are appropriately loaded onto the specified device and (2) that the contents of the tensors are correct.

The existing `ZeroRedundancyOptimizer` tests pass.
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```

The existing `broadcast_object_list()` test passes:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_broadcast_object_list
```

Reviewed By: zou3519

Differential Revision: D29701479

Pulled By: andwgu

fbshipit-source-id: c8d5f9057b32e5e9f40e8edc5b2cc25fb21414a9
2021-07-14 17:36:30 -07:00
Bo Wang
ab27399566 Make broadcast_object_list accept a device parameter. (#61305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61305

Part I (this PR): Add dist_device argument to broadcast_object_list API
Part II: andwgu@ will deprecate _broadcast_object with the newly introduced API
	 Also include the changes to _object_to_tensor()/_tensor_to_object() with PR 60573

Context: https://github.com/pytorch/pytorch/issues/60062
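A minimal sketch of broadcasting Python objects with an explicit device, shown with the `device` keyword as in the current public API (assumes an initialized NCCL group and a CUDA device set per rank):
```
import torch
import torch.distributed as dist

if dist.get_rank() == 0:
    objects = [{"step": 1}, "done"]
else:
    objects = [None, None]  # placeholders, filled in by the broadcast
dist.broadcast_object_list(objects, src=0, device=torch.device("cuda"))
# every rank now sees objects == [{"step": 1}, "done"]
```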

Test Plan:
Run the following on DevGpus with two cuda devices

$python setup.py develop    --- run this build on DevGPU
$BACKEND='nccl' WORLD_SIZE=2 with-proxy  python test/distributed/test_distributed_fork.py  TestDistBackendWithFork.test_broadcast_object_list --v
$BACKEND='gloo' WORLD_SIZE=2 with-proxy  python test/distributed/test_distributed_fork.py  TestDistBackendWithFork.test_broadcast_object_list --v

Build with distributed on: USE_DISTRIBUTE=1 python setup.py develop
Test on CPU devvm:

$ with-proxy python test/distributed/optim/test_zero_redundancy_optimizer.py

Imported from OSS

Differential Revision: D29566538

Reviewed By: iramazanli, mrshenli

Pulled By: bowangbj

fbshipit-source-id: 0bea52442551c5194acba85eadda16ba2ec4b6ef
2021-07-14 11:43:17 -07:00
Philip Meier
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but `mypy` doesn't recognize this. This often stems from the fact, that the used `mypy` version wasn't able to handle the used pattern.

With every new release `mypy` gets better at handling complex code. In addition to fix all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed or not. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out in case it encounters an `type: ignore` that is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
Ruilin Chen
38c3116813 [hierarchical sharding 5/n] enable table-wise -> col-wise sharding in embedding table lookup
Summary:
This diff adds table-wise -> col-wise sharding support in GroupedShardedEmbeddingBag. Changes include:
1. Add necessary member variables set up.
2. Create new fast kernel and add fast kernel lookup support
3. Add intra-host all2all and cross-host all2all logic.

Test Plan:
UT
```
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/backend/tests:test_model_materializer_full_sync_spawn
```
```
buck test caffe2/torch/fb/hpc/tests:model_sharder_test
```
QPS check:
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 32 --use-shrunk-model false --model-version=inline_cvr_dec_2020 --fast-kernel table_batched --max-batches 10000 --num-dpp-worker-threads 16 --num-readers 100 --hpc-identity ads_model_platform --table-partition hierarchical_based --hierarchical-options "["table_based", "column_based"]" --flow-entitlement ads_global_qps
```
with diff:
dec inline_cvr:
table-wise -> table-wise (82K):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_d0a0cba5?version=0&tab=status&env=PRODUCTION

table-wise -> column-wise (80k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_b1ac5873

column-wise:
dec inline_cvr:
gpu trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1623827677%2F127.0.0.1%2Flibkineto_activities_4550.json.gz&bucket=gpu_traces

https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_a79e1522 (81k)

https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_2dacc13e (88k)

row-wise(62k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_4e349cab

table-wise(90k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_5d51b608

10x ctr_mbl_feed:
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 128 --use-shrunk-model false --model-version=ctr_mbl_oct_2020_10x_3tb --num-dpp-worker-threads 16 --num-readers 200 --fast-kernel table_batched --max-batches 5000000 --hpc-identity ads_model_platform --table-partition column_based --flow-entitlement ads_global_tc_mimo
```
column-wise:
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_f05fb306?version=0&tab=status&env=PRODUCTION (290k)

w/o diff:
dec inline_cvr:
column-wise (87K):
gpu trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1623864444%2F127.0.0.1%2Flibkineto_activities_4451.json.gz&bucket=gpu_traces
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_e1315f14

row-wise (60k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_8fcc0adf

table-wise (91k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_cb94ff41

10x ctr_mbl_feed:
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_203ef35b?version=0&tab=status&env=PRODUCTION (281k)

NE check(use deterministic reading D28711400)
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 32 --use-shrunk-model false --model-version=inline_cvr_dec_2020 --fast-kernel table_batched --max-batches 100000 --num-dpp-worker-threads 16 --num-readers 64 --hpc-identity ads_model_platform --table-partition hierarchical_based --hierarchical-options "[table_based, column_based]" --flow-entitlement ads_global_qps --use-deterministic-model --use-deterministic-reading --model-entity-id 995557193
```
w/o this diff:
```
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|lifetime_ne 0.8660048340401448
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|window_ne 0.8660048340401447
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|total_examples 1867776.0
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|window_qps 491.5199890136719
```
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_15bc6243?version=0&tab=status&env=PRODUCTION

w this diff:
```
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|lifetime_ne 0.8660048340401448
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: ne-ne|window_ne 0.8660048340401447
I0611 12:19:18.766000 647 print_publisher.py:33  master      ] Publishing batch metrics: qps-qps|total_examples 1867776.0
```
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_15bc6243?version=0&tab=status&env=PRODUCTION

Reviewed By: JadeNie

Differential Revision: D28689126

fbshipit-source-id: 1c7879d4e3ee2b90aaf2a89e87f7b827d54173b3
2021-06-17 22:25:25 -07:00
clint
78011bc0ce typofix (torch.zero to torch.zeros) in docstring (#59703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59703

Reviewed By: ezyang

Differential Revision: D29145998

Pulled By: H-Huang

fbshipit-source-id: f2670502170aa100fb02408046b7f6850f9379cf
2021-06-15 21:12:42 -07:00
Yi Wang
48ea7c808d [C10d] Support subgroups (#59111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111

Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.

Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.

Note that this API does not accept another overall main group. Like the APEX API `create_syncbn_process_group` [here](https://nvidia.github.io/apex/_modules/apex/parallel.html), it always uses the global world size and should only be applied when CUDA is available.

#Closes: https://github.com/pytorch/pytorch/issues/53962
ghstack-source-id: 130975027
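A minimal usage sketch, assuming CUDA is available and, for illustration, 8 ranks with 4 GPUs per machine:
```
import torch
import torch.distributed as dist

# Each rank gets back the subgroup it belongs to plus the list of all subgroups.
cur_subgroup, subgroups = dist.new_subgroups(group_size=4)
t = torch.ones(1, device="cuda") * dist.get_rank()
dist.all_reduce(t, group=cur_subgroup)  # reduces only within this machine's ranks
```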

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed

Reviewed By: rohan-varma

Differential Revision: D28495672

fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
2021-06-09 22:35:11 -07:00
Can Balioglu
4ee761c2c5 [2/n] [c10d] Introduce the 'multiTenant' constructor parameter in TCPStore (#58329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329

This PR is part of a stack that addresses the GitHub issue #41614; it introduces:

- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.

- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.

Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` only results in a warning output.
ghstack-source-id: 130676389

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: rohan-varma

Differential Revision: D28424978

fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
2021-06-05 07:50:04 -07:00
Luca Wehrstedt
8f4cfaa9db Fix race condition in TP agent (#58753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753

TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing.

One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyways.

Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++.
ghstack-source-id: 130583775

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D28603754

fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
2021-06-04 06:53:42 -07:00
Liang Luo
77de640f4b [torch distributed] Implementing reduce_scatter_base (#57567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57567

Support flattened reduce_scatter.

Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/torch/lib/c10d:ProcessGroupNCCLTest
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:c10d

Reviewed By: zhaojuanmao

Differential Revision: D27876281

fbshipit-source-id: 58e2edfb1baff5cdc083dbaaba9f19502ef0b298
2021-06-03 17:17:53 -07:00
Rohan Varma
19bcbfc5cf [c10d] Use pg wrapper in detailed debug mode (#58281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58281

When TORCH_DISTRIBUTED_DEBUG=DETAIL is enabled, this PR causes process groups created by `new_group` and `init_process_group` that are nccl or gloo to be wrapped in `ProcessGroupWrapper`.

As a result, the user will get back a `ProcessGroupWrapper` that they can use in the exact same way as a regular nccl/gloo pg, but will be more helpful in terms of debugging desync/hangs.

Besides doing collective desync checks, which should be transparent if there are indeed no issues in the user application, there are no semantic differences in using the wrapper pg. Note that there is a performance implication here but that is a tradeoff we are making when DETAIL debug mode is enabled.

Open to suggestions on how to test better. Currently I verified locally that enabling TORCH_DISTRIBUTED_DEBUG=detail creates the wrapper and all tests still pass, but that doesn't run in CI. On the other hand testing everything with debug=detail and the regular tests might be too much, so we have only added it to a few tests for now. We also do have tests in the below diff.
ghstack-source-id: 129817857
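Usage is unchanged; a minimal sketch, assuming the job is launched with `TORCH_DISTRIBUTED_DEBUG=DETAIL` exported in the environment:
```
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
pg = dist.new_group(ranks=[0, 1])  # wrapped in ProcessGroupWrapper under DETAIL
t = torch.ones(1, device="cuda")
dist.all_reduce(t, group=pg)       # consistency checks run before the collective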

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D28402301

fbshipit-source-id: c4d3438320f6f0986e128c738c9d4a87bbb6eede
2021-05-25 09:55:52 -07:00
Rohan Varma
cf395c0718 [c10d] Introduce ProcessGroupWrapper (#58224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224

Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.

Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (e.g. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.

This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.

Once all of this passes we simply dispatch the collective to the underlying pg.

Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28023981

fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
2021-05-24 20:09:51 -07:00
Rohan Varma
071d49a970 Document monitored barrier (#58322)
Summary:
Will not land before the release, but it would be good to have this function documented in master for its use in distributed debuggability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322

Reviewed By: SciPioneer

Differential Revision: D28595405

Pulled By: rohan-varma

fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
2021-05-21 19:04:57 -07:00
Yi Wang
314a578154 Clang format distributed_c10d.py (#58435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58435

Prepare for #53962

ghstack-source-id: 129171617

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D28490326

fbshipit-source-id: 2ed3c5850788b9702a8020f6ee6d0b579625bf89
2021-05-17 16:47:35 -07:00
Rohan Varma
e90fcffb65 [c10d] Log when store based barrier succeeds (#57711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57711

We are seeing some hangs/issues around the store-based barrier internally; it
would be good to have this log to indicate whether the store-based barrier has
completed successfully for a particular rank, to help debug further.
ghstack-source-id: 128605600

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28249087

fbshipit-source-id: 644e5780519017ae780c3bc78bbe5def322db3f8
2021-05-10 21:09:40 -07:00
Liang Luo
c37095760d [torch distributed] Implementing all_gather_base (#56315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56315

This diff implements all_gather_base in PyTorch distributed.

Test Plan: dist.all_gather_base(output, input)...

Reviewed By: agolynski, amylittleyang

Differential Revision: D27488999

fbshipit-source-id: 937ec8bddf9527fa4d114f984d1d0f6a5b8c3936
2021-04-23 14:16:47 -07:00
Wanchao Liang
a970e525fd make ProcessGroup.Options.timeout argument private in python (#56531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531

Per discussions in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API does not confuse users by accepting both a timeout
argument and a timeout in `ProcessGroup.Options`. This PR makes
`ProcessGroup.Options.timeout` a private field that is only used in our test
utils; for both `init_process_group` and `new_group`, we still allow users to
pass `timeout` as a separate argument. Since `ProcessGroupGloo.Options` only
has a `timeout` config, both functions will not allow passing in options for
the Gloo backend.

This way we still preserve `timeout` as the only timeout API, and only allow
users to use `ProcessGroupNCCL.Options` when needed.

cc pritamdamania87 rohan-varma
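A minimal sketch of the resulting usage: the timeout is passed as a plain argument instead of through `ProcessGroup.Options` (the value shown is illustrative).
```
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=timedelta(minutes=5),  # replaces setting Options.timeout directly
)
```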

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27893395

Pulled By: wanchaol

fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
2021-04-21 17:55:10 -07:00
Rohan Varma
b7d5a0cf10 [c10d] sequence number in process group (#55319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319

Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debuggability.

The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0, and broadcasted to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.

This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
ghstack-source-id: 127011277

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27562769

fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
2021-04-21 10:59:24 -07:00
Sam Estep
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
Rohan Varma
ce05b7a324 [c10d] Remove deprecated use of torch.LongTensor, torch.ByteTensor (#55861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55861

APIs such as torch.LongTensor and torch.ByteTensor are deprecated and
the recommended API is torch.tensor(args, dtype=...). Use this API in
distributed_c10d.
ghstack-source-id: 126777875
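A minimal sketch of the replacement pattern:
```
import torch

# Deprecated: torch.LongTensor([1, 2, 3]) and torch.ByteTensor(8)
lengths = torch.tensor([1, 2, 3], dtype=torch.long)
buffer = torch.empty(8, dtype=torch.uint8)
```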

Test Plan: CI

Reviewed By: pbelevich

Differential Revision: D27726600

fbshipit-source-id: 07eb8168d93697593589002c93c3903ce29431ef
2021-04-18 14:12:02 -07:00
Rohan Varma
bbc4c775bb [reland][c10d] monitored_barrier: ensure all ranks pass or none do (#55990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990

Reland of https://github.com/pytorch/pytorch/pull/55197, which failed a Windows test that was only run on master.

Disabled these tests for Windows, similar to how they are disabled on macOS. The reason for disabling is that they use the libuv transport, which does not have as robust error handling as TCP on Linux. The result is that healthy non-zero ranks don't throw immediately (like they do on Linux) but instead throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758424

fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
2021-04-14 12:26:54 -07:00
Rohan Varma
48c73d24b8 Revert D27523060: [c10d] monitored_barrier: ensure all ranks pass or none do
Test Plan: revert-hammer

Differential Revision:
D27523060 (a5290adea5)

Original commit changeset: fa05e4f8ad8a

fbshipit-source-id: aa59c1c3ab0ed5b124583a52aed0f93c3b93a05a
2021-04-13 21:33:09 -07:00
Rohan Varma
a5290adea5 [c10d] monitored_barrier: ensure all ranks pass or none do (#55197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197

From initial user feedback, one unexpected difference between monitored_barrier impl and barrier is the "all or nothing" semantics.

In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.

This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier will also fail. It does so by the following process, similar to acknowledgements:

Nonzero ranks call send()
Nonzero ranks call recv()

Rank 0 calls recv(), if this succeeds, rank 0 has acknowledged rank N as healthy
Once all ranks are acknowledged as healthy:
Rank 0 calls send() to all nonzero ranks to unblock them

Modified unittests to ensure the all or nothing failure behavior
ghstack-source-id: 126413088

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27523060

fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
2021-04-13 19:01:25 -07:00
Rohan Varma
19f15317a0 [BE][Docs] Improve dist.new_group doc (#55660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55660

Noticed this doc was missing clarification on nccl env vars that
init_process_group docs have. Also, specify default behavior when backend=None
is passed in.
ghstack-source-id: 126251116

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27672208

fbshipit-source-id: 2e79d297174e135173bceb059450ea267367bde4
2021-04-11 16:16:18 -07:00
Szymon Migacz
8e78a1b084 [Resubmit] Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py (#52757)
Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/51739
Fixes https://github.com/pytorch/pytorch/issues/51428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52757

Reviewed By: cbalioglu

Differential Revision: D26646843

fbshipit-source-id: df4962ef86ea465307e39878860b9fbbcc958d52
2021-04-06 11:32:26 -07:00
Rohan Varma
19a0eb4cdb [c10d] Monitored barrier: option to collect all failed ranks (#55010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010

Follow-up change that adds a flag giving monitored barrier the option to collect all the failed ranks and then throw, instead of just throwing on the first one. This is useful because monitored barrier can now pick up all hanging ranks instead of just one.

This is done by passing in a flag `wait_all_ranks=True`.
ghstack-source-id: 125699839
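A minimal usage sketch (monitored barrier is implemented on the Gloo process group; the timeout value is illustrative):
```
from datetime import timedelta

import torch.distributed as dist

# Collect every hanging rank before raising, instead of failing on the first one.
dist.monitored_barrier(timeout=timedelta(seconds=30), wait_all_ranks=True)
```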

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27447787

fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
2021-04-04 21:39:54 -07:00
Rohan Varma
d185719455 Expose dist.monitored_barrier() API (#53787)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53787

Per title, exposes a Python-based monitored barrier API that we can use as part of debuggability and that may be useful for user applications.
ghstack-source-id: 125124315

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26965127

fbshipit-source-id: 6c7826e63758462e3e5111f28cced54cba76a758
2021-03-29 14:15:37 -07:00
Jeff Yang
0435059ddf docs: fix docstring signature in all_reduce_multigpu (#54665)
Summary:
fixes https://github.com/pytorch/pytorch/issues/43500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54665

Reviewed By: ezyang

Differential Revision: D27340481

Pulled By: rohan-varma

fbshipit-source-id: d53c36b41dd26c7a791d3674a5b4b67daaadae13
2021-03-26 11:08:32 -07:00
Wanchao Liang
133000fe7a [distributed] add processgroup options as argument (#53663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53663

This adds the process group options as an optional argument to new_group
and init_process_group, which allows users to pass in an initialized
process group options object for Gloo and NCCL.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D26968857

Pulled By: wanchaol

fbshipit-source-id: 2ff73a009120b85e83ecde7c69956b731902abc2
2021-03-18 01:04:17 -07:00
Michael Suo
87b6702833 [distributed] make the pickler in distributed_c10d pluggable (#53060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53060

As title. We would like to use alternative pickler/unpickler
implementations, to make it possible to send objects over the wire that
are coming from a torch.package

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D26737317

Pulled By: suo

fbshipit-source-id: 6bdef9824e48ef657dcad72cc5a9114e6612ea4a
2021-03-01 21:37:48 -08:00
Howard Huang
b56f59ea20 Revert D26599390: [pytorch][PR] Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py
Test Plan: revert-hammer

Differential Revision:
D26599390 (075bbe0d6a)

Original commit changeset: d822658076f7

fbshipit-source-id: 6c4421f4de99794ea66780175af549cef9410a20
2021-02-24 05:38:34 -08:00
Szymon Migacz
075bbe0d6a Fix for incorrect usage of logging in torch/distributed/distributed_c10d.py (#51739)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/51428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51739

Reviewed By: bdhirsh

Differential Revision: D26599390

fbshipit-source-id: d822658076f7b08ebfde3dc9994159539490fda0
2021-02-23 22:30:37 -08:00
Rohan Varma
c255628134 [Collective APIs] Make python object collective API args consistent (#50625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625

Make API signatures consistent and provide default arguments similar to
the tensor collectives.
ghstack-source-id: 120718121

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D25932012

fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
2021-01-30 19:47:16 -08:00
Pritam Damania
16e5af41da Fix store based barrier to only use 'add'. (#49930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49930

Certain store implementations don't work well when we use get() and
add() on the same key. To avoid this issue, we only use add() in the store
based barrier. The buggy store implementations can't be properly fixed due to
legacy reasons.
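A minimal sketch of the add()-only counting idea (the helper, key name, and timeout here are hypothetical, not the actual implementation):
```
import time

def store_barrier(store, world_size, key="store_barrier", timeout_s=300):
    # add() both increments and returns the new value, so add(key, 0) acts as
    # a read of the counter without ever calling get() on the same key.
    arrived = store.add(key, 1)
    deadline = time.monotonic() + timeout_s
    while arrived < world_size:
        if time.monotonic() > deadline:
            raise RuntimeError("store-based barrier timed out")
        time.sleep(0.01)
        arrived = store.add(key, 0)
```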

Test Plan:
1) unit tests.
2) waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D25725386

fbshipit-source-id: 1535e2629914de7f78847b730f8764f92cde67e7
2021-01-05 12:46:24 -08:00
Jagadish Krishnamoorthy
c115957df0 [distributed] Provide parameter to pass GPU ID in barrier function (#49069)
Summary:
For a multi-GPU node, the mapping between a rank and its corresponding GPU can differ.
Provide an optional parameter to specify the GPU device number for the
allreduce operation in the barrier function.

Add test cases to validate barrier device_ids.
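A minimal sketch, assuming an NCCL group where this rank drives GPU 1 (a hypothetical mapping) rather than the GPU equal to its rank:
```
import torch.distributed as dist

local_device = 1  # hypothetical: this rank's GPU is not equal to its rank
dist.barrier(device_ids=[local_device])
```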

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Fixes https://github.com/pytorch/pytorch/issues/48110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069

Reviewed By: mrshenli

Differential Revision: D25658528

Pulled By: rohan-varma

fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
2021-01-05 11:27:54 -08:00
Samuel Marks
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants; however, it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
Pritam Damania
1043ecf68d Use store based barrier only for certain store types. (#49694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49694

The store based barrier introduced in
https://github.com/pytorch/pytorch/pull/49419 broke for certain store types.
This is a quick fix to resolve the issues for other store types.
ghstack-source-id: 119006874

Test Plan: 1) waitforbuildbot

Reviewed By: ppwwyyxx, rohan-varma

Differential Revision: D25668404

fbshipit-source-id: 751fb8b229ad6f50ee9c50f63a70de5a91c9eda5
2020-12-21 18:41:28 -08:00
Pritam Damania
43f6da787e Use store based barrier in init_process_group. (#49419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49419

As described in https://github.com/pytorch/pytorch/issues/48110, the
newly introduced `barrier()` in `init_process_group` messes up NCCL
communicator state since it uses a bunch of default devices to perform an
allreduce which simulates a barrier(). As a result, subsequent NCCL operations
might not behave as expected.
ghstack-source-id: 118861776

Test Plan:
1) unit test added.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D25566550

fbshipit-source-id: ab083b67b634d7c515f4945deb228f959b27c936
2020-12-18 00:02:54 -08:00
Pritam Damania
db2ecefc01 [reland] Support torch.distributed.irecv(src=None, ...) (#49383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49383

Reland of https://github.com/pytorch/pytorch/pull/47137
ghstack-source-id: 118735407

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D25551910

fbshipit-source-id: 2e1f2f77e7c69204056dfe6ed178e8ad7650ab32
2020-12-16 19:39:23 -08:00
Omkar Salpekar
4b3f05a471 [Docs] Updating init_process_group docs to indicate correct rank range (#49131)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49131

Users frequently assume the correct range of ranks is 1 ...
`world_size`. This PR updates the docs to indicate that the correct rank range
users should specify is 0 ... `world_size` - 1.

Test Plan: Rendering and Building Docs

Reviewed By: mrshenli

Differential Revision: D25410532

fbshipit-source-id: fe0f17a4369b533dc98543204a38b8558e68497a
2020-12-16 10:26:04 -08:00
Pritam Damania
f2ba3c1621 Use group.WORLD appropriately in process group initialization. (#48767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48767

As part of investigating
https://github.com/pytorch/pytorch/issues/48464, I realized some weird
inconsistency in how we use `_default_pg` and `group.WORLD`. `group.WORLD`
apparently was an `object()` and never changed despite `_default_pg` changing.
In this sense, `group.WORLD` was being used as a constant to refer to the default
pg, but wasn't of type PG at all. In fact the passed in group is also compared
via `==` to `group.WORLD` in many places, and it just worked since the default
argument was `group.WORLD`.

To clean this up, I got rid of `_default_pg` completely and instead used
`group.WORLD` as the default pg throughout the codebase. This also fixes the
documentation issues mentioned in
https://github.com/pytorch/pytorch/issues/48464.

#Closes: https://github.com/pytorch/pytorch/issues/48464
ghstack-source-id: 118459779

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25292893

fbshipit-source-id: 9a1703c71610aee2591683ab60b010332e05e412
2020-12-13 17:53:42 -08:00
Pritam Damania
7584161dfa Enhance new_group doc to mention using NCCL concurrently. (#48872)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48872

Using NCCL communicators concurrently is not safe and this is
documented in NCCL docs.

However, this is not documented in PyTorch and we should add documentation for
ProcessGroupNCCL so that users are aware of this limitation.
ghstack-source-id: 118148014

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25351778

fbshipit-source-id: f7f448dc834c47cc1244f821362f5437dd17ce77
2020-12-09 12:29:15 -08:00
Rohan Varma
b77ca9e829 [Docs] Add examples for new object-based c10d APIs (#43932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43932

Adds some basic examples to the documentation for each of the newly added
object-based collectives.
ghstack-source-id: 117965966

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D23441838

fbshipit-source-id: 91344612952cfcaa71f08ccf2a2c9ed162ca9c89
2020-12-07 14:35:14 -08:00
Rohan Varma
02d89f9f1d scatter_object_list API for c10d (#43930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43930

Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list which is an API to scatter arbitrary picklable objects to all the other ranks.

The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter.

Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as their true sizes.

Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.

It only works for Gloo because NCCL doesn't support scatter.
ghstack-source-id: 117904065
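A minimal usage sketch with a two-rank Gloo group:
```
import torch.distributed as dist

output = [None]  # receives this rank's object
if dist.get_rank() == 0:
    inputs = [{"cfg": "a"}, {"cfg": "b"}]  # one picklable object per rank
else:
    inputs = None
dist.scatter_object_list(output, inputs, src=0)
# output[0] now holds the object scattered to this rank
```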

Reviewed By: mrshenli

Differential Revision: D23430686

fbshipit-source-id: f033b89cd82dadd194f2b036312a98423449c26b
2020-12-04 18:55:57 -08:00
Pritam Damania
4b8d965f18 Revert D25292656: [pytorch][PR] Support torch.distributed.irecv(src=None, ...)
Test Plan: revert-hammer

Differential Revision:
D25292656 (4eb4db7c30)

Original commit changeset: beb018ba0b67

fbshipit-source-id: 5a13055e50ed90731fee431e81c09a1871f6cc03
2020-12-04 16:57:06 -08:00
Tom Birch
4eb4db7c30 Support torch.distributed.irecv(src=None, ...) (#47137)
Summary:
Calling torch.distributed.irecv(src=None) fails with "The global rank None is not part of the group". This change calls recv_anysource if src is None. Tested locally with MPI backend.
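A minimal sketch of receiving from any source after this change, assuming an MPI backend group as in the local test:
```
import torch
import torch.distributed as dist

buf = torch.zeros(4)
if dist.get_rank() == 0:
    req = dist.irecv(buf)  # src=None: accept a message from any sender
    req.wait()
else:
    dist.send(torch.full((4,), float(dist.get_rank())), dst=0)
```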

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47137

Reviewed By: heitorschueroff

Differential Revision: D25292656

fbshipit-source-id: beb018ba0b676924aeaabeb4a4d6acf96e4a1926
2020-12-04 13:56:36 -08:00
Xu Zhao
915050ed66 Fix typing errors in torch.distributed.distributed_c10d.* (#47532)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47532

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952501

Pulled By: xuzhao9

fbshipit-source-id: 9b2dd1069eb1729c24be00f46da60d6a0439a8da
2020-11-16 23:27:51 -08:00
Mingzhe Li
66f9b1de1b [NCCL] enable p2p tests (#47797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47797

NCCL p2p tests previously had hang issues because there were some unexpected context switches. For example, process 1, which is supposed to use only GPU1, could end up using GPU0 because the device was not set explicitly.
ghstack-source-id: 116461969

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24863808

fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e
2020-11-12 10:44:50 -08:00
Omkar Salpekar
32b4b51254 [Docs] Minor doc fixes for init_process_group (#47644)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47644

Minor update to the init_process_group docs.
ghstack-source-id: 116441798

Test Plan: CI

Reviewed By: jiayisuse, mrshenli

Differential Revision: D24633432

fbshipit-source-id: fbd38dab464ee156d119f9f0b22ffd0e416c4fd7
2020-11-11 15:21:30 -08:00
Xu Zhao
73a3e70b24 Add type annotations for torch._C._distributed_c10d module. (#46623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46623

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24761606

Pulled By: xuzhao9

fbshipit-source-id: 827eaf2502e381ee24d36741c1613b4c08208569
2020-11-06 01:28:48 -08:00
Rohan Varma
c7183c9878 Fix object-based collectives API to use torch.cuda.current_device instead of (#46897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897

These APIs implicitly assumed that the GPU index for a rank equals the rank index, but
that is not necessarily true. For example, the first GPU could be used for a
different purpose, and rank 0 could use GPU 1, rank 1 GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.

Also adds/tidies up some documentation.
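
A sketch of the documented expectation; the simple rank == GPU index mapping below is only an example, and any explicit mapping works:
```
import torch
import torch.distributed as dist

rank = dist.get_rank()
torch.cuda.set_device(rank)  # tell the object-based collectives which GPU to use

objs = [None] * dist.get_world_size()
dist.all_gather_object(objs, {"rank": rank})
```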
ghstack-source-id: 115359633

Test Plan: Modified unittests

Reviewed By: divchenko

Differential Revision: D24556177

fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
2020-10-28 18:12:50 -07:00
Omkar Salpekar
5e2f17d77a Add NCCL_ASYNC_ERROR_HANDLING to docs (#46856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46856

Add reference to NCCL_ASYNC_ERROR_HANDLING in the pytorch docs,
similar to how NCCL_BLOCKING_WAIT is currently described.
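
A minimal sketch of turning the feature on; the variable must be set before the process group is initialized:
```
import os
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # surface NCCL errors/timeouts instead of hanging
```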
ghstack-source-id: 115186877

Test Plan: CI, verifying docs change

Reviewed By: jiayisuse

Differential Revision: D24541822

fbshipit-source-id: a0b3e843bc6392d2787a4bb270118f2dfda5f4ec
2020-10-26 14:41:32 -07:00
Luca Wehrstedt
f230245c06 Revert D24422354: [pytorch][PR] fix-process-group-counter
Test Plan: revert-hammer

Differential Revision:
D24422354 (caed29a069)

Original commit changeset: 32493cc2001d

fbshipit-source-id: 9b633f738ea555f45031056689f780dde8eda859
2020-10-23 08:04:37 -07:00
Brian Hirsh
db83ddcb86 small doc fix (#46599)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46599

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24426181

Pulled By: bdhirsh

fbshipit-source-id: d0900d5c43574c80f1bf614824eafd21ba6a9caf
2020-10-21 20:17:31 -07:00
Joel Lamy-Poirier
caed29a069 fix-process-group-counter (#46563)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46561

A minimal fix to issue https://github.com/pytorch/pytorch/issues/46561. Increment the global variable `_group_count` at the same time as the other global variables so the global state remains consistent in case of a failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46563

Reviewed By: zou3519

Differential Revision: D24422354

Pulled By: mrshenli

fbshipit-source-id: 32493cc2001d21ad366c396d16c303936959434e
2020-10-21 13:03:53 -07:00
Alexander Golynski
e7e919fc34 Add warning on ProcessGroup and ProcessGroup::Work APIs (#46220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46220

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24294437

Pulled By: gmagogsfm

fbshipit-source-id: 198f8e5760beeb1d18740f971647d2537afb3dd6
2020-10-14 16:27:37 -07:00
Brian Hirsh
1f791c06f0 adding BAND/BOR/BXOR reduce ops to unsupported list for complex numbers. added tests (#46270)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46270

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24284702

Pulled By: bdhirsh

fbshipit-source-id: 7e6c3fce83a4367808a638f0400999399b2c35b0
2020-10-14 08:48:14 -07:00
Brian Hirsh
c02efdefa8 adding complex support for distributed functions and . fix #45760 (#45879)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45879

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24127949

Pulled By: bdhirsh

fbshipit-source-id: 8061b14fa1c0adbe22b9397c2d7f92618556d223
2020-10-12 12:44:47 -07:00
Mingzhe Li
281463ba0b [NCCL] Enable send/recv tests (#45994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994

Send/Recv tests were disabled because of https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24172484

fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
2020-10-09 15:00:39 -07:00
Mingzhe Li
59083d6176 [NCCL] Support NCCL Send/Recv (#44921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
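
A minimal sketch of the point-to-point ops this enables, assuming two ranks pinned to GPUs 0 and 1:
```
import torch
import torch.distributed as dist

rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.ones(8, device="cuda")

if rank == 0:
    dist.send(t, dst=1)
elif rank == 1:
    dist.recv(t, src=0)
```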
ghstack-source-id: 113592785

Test Plan: unittest

Reviewed By: jiayisuse

Differential Revision: D23709848

fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
2020-10-05 18:27:57 -07:00
Pritam Damania
a2b4177c5b Add barrier() at the end of init_process_group and new_group. (#45181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181

`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race: after initializing the process group on, say, rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.

To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.

Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.

#Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112

Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.

Reviewed By: mrshenli

Differential Revision: D23858025

fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
2020-09-25 15:46:59 -07:00
Rohan Varma
bee97d5be0 Document the default behavior for dist.new_group() when ranks=None (#44000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44000

This wasn't documented, so add a doc saying all ranks are used when
ranks=None.
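
A minimal sketch of the documented default, assuming an initialized process group:
```
import torch.distributed as dist

full_group = dist.new_group()              # ranks=None: spans every rank in the default group
pair_group = dist.new_group(ranks=[0, 1])  # explicit subset, for comparison
```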
ghstack-source-id: 111206308

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D23465034

fbshipit-source-id: 4c51f37ffcba3d58ffa5a0adcd5457e0c5676a5d
2020-09-17 11:30:37 -07:00
Rohan Varma
fbea2ee917 broadcast_object API for c10d (#43887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887

As part of addressing #23232, this PR adds support for `broadcast_object_list`, which is an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so it would be good for PyTorch to support this natively.

The implementation follows a similar approach to https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcast, and the operation is in place, meaning that all ranks in the group will have their input list modified to contain the objects broadcast from the src rank.

Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
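
A minimal usage sketch, assuming a Gloo process group is already initialized (the objects below are illustrative):
```
import torch.distributed as dist

if dist.get_rank() == 0:
    objects = ["config", {"lr": 0.1}, 42]  # src rank supplies the real objects
else:
    objects = [None, None, None]           # placeholders of the same length

dist.broadcast_object_list(objects, src=0)  # in place: every rank now holds rank 0's objects
```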
ghstack-source-id: 111180436

Reviewed By: mrshenli

Differential Revision: D23422577

fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
2020-09-01 18:54:17 -07:00
Akihiro Nitta
f17d7a5556 Fix exception chaining in torch/ (#43836)
Summary:
## Motivation
Fixes https://github.com/pytorch/pytorch/issues/43770.

## Description of the change
This PR fixes exception chaining only in files under `torch/` where appropriate.
To fix exception chaining, I used either:
1. `raise new_exception from old_exception` where `new_exception` itself seems not descriptive enough to debug or `old_exception` delivers valuable information.
2. `raise new_exception from None` where raising both of `new_exception` and `old_exception` seems a bit noisy and redundant.
I subjectively chose which one to use from the above options.
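
A small sketch of the two patterns above (the function and messages are illustrative, not from the PR):
```
import json

def load_config(path):
    try:
        with open(path) as f:
            return json.load(f)
    except OSError as e:
        # option 1: chain, keeping the original error as context for debugging
        raise RuntimeError(f"could not load config from {path}") from e
    except json.JSONDecodeError:
        # option 2: suppress the original exception when it is just noise
        raise RuntimeError(f"{path} is not valid JSON") from None
```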

## List of lines containing raise in except clause:
I wrote [this simple script](https://gist.github.com/akihironitta/4223c1b32404b36c1b349d70c4c93b4d) using [ast](https://docs.python.org/3.8/library/ast.html#module-ast) to list lines where `raise`ing in `except` clause.

- [x] 000739c31a/torch/jit/annotations.py (L35)
- [x] 000739c31a/torch/jit/annotations.py (L150)
- [x] 000739c31a/torch/jit/annotations.py (L158)
- [x] 000739c31a/torch/jit/annotations.py (L231)
- [x] 000739c31a/torch/jit/_trace.py (L432)
- [x] 000739c31a/torch/nn/utils/prune.py (L192)
- [x] 000739c31a/torch/cuda/nvtx.py (L7)
- [x] 000739c31a/torch/utils/cpp_extension.py (L1537)
- [x] 000739c31a/torch/utils/tensorboard/_pytorch_graph.py (L292)
- [x] 000739c31a/torch/utils/data/dataloader.py (L835)
- [x] 000739c31a/torch/utils/data/dataloader.py (L849)
- [x] 000739c31a/torch/utils/data/dataloader.py (L856)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L186)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L189)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L424)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1279)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1283)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1356)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1388)
- [x] 000739c31a/torch/testing/_internal/common_utils.py (L1391)
- [ ] 000739c31a/torch/testing/_internal/common_utils.py (L1412)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L310)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L329)
- [x] 000739c31a/torch/testing/_internal/codegen/random_topo_test.py (L332)
- [x] 000739c31a/torch/testing/_internal/jit_utils.py (L183)
- [x] 000739c31a/torch/testing/_internal/common_nn.py (L4789)
- [x] 000739c31a/torch/onnx/utils.py (L367)
- [x] 000739c31a/torch/onnx/utils.py (L659)
- [x] 000739c31a/torch/onnx/utils.py (L892)
- [x] 000739c31a/torch/onnx/utils.py (L897)
- [x] 000739c31a/torch/serialization.py (L108)
- [x] 000739c31a/torch/serialization.py (L754)
- [x] 000739c31a/torch/distributed/rpc/_testing/faulty_agent_backend_registry.py (L76)
- [x] 000739c31a/torch/distributed/rpc/backend_registry.py (L260)
- [x] 000739c31a/torch/distributed/distributed_c10d.py (L184)
- [x] 000739c31a/torch/_utils_internal.py (L57)
- [x] 000739c31a/torch/hub.py (L494)
- [x] 000739c31a/torch/contrib/_tensorboard_vis.py (L16)
- [x] 000739c31a/torch/distributions/lowrank_multivariate_normal.py (L100)
- [x] 000739c31a/torch/distributions/constraint_registry.py (L142)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43836

Reviewed By: ailzhang

Differential Revision: D23431212

Pulled By: malfet

fbshipit-source-id: 5f7f41b391164a5ad0efc06e55cd58c23408a921
2020-08-31 20:26:23 -07:00
Shen Li
2f52748515 Publish all_gather_object and gather_object docs (#43772)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43772

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D23398495

Pulled By: rohan-varma

fbshipit-source-id: 032e1d628c0c0f2dec297226167471698c56b605
2020-08-31 13:28:00 -07:00
Rohan Varma
f22aa601ce All Gather and gather APIs for Python Objects (#42189)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189

Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.

As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:

`allgather_object` and `gather_object` to support gather/allgather of generic, picklable Python objects. This has been a long-requested feature, so PyTorch should provide these helpers built-in.

The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Communicate tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load

Note that the API is designed to match the tensor-based collectives other than supporting `async_op`. For now, it is a blocking call. If we see demand to support `async_op`, we will have to make more progress on merging work/future to support this.

If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
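
A minimal usage sketch of the two helpers, assuming an initialized (e.g. Gloo) process group:
```
import torch.distributed as dist

rank, world_size = dist.get_rank(), dist.get_world_size()
obj = {"rank": rank, "payload": list(range(rank))}  # any picklable object

gathered = [None] * world_size
dist.all_gather_object(gathered, obj)       # every rank ends up with every object

dst_list = [None] * world_size if rank == 0 else None
dist.gather_object(obj, dst_list, dst=0)    # only dst=0 receives the objects
```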
ghstack-source-id: 109322433

Reviewed By: mrshenli

Differential Revision: D22785387

fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
2020-08-06 13:30:25 -07:00
Tongzhou Wang
3001facd7a [doc] [distributed] fix typo (#39264)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39264

Differential Revision: D21791426

Pulled By: mrshenli

fbshipit-source-id: c3aa8fda1893aa3c0f9ad3db7da25f1ee80303e8
2020-06-01 19:19:46 -07:00
Quang Luong
9d7a79ac27 [Caffe2] raise exceptions instead of str (#37744)
Summary:
Some exceptions are not correctly wrapped inside a class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37744

Differential Revision: D21388197

Pulled By: mrshenli

fbshipit-source-id: 2d69e2543c2e05116c367d137968b982c254d2dc
2020-05-05 13:34:33 -07:00
Pritam Damania
136d84dd38 Enhance error message for MPI unavailability. (#36781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36781

Mention that you need to build PyTorch from source to enable MPI.
Additional context:
https://discuss.pytorch.org/t/distributed-pytorch-with-mpi/77106.
ghstack-source-id: 102341246

Test Plan: waitforbuildbot

Differential Revision: D21082009

fbshipit-source-id: 3a3286349e71322726a341dfc743b5978c7d9a56
2020-04-18 14:45:44 -07:00
Sudarshan Raghunathan
739351fac4 Fix linter warning: replace f-strings with str.format for Py2 compat (#35492)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35492

Test Plan: Imported from OSS

Differential Revision: D20998727

Pulled By: drdarshan

fbshipit-source-id: 54f34a7649a2772ad030b456f1b50aba831ce2e0
2020-04-13 18:43:58 -07:00
Feng Tian
762270c51f add c10d dynamic loading mechanism and unit test (#28068)
Summary:
The original behavior of PyTorch c10d only supports built-in c10d backends, such as
nccl/gloo/mpi. This patch extends c10d to support dynamically
loading 3rd party communication libraries that are derived from the ProcessGroup base class.

related RFC is in: https://github.com/pytorch/pytorch/issues/27955

With this change, the user just needs to specify a 3rd party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. For how to develop a new 3rd party c10d backend
through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp
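
A sketch of the intended front-end usage; "my_backend" is a hypothetical name provided by a third-party cpp extension, not a real backend:
```
import torch.distributed as dist

# c10d will try to load the cpp extension that provides "my_backend" automatically.
dist.init_process_group(backend="my_backend", init_method="env://",
                        rank=0, world_size=1)
```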
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068

Differential Revision: D19174838

Pulled By: agolynski

fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62
2020-04-02 15:46:51 -07:00
Dhiraj D Kalamkar
945d7a7408 Add All-to-all comms support to distributed module and MPI backend (#32361)
Summary:
As described in https://github.com/pytorch/pytorch/issues/32345, this is a prototype implementation that adds an alltoall communication primitive to the torch.distributed module and the ProcessGroup abstract interface. It also implements alltoall in the ProcessGroupMPI backend.
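
A minimal sketch, assuming the MPI backend and that all_to_all_single is the Python entry point (equal splits of one element per rank):
```
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
inp = torch.arange(world_size, dtype=torch.float) + dist.get_rank() * world_size
out = torch.empty(world_size)
dist.all_to_all_single(out, inp)  # rank j receives element j from every rank, in rank order
```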

mnaumovfb JianpingChen066 dmudiger srinivas212 Jianhui-Li mshiryaev ftian1

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini xush6528 osalpekar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32361

Reviewed By: mrshenli

Differential Revision: D20635481

Pulled By: srinivas212

fbshipit-source-id: 3dd0af800ce55d02f02813cde550e3a0f1a287d2
2020-04-01 08:57:12 -07:00
Ankesh Anand
45c45195cd Remove warning about building from source to use the NCCL backend (#34051)
Summary:
I think this warning isn't true anymore, and the NCCL backend works without PyTorch needing to be built from source.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34051

Differential Revision: D20195310

Pulled By: ezyang

fbshipit-source-id: 14f879a8c43ea5efdbdf0f638792ea2b90011f4a
2020-03-02 13:43:43 -08:00
Rohan Varma
6cb9e6b015 Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434

Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not time out (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
2020-02-19 17:17:17 -08:00
Rohan Varma
d4e4beddc4 Revert D19871946: [distributed] pass in timeout to TCP store when initializing
Test Plan: revert-hammer

Differential Revision:
D19871946

Original commit changeset: dd002180c4c8

fbshipit-source-id: 40b0676c51e43366c0700e81d16cc7927ee8efc2
2020-02-16 19:37:44 -08:00
Rohan Varma
df47a3abe0 [distributed] pass in timeout to TCP store when initializing (#33325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325

Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875

Test Plan: Added a UT

Differential Revision: D19871946

fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
2020-02-16 17:59:44 -08:00
Brian Wignall
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
Alexander Golynski
23695ab23f Moving python allgather_coalesced impl from Py to C. (#29059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29059
This is a resubmit of reverted diff D18209289 ( PR #28857 ).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: aecfd7206d70829f0cac66182bf02fccee410fed
2019-11-04 08:34:34 -08:00
Shen Li
9041e29d94 Revert D18209289: Moving python allgather_coalesced impl from Py to C
Test Plan: revert-hammer

Differential Revision:
D18209289

Original commit changeset: c5a4c4a1aaa0

fbshipit-source-id: d4865e3f8c4eeee285c711e5c2250b8c9f9b0d25
2019-11-01 11:23:41 -07:00
Alexander Golynski
22a346ee34 Moving python allgather_coalesced impl from Py to C
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28857

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: mrshenli

Differential Revision: D18209289

fbshipit-source-id: c5a4c4a1aaa07286a05a7c842dda428eeb46f696
2019-11-01 10:34:23 -07:00
Alexander Golynski
45dab56153 adding python all_gather coalesced functionality and testing. (#28634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28634

caveat 1: this only works in sync mode.
caveat 2: this is going to go away and be replaced by c++ implementation

Test Plan: buck test caffe2/test:distributed_gloo -- test_all_gather_coalesced

Reviewed By: mrshenli

Differential Revision: D18123422

fbshipit-source-id: cfb9950d5d54c6181a5240e7cc9fed88ed47f5d9
2019-10-28 08:12:36 -07:00
Shihao Xu
59402f51cf Make init_method url appending step re-usable by both init_process_group and init_model_parallel(init_rpc) (#28226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226

# Goal

Rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.

The roadblock is that there is a special step in `init_process_group` where the `rank` and `world_size` arguments passed to `init_process_group(..)` are appended to the `init_method` URL string.

We need to make this argument appending step common and re-usable for both `init_process_group` and `init_model_parallel`.

# Solution

- Put argument appending inside of `rendezvous` function.
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.

Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```

```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```

```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```

```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```

Differential Revision: D5524494

fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
2019-10-23 21:51:08 -07:00
zou3519
e5d6b75319 Bag of documentation fixes; fix more sphinx warnings (#27850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850

Many of these are real problems in the documentation (i.e., link or
bullet point doesn't display correctly).

Test Plan: - built and viewed the documentation for each change locally.

Differential Revision: D17908123

Pulled By: zou3519

fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
2019-10-15 07:31:14 -07:00
Pritam Damania
646e214706 ProcessGroupNCCL should respect timeout passed in to init_process_group. (#27224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27224

As part of adding error handling to NCCL, we are now able to specify a
timeout for operations using ProcessGroupNCCL. Although, this timeout had a
default of 10 seconds and didn't respect the timeout specified in
init_process_group.

In this change, I've ensured we pass the appropriate timeout to
ProcessGroupNCCL.
ghstack-source-id: 91283548

Test Plan:
Added unit test to verify timeout passed in to init_process_group is
respected.

Differential Revision: D17717992

fbshipit-source-id: c73320187f1f3b2693ba1e177d80646e282d01a2
2019-10-04 13:28:57 -07:00
Vikas Mehta
3a18e2e768 support re-creating/destroying process groups when some trainers recover after failures (#26912)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26912

The group name is used as a prefix in the c10d store, and without a consistent name the process group cannot be initialized.

When a process group doesn't have an explicit name (only the WORLD (default) process group can have an explicit name), we use the global _group_counter to generate the name. We need to reset the counter on destruction so that a consistent value is generated when we re-create process groups after some trainers recover from failure.

Test Plan: existing tests passed

Reviewed By: mrshenli

Differential Revision: D17594268

fbshipit-source-id: 17f4d2746584dadaa5d468085d871ff3e95a1c84
2019-09-27 16:16:58 -07:00
Pieter Noordhuis
ebdb32c749 Remove global group name tracking for ProcessGroupNCCL (#25905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25905

Now that we can detect and recover from failures in NCCL we should
allow processes that are started at different times (and perhaps have
had previous NCCL process group instances) to eventually be part of
the same process group. Keeping track of group names in global
variables prevents that, because the processes will be out of sync.

This commit removes the global group name maps and defers
responsibility of isolating access to the same store from multiple
process groups to the store itself. Users can use `c10d::PrefixStore`
to derive new store instances whose keyspace is scoped to some
prefix. Functionally, this is identical to keeping a global map and
using a group name, but also gives more flexibility to the front-end
API to reset state and have processes that have started at different
times to join the same process group.
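
A minimal sketch of the scoping pattern described above, assuming the Python-side store bindings (host, port, and prefix are illustrative):
```
import datetime
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, 2, True, datetime.timedelta(seconds=30))
scoped = dist.PrefixStore("trainer_group_1/", store)  # keys are isolated under the prefix
scoped.set("status", "ready")
```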
ghstack-source-id: 89804865

Test Plan: Tests pass.

Differential Revision: D17281416

fbshipit-source-id: eab3b48463a9b0ef24aedeca76e2bb970b9f33ef
2019-09-11 06:56:33 -07:00
Pieter Noordhuis
500e72aaa5 Make scatter/gather arguments optional (#25575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25575

For both scatter and gather, only the source and destination rank,
respectively, need to supply a list of tensors. The `scatter_list` and
`gather_list` arguments were mandatory, however, and this has resulted
in some confusion. This commit makes both the `scatter_list` and
`gather_list`, and the `src` and `dst` arguments optional.

Closes #25463.
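
A minimal sketch of the relaxed argument requirements, assuming an initialized process group:
```
import torch
import torch.distributed as dist

rank, world = dist.get_rank(), dist.get_world_size()
t = torch.full((4,), float(rank))

# Only the destination rank supplies gather_list.
gather_list = [torch.empty(4) for _ in range(world)] if rank == 0 else None
dist.gather(t, gather_list, dst=0)

# Only the source rank supplies scatter_list.
out = torch.empty(4)
scatter_list = [torch.full((4,), float(i)) for i in range(world)] if rank == 0 else None
dist.scatter(out, scatter_list, src=0)
```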

Test Plan: Imported from OSS

Differential Revision: D17164253

fbshipit-source-id: a16bc208c87a1c96163c1a86d4a7ca8634a26f95
2019-09-03 12:27:05 -07:00
Pieter Noordhuis
493f7bd817 Error phrasing in torch.distributed helper functions
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25574

Test Plan: Imported from OSS

Differential Revision: D17164254

fbshipit-source-id: 13dbcffd67c2b5425c722b2b21765345a85a3872
2019-09-03 12:27:01 -07:00
jfc4050
590619ab8c Support all_reduce a list of same-device tensors #21640 (#24949)
Summary:
addresses https://github.com/pytorch/pytorch/issues/21640 for CPU tensors and the Gloo backend.

Questions:
- ~~currently takes `AllreduceOptions`, since all of the options are the same. Would it be better to make a new `AllreduceCoalescedOptions` class?~~
- ~~I decided to inherit from `ProcessGroupGloo::AsyncWork` instead of `AsyncAllreduceWork` to shorten the inheritance chain a bit and for consistency with existing classes. However, this means that the two `getFunction` methods are copy-pasted. Would inheriting from `AsyncAllreduceWork` be preferable?~~
- ~~should the work class be named `AsyncCoalescedAllreduceWork` or `AsyncAllreduceCoalescedWork`?~~

thank you!
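
A minimal sketch, assuming the Python entry point is all_reduce_coalesced (CPU tensors, Gloo backend):
```
import torch
import torch.distributed as dist

tensors = [torch.ones(3), torch.full((5,), 2.0)]  # same device, different shapes
dist.all_reduce_coalesced(tensors, op=dist.ReduceOp.SUM)
# each tensor now holds the elementwise sum across all ranks
```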
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24949

Differential Revision: D17055580

Pulled By: mrshenli

fbshipit-source-id: e63b5fcaec6021053ea960776a09ee8cf11d1ec2
2019-08-28 10:57:37 -07:00
Max Wang
c5845c4482 Add support for reduce-scatter in c10d (#18844)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18844
ghimport-source-id: c6b2f0032c7c2212be2000a9c1f262f63d878a97

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18844 Add support for reduce-scatter in c10d**
* #18820 Refactor ProcessGroupNCCL collective primitives

Reviewed By: mrshenli

Differential Revision: D14768369

fbshipit-source-id: a9def7a0da6e9cd995e982371cc1e22f3df1a156
2019-04-26 13:46:57 -07:00
Kutta Srinivasan
b7323a94ad Cleanup init_process_group (#19033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19033

torch.distributed.init_process_group() has had many parameters added, but the contract isn't clear. Adding documentation, asserts, and explicit args should make this clearer to callers and more strictly enforced.

Reviewed By: mrshenli

Differential Revision: D14813070

fbshipit-source-id: 80e4e7123087745bed436eb390887db9d1876042
2019-04-18 09:37:38 -07:00
Pieter Noordhuis
ce166d949d ProcessGroupMPI exists only if it is valid (#14809)
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to that of the global process group and they had a
special groupRank and groupSize field to capture the _real_ rank.

This also meant asymmetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group and an additional check was needed to verify
whether or not a process was indeed part of a process group or not.

This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
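
A minimal sketch of the now-uniform membership check (the subgroup ranks are illustrative):
```
import torch.distributed as dist
from torch.distributed.distributed_c10d import GroupMember

subgroup = dist.new_group(ranks=[0, 1])
if subgroup is GroupMember.NON_GROUP_MEMBER:
    pass  # this process is not in the subgroup; skip its collectives
else:
    dist.barrier(group=subgroup)
```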
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809

Differential Revision: D14887937

Pulled By: pietern

fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
2019-04-10 21:36:35 -07:00
Shen Li
8f9b11cf33 Propagate ProcessGroup timeout to Store (#16571)
Summary:
closes #16520

Hi pietern, I am not sure if this is the expected way to pass the timeout to `Store`; could you please help take a look? Thanks!

Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571

Differential Revision: D13954527

Pulled By: mrshenli

fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
2019-04-09 12:36:28 -07:00
Pieter Noordhuis
7a19d3c9e1 Allow override of backend in dist.new_group() (#18595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18595

There is no need to force the backend to be the same as the global
process group, as long as the backend is "nccl" or "gloo".
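
A minimal sketch of the override, assuming an NCCL default group (ranks are illustrative):
```
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
cpu_group = dist.new_group(ranks=[0, 1], backend="gloo")  # Gloo subgroup under an NCCL default group
```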

Reviewed By: mrshenli

Differential Revision: D14657204

fbshipit-source-id: 868817b9f219e3be8db0761a487f0027ed46663b
2019-04-04 14:23:03 -07:00
Shen Li
c0ad6747a9 Highlight NCCL all_reduce and all_gather requirements (#18741)
Summary:
See #18689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18741

Differential Revision: D14726874

Pulled By: mrshenli

fbshipit-source-id: a92404c653e3c62fc23fa3ccacfb3b2959b2e307
2019-04-03 09:50:29 -07:00
Igor Fedan
36237c4893 Fix flake8 issues in gragrad test
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18727

Differential Revision: D14724887

Pulled By: ifedan

fbshipit-source-id: 8c1db6460303e746e4aea0142302b8d61277c067
2019-04-02 12:45:18 -07:00
Pieter Noordhuis
bdfdf6c2b9 C++ handler for gradient reduction (#18251)
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.

This should enable the following:

* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.

This does not include any logic for computing bucket assignment, which
can be done separately; either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.

Also see #17757 and #13273.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251

Reviewed By: mrshenli

Differential Revision: D14571899

Pulled By: pietern

fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
2019-04-01 14:30:02 -07:00
Edward Yang
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments.  flake8-3 will
report an import unused; flake8-2 will not.  For now, I just
noqa'd all these sites.

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
Brian Johnson
fd04073e61 Fixed a formatting issue in doc comments (#17505)
Summary:
for torch.distributed.broadcast_multigpu per issue #17243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17505

Reviewed By: janewangfb

Differential Revision: D14373865

Pulled By: pietern

fbshipit-source-id: 6d7e91a3da50a7c9ba417ad852f7746eb5200043
2019-03-12 09:55:29 -07:00
Jane Wang
a2b9f7f484 add elastic zeus handler (#16746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16746

As titled. We use a special URL scheme, elasticzeus, for elastic zeus so that we don't need to change the public interface of init_process_group.

Reviewed By: aazzolini, soumith

Differential Revision: D13948151

fbshipit-source-id: 88939dcfa0ad93467dabedad6905ec32e6ec60e6
2019-02-27 11:29:59 -08:00
hysts
cbefd0323b Fix typo
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17521

Differential Revision: D14237482

Pulled By: soumith

fbshipit-source-id: 636e0fbe2c667d15fcb649136a65ae64937fa0cb
2019-02-26 20:23:34 -08:00
Teng Li
2d3cf98b49 Making dist.get_default_group private for PT1 release (#14767)
Summary:
When I wrote the frontend API, it was designed so that users would not use the default group directly in any functions. It should really be private.

All collectives are supposed to either use group.WORLD, or anything that comes out of new_group. That was the initial design.

We need to make a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767

Reviewed By: pietern

Differential Revision: D13330655

Pulled By: teng-li

fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
2018-12-04 19:22:24 -08:00
Pieter Noordhuis
11ef5191ff Enable tests for CPU tensors in test_distributed.py (#14572)
Summary:
These were not enabled after adding support in the Gloo backend. The
argument checks in ProcessGroupGloo raised an error in two cases:

* If the input tensor list to scatter was ``[None]`` on processes other
  than the source process.
* If the output tensor list to gather was ``[None]`` on processes other
  than the destination process.

This commit prepares these arguments explicitly instead of boxing them
at the process group call site.

This fixes #14536.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14572

Differential Revision: D13272812

Pulled By: pietern

fbshipit-source-id: 12cb0d85ec92f175365cbada585260f89330aad8
2018-11-29 21:39:02 -08:00
Teng Li
9127ab3866 Fixed new_group won't work for two or more different rank groups (#14529)
Summary:
This fixed two things:

(1) The NCCL backend didn't support 2 or more groups. This is because we need a group name in the ProcessGroupNCCL class to keep track of the ProcessGroup ID within that group name, and also the NCCL unique ID within that group name and process group ID. Otherwise, different processes will create different NCCL PGs in different orders and can clash on these names. This fixes the NCCL problem.

(2)  When using new_group, each rank should enter this function and update its global group name counter to ensure that every rank always operates on the same group name.

With both fixes: repro code in: https://github.com/pytorch/pytorch/issues/14528 should work with both NCCL and Gloo backends.

```
tengli@learnfair096:~$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=30000 ~/github_issues/nccl_group.py
rank: 0 - val: 6.0
rank: 2 - val: 6.0
rank: 3 - val: 6.0
rank: 1 - val: 6.0
rank: 4 - val: 22.0
rank: 6 - val: 22.0
rank: 5 - val: 22.0
rank: 7 - val: 22.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14529

Differential Revision: D13253434

Pulled By: teng-li

fbshipit-source-id: 8eb45882b996b06d951fc9a306d5de86a42e8b84
2018-11-29 19:57:47 -08:00
Teng Li
0d3cb91d8c Make env init_method support both env and args for rank and size (#14494)
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446

This was a supported behavior in old torch.distributed. We want to support it in the new release.

Tests should cover all combinations of scenarios where rank, world size, or both are set via either env vars or args.
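
A minimal sketch of mixing the env:// rendezvous with explicit arguments (addresses and sizes are illustrative):
```
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", init_method="env://",
                        rank=int(os.environ.get("RANK", 0)),
                        world_size=int(os.environ.get("WORLD_SIZE", 1)))
```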
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494

Differential Revision: D13253433

Pulled By: teng-li

fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
2018-11-29 18:48:20 -08:00
Pieter Noordhuis
4ec6bd7356 Add sourceRank() to ProcessGroup::Work (#14453)
Summary:
This function is only implemented for the subclasses where it makes
sense. If it's not overridden it will throw an error. Having this
function removes the need for a pointer passing hack to pass the
source rank of a recv operation back to the caller. Instead, the
caller can now call `source_rank` on the work object and achieve
the same result.

Closes #11804.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14453

Differential Revision: D13230898

Pulled By: pietern

fbshipit-source-id: ef38f48bfaca8ef9a364e5be122951bafc9f8e49
2018-11-29 09:16:53 -08:00
Pieter Noordhuis
0f62af4ab1 Add timeout kwarg to init_process_group (#14435)
Summary:
This applies to the gloo backend only. Timeout support for the NCCL and
MPI backends is tracked in issues #14371 and #14372 respectively.

When creating a new process group (either the global one or any subgroup
created through `new_group`) you can specify a timeout keyword
argument (of type datetime.timedelta). This timeout applies to all
collective operations executed against that process group, such that any
operation taking longer than the timeout will throw a runtime error.
Using a different, better catchable error type is tracked in #14433.

This fixes #14376.
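
A minimal sketch of the new keyword argument (the values are illustrative):
```
import datetime
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://",
                        timeout=datetime.timedelta(minutes=5))
slow_group = dist.new_group(ranks=[0, 1], timeout=datetime.timedelta(minutes=30))
```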
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14435

Differential Revision: D13234317

Pulled By: pietern

fbshipit-source-id: 973993b67994dc64861c0977cbb6f051ec9d87f6
2018-11-28 11:35:01 -08:00
Teng Li
b807970aea Tensor type checking and informative error messages for torch.distributed (#14204)
Summary:
This will address https://github.com/pytorch/pytorch/issues/13574

This error message should be more informative to the user for all the non-multi-GPU ops, since the Python bindings always dispatch to the multi-GPU ops.

test_distributed should cover all cases. Also tested both RuntimeErrors manually (see below).

```
>>> a = torch.ByteTensor([])
>>> b = [a, a]
>>> dist.all_reduce(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 809, in all_reduce
    _check_single_tensor(tensor, "tensor")
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 207, in _check_single_tensor
    "to be a torch.Tensor type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor to be a torch.Tensor type

>>> b = ["b"]
>>> dist.all_gather(b, a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 1006, in all_gather
    _check_tensor_list(tensor_list, "tensor_list")
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 225, in _check_tensor_list
    "to be a List[torch.Tensor] type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor_list to be a List[torch.Tensor] type
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14204

Differential Revision: D13131526

Pulled By: teng-li

fbshipit-source-id: bca3d881e41044a013a6b90fa187e722b9dd45f2
2018-11-19 18:30:54 -08:00
Tongzhou Wang
044d00516c Rename DistBackend -> Backend (#11830)
Summary:
Also add docs for get_backend, Backend, and reduce_op

fixes #11803

cc pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830

Differential Revision: D9927991

Pulled By: SsnL

fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
2018-11-07 11:58:12 -08:00
Teng Li
1b64c0f8fe Error msg on TCP backend (#13596)
Summary:
Clean it up from my queue:

https://github.com/pytorch/pytorch/issues/12721

```
>>> torch.distributed.init_process_group(backend="tcp")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 275, in init_process_group
    backend = DistBackend(backend)
  File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 55, in __new__
    raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backends for collective operations on CPU tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13596

Differential Revision: D12931196

Pulled By: teng-li

fbshipit-source-id: bb739b107ad7454e2e0a17430087161fedd4c392
2018-11-05 16:40:02 -08:00
Pieter Noordhuis
526460fc8b Use default timeout of 30 minutes for gloo backend (#13056)
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056

Reviewed By: teng-li

Differential Revision: D10558746

Pulled By: pietern

fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
2018-10-25 16:35:53 -07:00
Edward Yang
dfa03e94eb Fix mispelling of AVAILABLE. (#12016)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12016

Reviewed By: pietern

Differential Revision: D10010808

Pulled By: ezyang

fbshipit-source-id: ff6394ae9a53f7fdad2cadb4e019e09ac63bba96
2018-09-24 20:46:41 -07:00
Tongzhou Wang
540ef9b1fc Add distributed get_backend (#11715)
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.

cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715

Reviewed By: pietern

Differential Revision: D9889646

Pulled By: SsnL

fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
2018-09-18 10:56:24 -07:00
Pieter Noordhuis
7535d98ec4 Add message tag parameter to send/recv
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11490

Reviewed By: teng-li

Differential Revision: D9828116

Pulled By: pietern

fbshipit-source-id: 98be1ae84b6763ffb329e63c030c5e3ec0e748b7
2018-09-14 10:55:37 -07:00
Teng Li
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`.
The old DDP will go to `torch.nn.parallel.deprecated`.

Now `torch.nn.parallel.DDP` will use the c10d DDP.
Now `torch.distributed` will use the c10d frontend API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00