Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72657
The _ProcessGroupWrapper check needs to be gated on Gloo availability;
without the gate, this fails when Gloo is not available.
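A minimal sketch of the gating pattern, assuming the present-day helper `torch.distributed.is_gloo_available()` and that `_ProcessGroupWrapper` is importable from `torch.distributed.distributed_c10d` (the exact location is an assumption, not the PR's code):
```python
import torch.distributed as dist

def _is_wrapped_pg(pg):
    # Only touch _ProcessGroupWrapper when Gloo was compiled in; otherwise the
    # symbol does not exist and the isinstance check itself would fail.
    if dist.is_gloo_available():
        from torch.distributed.distributed_c10d import _ProcessGroupWrapper
        return isinstance(pg, _ProcessGroupWrapper)
    return False
```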
ghstack-source-id: 148837056
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34144848
fbshipit-source-id: 42a04918b968247f3259cd2cde5438e1265b04fe
(cherry picked from commit ba5de98939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71623
Enable gather_object on the NCCL backend, since we already support `dist.gather` on NCCL. This requires the user to set the current device properly.
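Usage sketch of the requirement above (the device assignment is illustrative; it assumes one GPU per rank on the node):
```python
import torch
import torch.distributed as dist

# With the NCCL backend, pin the current CUDA device before the object collective.
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
obj = {"rank": dist.get_rank()}
gather_list = [None] * dist.get_world_size() if dist.get_rank() == 0 else None
dist.gather_object(obj, gather_list, dst=0)
```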
ghstack-source-id: 147754836
Test Plan: distributed_nccl_spawn -r test_gather_object
Reviewed By: zou3519
Differential Revision: D33701042
fbshipit-source-id: 39cff22947a7cac69d0c923b956dc10f25353a6f
(cherry picked from commit 6e6eff497f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70605
broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.
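A hedged illustration of the bug class (variable names are illustrative, not the actual c10d internals): keep the accumulated length in int64 so a >2GB total does not wrap to a negative 32-bit value.
```python
import torch

sizes = [2**31, 1024]  # pretend byte lengths of two serialized objects
object_sizes_tensor = torch.tensor(sizes, dtype=torch.long)
total = int(object_sizes_tensor.sum().item())  # 2147484672, not a negative int32
# torch.empty(total, dtype=torch.uint8) now gets a valid, positive dimension.
```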
Test Plan:
Add a Tensor with a >2GB storage requirement (in distributed_test.py) to the object broadcast.
This Tensor is only added if the tests are running at Meta, as the GitHub tests will OOM.
Without the fix, the length overflows and the program requests a negatively sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With the fix, the test passes.
Tests used on a server with GPUs:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
buck test mode/dev-nosan //caffe2/test/distributed:distributed_gloo_spawn --local -- broadcast_object
Reviewed By: r-barnes
Differential Revision: D33405741
fbshipit-source-id: 972165f8297b3f5d475636e6127ed4a49adacab1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70336
broadcast_object_list cast the sum of all object lengths from long to int, causing overflows.
Test Plan:
Increased the size of the Tensor used in object transfers to have a >2GB storage requirement (in distributed_test.py).
Without the fix, the length overflows and the program requests a negatively sized Tensor:
```
RuntimeError: Trying to create tensor with negative dimension -2147482417: [-2147482417]
```
With the fix, the test passes.
Test used on a server with GPUs:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn --local -- broadcast_object
Differential Revision: D33281300
fbshipit-source-id: 1bc83e8624edc14e747eeced7bc8a7a10e443ee4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69400
Hopefully this makes naming more consistent. Without this change, some tests will fail for plugins since values can be set to upper case in some cases. This should prevent that and make lookup and comparison consistent.
Test Plan: Check the signals. There is no specific test for this, but all tests should pass.
Reviewed By: mrshenli
Differential Revision: D32836529
fbshipit-source-id: 1b7d2b64e04fe0391b710aa6ed6d1e47df9027a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for the NCCL backend, because we'd only check whether the backend is NCCL and then move tensors to CUDA.
Instead, check whether it is a wrapped PG, and then check the wrapped pg to see if it is NCCL.
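A hedged Python sketch of the check described above (the `wrapped_pg` attribute name is an assumption; the real code unwraps the wrapper differently):
```python
import torch

def _object_collective_device(pg):
    inner = getattr(pg, "wrapped_pg", pg)           # unwrap a DETAIL-mode wrapper if present
    if type(inner).__name__ == "ProcessGroupNCCL":  # the wrapped backend decides the device
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")
```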
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67639
Due to BC considerations, we cannot directly error out, as that
might break existing applications. Raise warnings first to improve
debuggability.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D32075151
Pulled By: mrshenli
fbshipit-source-id: 5680d420f5f6cd3f74a36616c03350e8a976b363
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66991
Currently, c10d extensions use Backend.NAME to store the creator
function. However, builtin backends use that same field to store the
name. This commit makes c10d extensions consistent with builtin ones,
and uses a dedicated `_plugins` field to store creator functions.
Thanks bryanmr for pointing this out.
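A simplified model (not the actual c10d source) of the layout this change introduces: builtin backends keep their lower-case name as a class attribute, while third-party creator functions live in the separate `_plugins` registry.
```python
class Backend:
    GLOO = "gloo"
    NCCL = "nccl"
    _plugins = {}

    @classmethod
    def register_backend(cls, name, creator_fn):
        setattr(cls, name.upper(), name.lower())  # name attribute, consistent with builtins
        cls._plugins[name.upper()] = creator_fn   # creator stored separately, not on NAME
```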
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D31820307
Pulled By: mrshenli
fbshipit-source-id: 259769ebfc80c0c9fc44d25498c8d19a3a09d1bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64722
`all_reduce_coalesced` and `all_gather_coalesced` have never been publicly
exposed in our API docs, so I would assume the blast radius to be small.
The motivation for this change is to allow implementing
`all_reduce_coalesced` and `all_gather_coalesced` by reusing the `allreduce`
and `allgather` C++ cores and performing the flatten and copy only on the Python
side. With that, we can then remove `all_reduce_coalesced` and
`all_gather_coalesced` from the C++ ProcessGroup APIs. For async mode,
the copy-back logic after the communication will need to be chained
as a callback on the returned Future, with the chained child Future used
as the return value (otherwise, we would need to wrap the child Future
into another work handle). This PR tests whether we can directly
return a Future without breaking tests and internal use cases. If yes,
it will make the consolidation a lot easier.
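A hedged sketch of the Python-side pattern this motivates: flatten, call the core `all_reduce`, and chain the copy-back on the returned Future (function name and structure are illustrative, not the eventual implementation):
```python
import torch
import torch.distributed as dist

def all_reduce_coalesced_sketch(tensors, group=None):
    flat = torch.cat([t.reshape(-1) for t in tensors])
    fut = dist.all_reduce(flat, group=group, async_op=True).get_future()

    def _copy_back(_):
        offset = 0
        for t in tensors:
            n = t.numel()
            t.copy_(flat[offset:offset + n].view_as(t))
            offset += n
        return tensors

    return fut.then(_copy_back)  # the chained child Future is what the caller gets
```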
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D30830994
Pulled By: mrshenli
fbshipit-source-id: dcde0ed9245e9e8fee357b3588b07d540a4b6318
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910
Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port`, like so:
```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```
An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken because the elastic agent is already running a TCPStore on that port.
For details see: https://github.com/pytorch/pytorch/issues/63874.
This change does a couple of things:
1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths
NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is also compatible with torchelastic, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but this is out of scope for this issue.
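Usage sketch of the new check (the port number is illustrative; it assumes the function is exported at the `torch.distributed` top level):
```python
import torch.distributed as dist

if dist.is_torchelastic_launched():
    dist.init_process_group(backend="nccl")  # the agent already provides the env:// rendezvous
else:
    dist.init_process_group(backend="nccl", init_method="tcp://localhost:29500")
```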
Test Plan: Unittests.
Reviewed By: cbalioglu
Differential Revision: D30529984
fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61299
Modify all_to_all and scatter to support complex tensors as well as float tensors.
Test Plan: buck run //caffe2/test/distributed:distributed_gloo_fork -- test_name --print-passing-details --run-disabled
Reviewed By: wanchaol
Differential Revision: D29563938
fbshipit-source-id: 59e436b3fa1aee3d5195cbcffd39587e642c76b9
Summary:
Revised version of https://github.com/pytorch/pytorch/issues/60573.
**Overview:**
This makes two changes:
- It introduces a `map_location` argument to `broadcast_object_list()`. The argument specifies the device to load tensors contained in objects received from the broadcast. This change requires modifying the implementation of `_object_to_tensor()` and `_tensor_to_object()` to use `torch.save()` and `torch.load()` respectively.
- It removes all calls to `_broadcast_object()` in `ZeroRedundancyOptimizer` and the corresponding test file in favor of `broadcast_object_list()`.
The default value of `map_location` is `None`, in which case `_object_to_tensor()` and hence `broadcast_object_list()` preserve their original behavior. Namely, contained tensors are loaded to their original device.
In `consolidate_state_dict()`, I specify `map_location=torch.device("cpu")` instead of `self._default_device`. This slightly changes the behavior from before when using `_broadcast_object()`. The reason I do so is that it saves one GPU to CPU data transfer since the action immediately after receiving the broadcasted `local_state_dict` is to copy it to CPU.
Explicitly, if `map_location=self._default_device`, then the data transfer path assuming NCCL backend is as follows:
`source GPU --[before serialize]--> source CPU --[before broadcast]--> source GPU --[broadcast]--> destination GPU --[before deserialize]--> destination CPU --[deserialize]--> destination GPU --[copy]--> destination CPU`
Hence, by setting `map_location=torch.device("cpu")` instead, the suffix becomes:
`destination CPU --[deserialize]--> destination CPU --[copy]--> destination CPU`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61539
Test Plan:
I added a test `test_broadcast_object_list_map_location()` that checks for both `map_location` as CPU and GPU that (1) tensors contained in broadcasted objects are appropriately loaded onto the specified device and (2) that the contents of the tensors are correct.
The existing `ZeroRedundancyOptimizer` tests pass.
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
The existing `broadcast_object_list()` test passes:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_broadcast_object_list
```
Reviewed By: zou3519
Differential Revision: D29701479
Pulled By: andwgu
fbshipit-source-id: c8d5f9057b32e5e9f40e8edc5b2cc25fb21414a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61305
Part I (this PR): Add dist_device argument to broadcast_object_list API
Part II: andwgu@ will deprecate _broadcast_object with the newly introduced API
Also includes the changes to `_object_to_tensor()`/`_tensor_to_object()` from PR 60573
Context: https://github.com/pytorch/pytorch/issues/60062
Test Plan:
Run the following on a devgpu with two CUDA devices:
$ python setup.py develop  # run this build on the devgpu
$ BACKEND='nccl' WORLD_SIZE=2 with-proxy python test/distributed/test_distributed_fork.py TestDistBackendWithFork.test_broadcast_object_list --v
$ BACKEND='gloo' WORLD_SIZE=2 with-proxy python test/distributed/test_distributed_fork.py TestDistBackendWithFork.test_broadcast_object_list --v
Build with distributed on: USE_DISTRIBUTED=1 python setup.py develop
Test on CPU devvm:
$ with-proxy python test/distributed/optim/test_zero_redundancy_optimizer.py
Imported from OSS
Differential Revision: D29566538
Reviewed By: iramazanli, mrshenli
Pulled By: bowangbj
fbshipit-source-id: 0bea52442551c5194acba85eadda16ba2ec4b6ef
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see whether they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111
Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.
Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.
Note that this API does not accept another overall main group. Like the APEX API `create_syncbn_process_group` ([here](https://nvidia.github.io/apex/_modules/apex/parallel.html)), it always uses the global world size and should only be applied when CUDA is available.
#Closes: https://github.com/pytorch/pytorch/issues/53962
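Usage sketch of the utility (the signature is inferred from the test names below and should be treated as an approximation):
```python
import torch
import torch.distributed as dist

cur_subgroup, subgroups = dist.new_subgroups()  # default: one subgroup per machine
t = torch.ones(1, device="cuda")
dist.all_reduce(t, group=cur_subgroup)          # e.g. local SGD / SyncBatchNorm averaging
```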
ghstack-source-id: 130975027
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed
Reviewed By: rohan-varma
Differential Revision: D28495672
fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329
This PR is part of a stack that addresses the GitHub issue #41614; it introduces:
- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.
- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.
Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` only results in a warning being output.
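A hedged usage sketch; the Python keyword is assumed to mirror the C++ `multiTenant` option and may differ in released versions:
```python
from datetime import timedelta
from torch.distributed import TCPStore

store = TCPStore("localhost", 29500, world_size=2, is_master=True,
                 timeout=timedelta(seconds=30), wait_for_workers=False,
                 multi_tenant=True)
```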
ghstack-source-id: 130676389
Test Plan: Run the existing tests since there are no behavioral changes.
Reviewed By: rohan-varma
Differential Revision: D28424978
fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753
TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing.
One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyways.
Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++.
ghstack-source-id: 130583775
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D28603754
fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58281
When TORCH_DISTRIBUTED_DEBUG=DETAIL is enabled, this PR causes process groups created by `new_group` and `init_process_group` that are nccl or gloo to be wrapped in `ProcessGroupWrapper`.
As a result, the user will get back a `ProcessGroupWrapper` that they can use in the exact same way as a regular nccl/gloo pg, but will be more helpful in terms of debugging desync/hangs.
Besides doing collective desync checks, which should be transparent if there are indeed no issues in the user application, there are no semantic differences in using the wrapper pg. Note that there is a performance implication here but that is a tradeoff we are making when DETAIL debug mode is enabled.
Open to suggestions on how to test this better. I verified locally that enabling TORCH_DISTRIBUTED_DEBUG=DETAIL creates the wrapper and all tests still pass, but that doesn't run in CI. On the other hand, running everything with debug=detail in addition to the regular tests might be too much, so we have only added it to a few tests for now. We also have tests in the diff below.
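Usage sketch: the wrapper is driven entirely by the environment variable, with no user-code changes (the script name is illustrative):
```python
# Launch with: TORCH_DISTRIBUTED_DEBUG=DETAIL python train.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # the default pg is transparently wrapped
dist.all_reduce(torch.ones(1, device="cuda"))  # desync checks run before the real collective
```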
ghstack-source-id: 129817857
Test Plan: ci
Reviewed By: SciPioneer
Differential Revision: D28402301
fbshipit-source-id: c4d3438320f6f0986e128c738c9d4a87bbb6eede
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224
Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.
Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (e.g. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.
This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.
Once all of this passes we simply dispatch the collective to the underlying pg.
Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
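A hedged Python sketch of the `CollectiveFingerPrint` idea (the real implementation is in C++ and gathers the full shape information; this fixed-length checksum is a simplification):
```python
import torch
import torch.distributed as dist

def check_collective(helper_pg, op_id, tensors):
    shape_sum = sum(d for t in tensors for d in t.shape)
    fp = torch.tensor([op_id, len(tensors), shape_sum], dtype=torch.long)
    gathered = [torch.empty_like(fp) for _ in range(dist.get_world_size(group=helper_pg))]
    dist.all_gather(gathered, fp, group=helper_pg)  # exchange fingerprints via the helper pg
    for other in gathered:
        if not torch.equal(other, fp):
            raise RuntimeError("Detected mismatched collective across ranks")
```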
ghstack-source-id: 129735687
Test Plan: ci
Reviewed By: zhaojuanmao
Differential Revision: D28023981
fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
Summary:
Will not land before the release, but would be good to have this function documented in master for its use in distributed debugability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322
Reviewed By: SciPioneer
Differential Revision: D28595405
Pulled By: rohan-varma
fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57711
Seeing some hangs/issues around store based barrier internally, would
be good to have this log to indicate whether store based barrier has completed
successfully or not for a particular rank to debug further.
ghstack-source-id: 128605600
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28249087
fbshipit-source-id: 644e5780519017ae780c3bc78bbe5def322db3f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531
Per the discussions in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API does not confuse users by letting them pass both a
timeout argument and a timeout in ProcessGroup.Options. This PR makes
`ProcessGroup.Options.timeout` a private field that is only used in
our test utils; for both `init_process_group` and `new_group`, we still
allow users to pass `timeout` as a separate argument. Since
`ProcessGroupGloo.Options` only has a `timeout` config, both functions
will not allow passing in options for the GLOO backend.
This way we still preserve the single `timeout` API, and only allow users
to use `ProcessGroupNCCL.Options` when needed.
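Usage sketch of the surviving public knob described above:
```python
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend="gloo", timeout=timedelta(minutes=5))
sub = dist.new_group(ranks=[0, 1], timeout=timedelta(seconds=60))
```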
cc pritamdamania87 rohan-varma
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D27893395
Pulled By: wanchaol
fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319
Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debuggability.
The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0 and broadcast to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.
This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
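A hedged Python sketch of the scheme (the real sequence number lives inside the C++ process group; this only illustrates the set-by-rank-0-then-broadcast idea):
```python
import torch
import torch.distributed as dist

seq = torch.zeros(1, dtype=torch.long)
if dist.get_rank() == 0:
    seq.random_(0, 2**31)   # rank 0 chooses the initial sequence number
dist.broadcast(seq, src=0)  # all ranks now agree; increment on every collective afterwards
```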
ghstack-source-id: 127011277
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27562769
fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55861
APIs such as torch.LongTensor and torch.ByteTensor are deprecated and
the recommended API is torch.tensor(args, dtype=...). Use this API in
distributed_c10d.
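Example of the replacement spelling:
```python
import torch

# deprecated: torch.LongTensor([1, 2, 3]); torch.ByteTensor(8)
t = torch.tensor([1, 2, 3], dtype=torch.long)
b = torch.empty(8, dtype=torch.uint8)
```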
ghstack-source-id: 126777875
Test Plan: CI
Reviewed By: pbelevich
Differential Revision: D27726600
fbshipit-source-id: 07eb8168d93697593589002c93c3903ce29431ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990
Reland of https://github.com/pytorch/pytorch/pull/55197, which failed a Windows test that was only run on master.
Disabled these tests on Windows, similar to how they are disabled on macOS. The reason for disabling them is that they use the libuv transport, which does not have as robust error handling as TCP on Linux. The result is that healthy non-zero ranks don't throw immediately (like they do on Linux) but instead throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27758424
fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197
From initial user feedback, one unexpected difference between monitored_barrier impl and barrier is the "all or nothing" semantics.
In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.
This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier will also fail. It does so by the following process, similar to acknowledgements:
1. Nonzero ranks call send()
2. Nonzero ranks call recv()
3. Rank 0 calls recv(); if this succeeds, rank 0 has acknowledged rank N as healthy
4. Once all ranks are acknowledged as healthy, rank 0 calls send() to all nonzero ranks to unblock them
Modified unittests to ensure the all or nothing failure behavior
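A hedged Python sketch of the acknowledgement scheme above (the actual implementation is inside the Gloo process group in C++; this only mirrors the message flow):
```python
import torch
import torch.distributed as dist

def monitored_barrier_sketch():
    rank, world = dist.get_rank(), dist.get_world_size()
    t = torch.zeros(1)
    if rank != 0:
        dist.send(t, dst=0)          # announce "I'm here"
        dist.recv(t, src=0)          # block until rank 0 has heard from everyone
    else:
        for other in range(1, world):
            dist.recv(t, src=other)  # a failure here marks `other` as unresponsive
        for other in range(1, world):
            dist.send(t, dst=other)  # unblock the healthy ranks only once all have acked
```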
ghstack-source-id: 126413088
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27523060
fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55660
Noticed this doc was missing clarification on nccl env vars that
init_process_group docs have. Also, specify default behavior when backend=None
is passed in.
ghstack-source-id: 126251116
Test Plan: Ci
Reviewed By: SciPioneer
Differential Revision: D27672208
fbshipit-source-id: 2e79d297174e135173bceb059450ea267367bde4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010
Follow up change to add a flag to provide an option for monitored barrier to collect all the failed ranks and then throw instead of just throwing on the first one. This is useful as now monitored barrier will be able to pick up on all hanging ranks instead of just one.
This is done by passing in a flag `wait_all_ranks=True`.
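Usage sketch of the flag:
```python
import torch.distributed as dist

dist.monitored_barrier(wait_all_ranks=True)  # collect every unresponsive rank before raising
```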
ghstack-source-id: 125699839
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27447787
fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53787
Per title, exposes a python-based monitored barrier API that we can use as part of debuggability and that may be useful for user applications.
ghstack-source-id: 125124315
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26965127
fbshipit-source-id: 6c7826e63758462e3e5111f28cced54cba76a758
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53663
This adds the process group options as an optional argument to `new_group`
and `init_process_group`, which allows users to pass in an initialized
process group options object for Gloo and NCCL.
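A hedged usage sketch (the `pg_options` keyword is the spelling assumed here and may differ by version):
```python
import torch.distributed as dist

opts = dist.ProcessGroupNCCL.Options()
dist.init_process_group(backend="nccl", pg_options=opts)
```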
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D26968857
Pulled By: wanchaol
fbshipit-source-id: 2ff73a009120b85e83ecde7c69956b731902abc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53060
As title. We would like to use alternative pickler/unpickler
implementations, to make it possible to send objects over the wire that
are coming from a torch.package
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D26737317
Pulled By: suo
fbshipit-source-id: 6bdef9824e48ef657dcad72cc5a9114e6612ea4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50625
Make API signatures consistent and provide default argument similar to
the tensor collectives.
ghstack-source-id: 120718121
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D25932012
fbshipit-source-id: d16267e236a65ac9d55e19e2178f9d9267b08a20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49930
Certain store implementations don't work well when we use get() and
add() on the same key. To avoid this issue, we only use add() in the store
based barrier. The buggy store implementations can't be properly fixed due to
legacy reasons.
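A hedged sketch of an add()-only store barrier (key name and polling interval are illustrative):
```python
import time
from datetime import timedelta

def store_barrier(store, world_size, key="store_barrier_key", timeout=timedelta(seconds=300)):
    arrived = store.add(key, 1)          # increment-and-read; no get() on the same key
    deadline = time.monotonic() + timeout.total_seconds()
    while arrived < world_size:
        if time.monotonic() > deadline:
            raise RuntimeError("Timed out waiting for the store based barrier")
        time.sleep(0.01)
        arrived = store.add(key, 0)      # read the current count without changing it
    return True
```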
Test Plan:
1) unit tests.
2) waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D25725386
fbshipit-source-id: 1535e2629914de7f78847b730f8764f92cde67e7
Summary:
On a multi-GPU node, the mapping between a rank and its corresponding GPU can differ.
Provide an optional parameter to specify the GPU device number for the
allreduce operation in the barrier function.
Add test cases to validate barrier device_ids.
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Fixes https://github.com/pytorch/pytorch/issues/48110
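Usage sketch of the new optional parameter (NCCL backend):
```python
import torch
import torch.distributed as dist

dist.barrier(device_ids=[torch.cuda.current_device()])
```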
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069
Reviewed By: mrshenli
Differential Revision: D25658528
Pulled By: rohan-varma
fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49694
The store based barrier introduced in
https://github.com/pytorch/pytorch/pull/49419 broke for certain store types.
This is a quick fix to resolve the issues for other store types.
ghstack-source-id: 119006874
Test Plan: 1) waitforbuildbot
Reviewed By: ppwwyyxx, rohan-varma
Differential Revision: D25668404
fbshipit-source-id: 751fb8b229ad6f50ee9c50f63a70de5a91c9eda5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49419
As described in https://github.com/pytorch/pytorch/issues/48110, the
newly introduced `barrier()` in `init_process_group` messes up NCCL
communicator state since it uses a bunch of default devices to perform an
allreduce which simulates a barrier(). As a result, subsequent NCCL operations
might not behave as expected.
ghstack-source-id: 118861776
Test Plan:
1) unit test added.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D25566550
fbshipit-source-id: ab083b67b634d7c515f4945deb228f959b27c936
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49131
Users frequently assume the correct range of ranks is 1 ...
`world_size`. This PR updates the docs to indicate that the correct rank range
users should specify is 0 ... `world_size` - 1.
Test Plan: Rendering and Building Docs
Reviewed By: mrshenli
Differential Revision: D25410532
fbshipit-source-id: fe0f17a4369b533dc98543204a38b8558e68497a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48767
As part of investigating
https://github.com/pytorch/pytorch/issues/48464, I realized some weird
inconsistency in how we use `_default_pg` and `group.WORLD`. `group.WORLD`
apparently was an `object()` and never changed despite `_default_pg` changing.
In this sense, `group.WORLD` was being used as a constant to refer to the default
pg, but wasn't of type PG at all. In fact the passed in group is also compared
via `==` to `group.WORLD` in many places, and it just worked since the default
argument was `group.WORLD`.
To clean this up, I got rid of `_default_pg` completely and instead used
`group.WORLD` as the default pg throughout the codebase. This also fixes the
documentation issues mentioned in
https://github.com/pytorch/pytorch/issues/48464.
#Closes: https://github.com/pytorch/pytorch/issues/48464
ghstack-source-id: 118459779
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D25292893
fbshipit-source-id: 9a1703c71610aee2591683ab60b010332e05e412
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48872
Using NCCL communicators concurrently is not safe and this is
documented in NCCL docs.
However, this is not documented in PyTorch and we should add documentation for
ProcessGroupNCCL so that users are aware of this limitation.
ghstack-source-id: 118148014
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D25351778
fbshipit-source-id: f7f448dc834c47cc1244f821362f5437dd17ce77
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43932
Adds some basic examples to the documentation for each of the newly added
object-based collectives.
ghstack-source-id: 117965966
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23441838
fbshipit-source-id: 91344612952cfcaa71f08ccf2a2c9ed162ca9c89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43930
Closes #23232. As part of addressing #23232, this PR adds support for scatter_object_list, which is an API to scatter arbitrary picklable objects to all the other ranks.
The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The result of the `scatter` is stored as the first element of `scatter_object_output_list`, and the src rank is expected to provide an input list `scatter_object_input_list` which contains the objects to scatter.
Note that this API requires 1 broadcast and 2 scatters. This is because we must communicate the maximum object size to be scattered, which only the src rank knows about. After that, we also need to communicate the objects themselves as well as the true sizes of the object.
Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
It only works for Gloo because NCCL doesn't support scatter.
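Usage sketch of the API:
```python
import torch.distributed as dist

out = [None]
inputs = [{"rank": r} for r in range(dist.get_world_size())] if dist.get_rank() == 0 else None
dist.scatter_object_list(out, inputs, src=0)
# out[0] now holds the object scattered to this rank (Gloo backend, per the note above)
```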
ghstack-source-id: 117904065
Reviewed By: mrshenli
Differential Revision: D23430686
fbshipit-source-id: f033b89cd82dadd194f2b036312a98423449c26b
Summary:
Calling torch.distributed.irecv(src=None) fails with "The global rank None is not part of the group". This change calls recv_anysource if src is None. Tested locally with MPI backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47137
Reviewed By: heitorschueroff
Differential Revision: D25292656
fbshipit-source-id: beb018ba0b676924aeaabeb4a4d6acf96e4a1926
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47797
NCCL p2p tests had hang issues before; the reason is that there were some unexpected context switches. For example, process 1, which is supposed to only use GPU 1, could use GPU 0 as a result of not explicitly setting the device.
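A hedged sketch of the fix pattern: pin the CUDA device explicitly so no rank's NCCL p2p call silently lands on GPU 0 (a one-GPU-per-rank layout is assumed):
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())
t = torch.ones(1, device="cuda")
if rank == 0:
    dist.send(t, dst=1)
elif rank == 1:
    dist.recv(t, src=0)
```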
ghstack-source-id: 116461969
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24863808
fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47644
Minor Update to the init_process_group docs.
ghstack-source-id: 116441798
Test Plan: CI
Reviewed By: jiayisuse, mrshenli
Differential Revision: D24633432
fbshipit-source-id: fbd38dab464ee156d119f9f0b22ffd0e416c4fd7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897
These APIs implicitly assumed that the GPU for a rank equals the rank index, but
that is not necessarily true. For example, the first GPU could be used for a
different purpose and rank 0 could use GPU 1, rank 1 uses GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.
Also adds/tidies up some documentation.
ghstack-source-id: 115359633
Test Plan: Modified unittests
Reviewed By: divchenko
Differential Revision: D24556177
fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46856
Add reference to NCCL_ASYNC_ERROR_HANDLING in the pytorch docs,
similar to how NCCL_BLOCKING_WAIT is currently described.
ghstack-source-id: 115186877
Test Plan: CI, verifying docs change
Reviewed By: jiayisuse
Differential Revision: D24541822
fbshipit-source-id: a0b3e843bc6392d2787a4bb270118f2dfda5f4ec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994
Send/Recv tests were disabled because of the https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24172484
fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
ghstack-source-id: 113592785
Test Plan: unittest
Reviewed By: jiayisuse
Differential Revision: D23709848
fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181
`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race that after initializing the process group on say rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.
To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.
Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
#Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112
Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.
Reviewed By: mrshenli
Differential Revision: D23858025
fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44000
This wasn't documented, so add a doc saying all ranks are used when
ranks=None
ghstack-source-id: 111206308
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D23465034
fbshipit-source-id: 4c51f37ffcba3d58ffa5a0adcd5457e0c5676a5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43887
As part of addressing #23232, this PR adds support for `broadcast_object_list` which is an API to broadcast arbitrary picklable objects to all the other ranks. This has been a long-requested feature, so would be good for Pytorch to natively support this.
The implementation approach follows a similar approach as https://github.com/pytorch/pytorch/pull/42189. The input is a list of objects to be broadcasted and it is in place, meaning all ranks part of the group will have their input list modified to contain the broadcasted objects from the src rank.
Note that the API is designed to match the tensor-based collectives other than supporting async_op. For now, it is a blocking call. If we see demand to support async_op, we will have to make more progress on merging work/future to support this.
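Usage sketch of the in-place semantics described above:
```python
import torch.distributed as dist

objs = ["foo", {"k": 1}, [1, 2, 3]] if dist.get_rank() == 0 else [None, None, None]
dist.broadcast_object_list(objs, src=0)
# every rank's `objs` now holds the values from the src rank
```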
ghstack-source-id: 111180436
Reviewed By: mrshenli
Differential Revision: D23422577
fbshipit-source-id: fa700abb86eff7128dc29129a0823e83caf4ab0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42189
Rehash of https://github.com/pytorch/pytorch/pull/28811, which was several months old.
As part of addressing https://github.com/pytorch/pytorch/issues/23232, this PR adds support for the following APIs:
`allgather_object` and `gather_object` to support gather/allgather of generic, picklable Python objects. This has been a long-requested feature so PyTorch should provide these helpers built-in.
The methodology is what is proposed in the original issue:
1) Pickle object to ByteTensor using torch.save
2) Comm. tensor sizes
3) Copy local ByteTensor into a tensor of maximal size
4) Call tensor-based collectives on the result of (3)
5) Unpickle back into object using torch.load
Note that the API is designed to match the tensor-based collectives, other than supporting `async_op`. For now, it is a blocking call. If we see demand to support `async_op`, we will have to make more progress on merging work/future to support this.
If this is a suitable approach, we can support `scatter`, `broadcast` in follow up PRs.
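A hedged Python sketch of the five steps above (simplified; the real helpers in distributed_c10d handle devices, groups, and per-rank bookkeeping more carefully):
```python
import io
import torch
import torch.distributed as dist

def all_gather_object_sketch(obj, group=None):
    buf = io.BytesIO()
    torch.save(obj, buf)                                             # 1) pickle via torch.save
    local = torch.frombuffer(bytearray(buf.getvalue()), dtype=torch.uint8)
    size = torch.tensor([local.numel()], dtype=torch.long)
    world = dist.get_world_size(group=group)
    sizes = [torch.zeros_like(size) for _ in range(world)]
    dist.all_gather(sizes, size, group=group)                        # 2) communicate sizes
    max_len = int(max(s.item() for s in sizes))
    padded = torch.zeros(max_len, dtype=torch.uint8)
    padded[: local.numel()] = local                                  # 3) copy into max-size tensor
    out = [torch.zeros(max_len, dtype=torch.uint8) for _ in range(world)]
    dist.all_gather(out, padded, group=group)                        # 4) tensor-based collective
    return [torch.load(io.BytesIO(bytes(o[: int(s.item())].tolist())))
            for o, s in zip(out, sizes)]                             # 5) unpickle via torch.load
```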
ghstack-source-id: 109322433
Reviewed By: mrshenli
Differential Revision: D22785387
fbshipit-source-id: a265a44ec0aa3aaffc3c6966023400495904c7d8
Summary:
PyTorch c10d originally supported only the built-in c10d backends, such as
nccl/gloo/mpi. This patch extends the c10d capability to support dynamically
loading third-party communication libraries that are derived from the ProcessGroup base class.
The related RFC is: https://github.com/pytorch/pytorch/issues/27955
This way, users just need to specify a third-party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. For how to develop a new third-party c10d backend
through a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp.
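A hedged usage sketch; `Backend.register_backend` is today's public hook, and whether it matches this PR's exact loading mechanism is an assumption:
```python
import torch.distributed as dist

def _create_third_party_pg(store, rank, world_size, timeout):
    raise NotImplementedError  # a real extension returns a ProcessGroup built by its cpp extension

dist.Backend.register_backend("my_backend", _create_third_party_pg)
# dist.init_process_group(backend="my_backend", rank=..., world_size=..., init_method=...)
```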
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068
Differential Revision: D19174838
Pulled By: agolynski
fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62
Summary:
I think this warning isn't true anymore, and the NCCL backend works without PyTorch needing to be built from source.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34051
Differential Revision: D20195310
Pulled By: ezyang
fbshipit-source-id: 14f879a8c43ea5efdbdf0f638792ea2b90011f4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434
Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98558377
Test Plan: Added UT test_tcp_store_timeout_set
Differential Revision: D19935390
fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325
Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.
Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875
Test Plan: Added a UT
Differential Revision: D19871946
fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29059
This is a resubmit of reverted diff D18209289 ( PR #28857 ).
Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo
Reviewed By: pietern
Differential Revision: D18277097
fbshipit-source-id: aecfd7206d70829f0cac66182bf02fccee410fed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28634
caveat 1: this only works in sync mode.
caveat 2: this is going to go away and be replaced by a C++ implementation
Test Plan: buck test caffe2/test:distributed_gloo -- test_all_gather_coalesced
Reviewed By: mrshenli
Differential Revision: D18123422
fbshipit-source-id: cfb9950d5d54c6181a5240e7cc9fed88ed47f5d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28226
# Goal
Rendezvous step should be the first step not only for `init_process_group` but also for `init_model_parallel`.
The road block is that there is a special step in `init_process_group` where the `rank` and `world_size` arguments passed to `init_process_group(..)` are appended to the `init_method` URL string.
We need to make this argument appending step common and re-usable for both `init_process_group` and `init_model_parallel`.
# Solution
- Put argument appending inside of `rendezvous` function.
- Remove manual `init_method` url construction. Delegate the responsibility to the `rendezvous` function.
- Use the `rendezvous` function for any `RpcAgent`.
Test Plan:
```
buck test mode/dev-nosan caffe2/test:c10d
```
```
buck test mode/dev-nosan caffe2/test:rpc_fork -- test_invalid_names
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_worker_id
```
```
buck test mode/dev-nosan caffe2/torch/fb/distributed/pytorch/tests:test_rpc -- test_sync_rpc
```
```
buck test mode/dev-nosan caffe2/torch/fb/rendezvous:zeus_test
```
```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling -- test_single_trainer_multiple_pss
```
Differential Revision: D5524494
fbshipit-source-id: 50be58ec3c928621b0874b044ef4a1640534d8ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27850
Many of these are real problems in the documentation (i.e., link or
bullet point doesn't display correctly).
Test Plan: - built and viewed the documentation for each change locally.
Differential Revision: D17908123
Pulled By: zou3519
fbshipit-source-id: 65c92a352c89b90fb6b508c388b0874233a3817a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27224
As part of adding error handling to NCCL, we are now able to specify a
timeout for operations using ProcessGroupNCCL. However, this timeout had a
default of 10 seconds and didn't respect the timeout specified in
init_process_group.
In this change, I've ensured we pass the appropriate timeout to
ProcessGroupNCCL.
ghstack-source-id: 91283548
Test Plan:
Added unit test to verify timeout passed in to init_process_group is
respected.
Differential Revision: D17717992
fbshipit-source-id: c73320187f1f3b2693ba1e177d80646e282d01a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26912
The group name is used as a prefix in the c10d store, and without a consistent name the process group cannot be initialized.
When a process group doesn't have an explicit name (only the WORLD (default) process group can have an explicit name), we use the global _group_counter to generate the name. We need to reset the counter on destruction so a consistent value is generated when we re-create process groups after some trainers recover from failure.
Test Plan: existing tests passed
Reviewed By: mrshenli
Differential Revision: D17594268
fbshipit-source-id: 17f4d2746584dadaa5d468085d871ff3e95a1c84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25905
Now that we can detect and recover from failures in NCCL we should
allow processes that are started at different times (and perhaps have
had previous NCCL process group instances), to eventually be part of
the same process group. Keeping track of group names in global
variables prevents that, because the processes will be out of sync.
This commit removes the global group name maps and defers
responsibility of isolating access to the same store from multiple
process groups to the store itself. Users can use `c10d::PrefixStore`
to derive new store instances whose keyspace is scoped to some
prefix. Functionally, this is identical to keeping a global map and
using a group name, but also gives more flexibility to the front-end
API to reset state and have processes that have started at different
times to join the same process group.
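A hedged sketch of scoping a store's keyspace with `c10d::PrefixStore` from Python (the store choice is illustrative):
```python
import torch.distributed as dist

base = dist.HashStore()                      # any c10d store works here
scoped = dist.PrefixStore("group_a", base)   # keys set/get through `scoped` are namespaced
scoped.set("rank0_ready", "1")
print(scoped.get("rank0_ready"))             # b'1'; other groups would use a different prefix
```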
ghstack-source-id: 89804865
Test Plan: Tests pass.
Differential Revision: D17281416
fbshipit-source-id: eab3b48463a9b0ef24aedeca76e2bb970b9f33ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25575
For both scatter and gather, only the source and destination rank,
respectively, need to supply a list of tensors. The `scatter_list` and
`gather_list` arguments were mandatory, however, and this has resulted
in some confusion. This commit makes both the `scatter_list` and
`gather_list`, and the `src` and `dst` arguments optional.
Closes #25463.
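Usage sketch of the relaxed arguments:
```python
import torch
import torch.distributed as dist

t = torch.ones(1) * dist.get_rank()
if dist.get_rank() == 0:
    out = [torch.zeros(1) for _ in range(dist.get_world_size())]
    dist.gather(t, gather_list=out, dst=0)
else:
    dist.gather(t, dst=0)  # non-destination ranks may now omit gather_list entirely
```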
Test Plan: Imported from OSS
Differential Revision: D17164253
fbshipit-source-id: a16bc208c87a1c96163c1a86d4a7ca8634a26f95
Summary:
addresses https://github.com/pytorch/pytorch/issues/21640 for CPU tensors and the Gloo backend.
Questions:
- ~~currently takes `AllreduceOptions`, since all of the options are the same. Would it be better to make a new `AllreduceCoalescedOptions` class?~~
- ~~I decided to inherit from `ProcessGroupGloo::AsyncWork` instead of `AsyncAllreduceWork` to shorten the inheritance chain a bit and for consistency with existing classes. However, this means that the two `getFunction` methods are copy-pasted. Would inheriting from `AsyncAllreduceWork` be preferable?~~
- ~~should the work class be named `AsyncCoalescedAllreduceWork` or `AsyncAllreduceCoalescedWork`?~~
thank you!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24949
Differential Revision: D17055580
Pulled By: mrshenli
fbshipit-source-id: e63b5fcaec6021053ea960776a09ee8cf11d1ec2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19033
torch.distributed.init_process_group() has had many parameters added, but the contract isn't clear. Adding documentation, asserts, and explicit args should make this clearer to callers and more strictly enforced.
Reviewed By: mrshenli
Differential Revision: D14813070
fbshipit-source-id: 80e4e7123087745bed436eb390887db9d1876042
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to that of the global process group and they had a
special groupRank and groupSize field to capture the _real_ rank.
This also meant asymmetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group and an additional check was needed to verify
whether or not a process was indeed part of a process group or not.
This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
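Usage sketch of the unified semantics described above:
```python
import torch.distributed as dist

group = dist.new_group(ranks=[0, 1])
if group is dist.GroupMember.NON_GROUP_MEMBER:
    pass                        # this process is not in the subgroup; skip its collectives
else:
    dist.barrier(group=group)
```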
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809
Differential Revision: D14887937
Pulled By: pietern
fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
Summary:
closes #16520
Hi pietern, I am not sure if this is the expected way to pass timeout to `Store`, could you please help take a look? Thanks!
Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571
Differential Revision: D13954527
Pulled By: mrshenli
fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18595
There is no need to force the backend to be the same as the global
process group, as long as the backend is "nccl" or "gloo".
Reviewed By: mrshenli
Differential Revision: D14657204
fbshipit-source-id: 868817b9f219e3be8db0761a487f0027ed46663b
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.
This should enable the following:
* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.
This does not include any logic for computing bucket assignment, which
can be done separately; either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.
Also see #17757 and #13273.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251
Reviewed By: mrshenli
Differential Revision: D14571899
Pulled By: pietern
fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16746
As titled. We use a special URL scheme, elasticzeus, for elastic Zeus so that we don't need to change the public interface of init_process_group.
Reviewed By: aazzolini, soumith
Differential Revision: D13948151
fbshipit-source-id: 88939dcfa0ad93467dabedad6905ec32e6ec60e6
Summary:
When I wrote the frontend API, it was designed around not letting users use the default group directly in any functions. It should really be private.
All collectives are supposed to use either group.WORLD or anything that comes out of new_group. That was the initial design.
We need to add a TODO on removing group.WORLD one day. It exists for backward compatibility reasons and adds lots of complexity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14767
Reviewed By: pietern
Differential Revision: D13330655
Pulled By: teng-li
fbshipit-source-id: ace107e1c3a9b3910a300b22815a9e8096fafb1c
Summary:
These were not enabled after adding support in the Gloo backend. The
argument checks in ProcessGroupGloo raised an error in two cases:
* If the input tensor list to scatter was ``[None]`` on processes other
than the source process.
* If the output tensor list to gather was ``[None]`` on processes other
than the destination process.
This commit prepares these arguments explicitly instead of boxing them
at the process group call site.
This fixes #14536.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14572
Differential Revision: D13272812
Pulled By: pietern
fbshipit-source-id: 12cb0d85ec92f175365cbada585260f89330aad8
Summary:
This fixes two things:
(1) The NCCL backend didn't support 2 or more groups. This is because we need a group name in the ProcessGroupNCCL class to keep track of the ProcessGroup ID within that group name, and also the NCCL unique ID within that group name and process group ID. Otherwise, different processes will create different NCCL PGs in different orders and can clash on these names. This fixes the NCCL problem.
(2) When using new_group, each rank should enter this function and update its global group name counter to ensure that every rank always operates on the same group name.
With both fixes: repro code in: https://github.com/pytorch/pytorch/issues/14528 should work with both NCCL and Gloo backends.
```
tengli@learnfair096:~$ python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=30000 ~/github_issues/nccl_group.py
rank: 0 - val: 6.0
rank: 2 - val: 6.0
rank: 3 - val: 6.0
rank: 1 - val: 6.0
rank: 4 - val: 22.0
rank: 6 - val: 22.0
rank: 5 - val: 22.0
rank: 7 - val: 22.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14529
Differential Revision: D13253434
Pulled By: teng-li
fbshipit-source-id: 8eb45882b996b06d951fc9a306d5de86a42e8b84
Summary:
Fixing: https://github.com/pytorch/pytorch/issues/14446
This was a supported behavior in old torch.distributed. We want to support it in the new release.
Tests should cover all combinations of scenarios where either env vars or args are set for rank, size, or both.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14494
Differential Revision: D13253433
Pulled By: teng-li
fbshipit-source-id: c05974d84f1bdf969f74ec45763e11a841fe4848
Summary:
This function is only implemented for the subclasses where it makes
sense. If it's not overridden it will throw an error. Having this
function removes the need for a pointer passing hack to pass the
source rank of a recv operation back to the caller. Instead, the
caller can now call `source_rank` on the work object and achieve
the same result.
Closes #11804.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14453
Differential Revision: D13230898
Pulled By: pietern
fbshipit-source-id: ef38f48bfaca8ef9a364e5be122951bafc9f8e49
Summary:
This applies to the gloo backend only. Timeout support for the NCCL and
MPI backends is tracked in issues #14371 and #14372 respectively.
When creating a new process group (either the global one or any subgroup
created through `new_group`) you can specify a timeout keyword
argument (of type datetime.timedelta). This timeout applies to all
collective operations executed against that process group, such that any
operation taking longer than the timeout will throw a runtime error.
Using a different, better catchable error type is tracked in #14433.
This fixes #14376.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14435
Differential Revision: D13234317
Pulled By: pietern
fbshipit-source-id: 973993b67994dc64861c0977cbb6f051ec9d87f6
Summary:
This will address https://github.com/pytorch/pytorch/issues/13574
This error message should be more informative to the user for all the non-multi-GPU ops, since we always python-bind to the multi-GPU ops.
test_distributed should cover everything. Also tested both RuntimeErrors.
```
>>> a = torch.ByteTensor([])
>>> b = [a, a]
>>> dist.all_reduce(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 809, in all_reduce
_check_single_tensor(tensor, "tensor")
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 207, in _check_single_tensor
"to be a torch.Tensor type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor to be a torch.Tensor type
>>> b = ["b"]
>>> dist.all_gather(b, a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 1006, in all_gather
_check_tensor_list(tensor_list, "tensor_list")
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 225, in _check_tensor_list
"to be a List[torch.Tensor] type".format(param_name))
RuntimeError: Invalid function argument. Expecting parameter: tensor_list to be a List[torch.Tensor] type
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14204
Differential Revision: D13131526
Pulled By: teng-li
fbshipit-source-id: bca3d881e41044a013a6b90fa187e722b9dd45f2
Summary:
Also add docs for get_backend, Backend, and reduce_op
fixes #11803
cc pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11830
Differential Revision: D9927991
Pulled By: SsnL
fbshipit-source-id: a2ffb70826241ba84264f36f2cb173e00b19af48
Summary:
Clean it up from my queue:
https://github.com/pytorch/pytorch/issues/12721
```
>>> torch.distributed.init_process_group(backend="tcp")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 275, in init_process_group
backend = DistBackend(backend)
File "/private/home/tengli/pytorch/torch/distributed/distributed_c10d.py", line 55, in __new__
raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backends for collective operations on CPU tensors.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13596
Differential Revision: D12931196
Pulled By: teng-li
fbshipit-source-id: bb739b107ad7454e2e0a17430087161fedd4c392
Summary:
The existing default timeout was set at 10 seconds, which is too low
for asynchronous tasks that depend on a barrier to resynchronize.
Having a single timeout for all operations is not ideal and this will
be addressed in future commits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13056
Reviewed By: teng-li
Differential Revision: D10558746
Pulled By: pietern
fbshipit-source-id: d857ea55b1776fc7d0baf2efd77951b5d98beabb
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.
cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715
Reviewed By: pietern
Differential Revision: D9889646
Pulled By: SsnL
fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08