This reverts commit ff0358b038.
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.
Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.
The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.
This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.
The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.
Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.
Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).
Test Plan:
Added new unit test:
buck2 test caffe2/test/distributed:store
Differential Revision: D46622317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.
As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction.
Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
Summary:
When creating a new DDP instance for the same model when an old DDP instance existed, the autograd hooks from the old DDP instance might not be cleared. Also, relying on python gc to clear out old autograd hooks is fragile and may not work 100% of the time.
As a result, in this PR I'm adding a way to explicitly remove these hooks from DDP
Test Plan:
Unit test added
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96490
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
Changes:
- #95200
1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.
- #95267
3. Use `Namedtuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:
`namedtuple + __annotations__`:
```python
PackedSequence_ = namedtuple('PackedSequence_',
['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])
# type annotation for PackedSequence_ to make it compatible with TorchScript
PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
'sorted_indices': Optional[torch.Tensor],
'unsorted_indices': Optional[torch.Tensor]}
```
`Namedtuple`: Python 3.6+
```python
class PackedSequence_(NamedTuple):
data: torch.Tensor
batch_sizes: torch.Tensor
sorted_indices: Optional[torch.Tensor]
unsorted_indices: Optional[torch.Tensor]
```
- => this PR: #95268
4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95268
Approved by: https://github.com/huydhn
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330
### Implementation
Move backend-specific (NCCL, Gloo, etc) collective implementations to corresponding `Backend` class. Update ProcessGroup to support multiple backends and use dispatcher to calls backends based on tensor device type.
### Changes
#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122.
#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` how returns a generic `ProcessGroup` object, it contains a list of backends (the ones stated above) which it will dispatch operations to.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`
### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig
### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)
# Example
This is a basic script (using 2 backends within a process group)
```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os
if __name__ == "__main__":
rank = os.environ.get("RANK")
# initialize with both gloo and nccl
dist.init_process_group()
# with gloo
dist.all_reduce(torch.tensor([1.0]))
print(f"Rank {rank} finished")
# with nccl
dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```
Test Plan: Imported from OSS
Differential Revision: D42069829
Pulled By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
Summary:
The flow logic around torch.dist imports results in large number of pyre errors (100's); would be preferable to just raise on importing as opposed to silently fail.
Con: Some percentage (MacOS?) of users may have notebooks that imports PT-D, although would think small, since any attempt to call parts of the library would just fail...
TODO: assuming ok, will remove the 10's-100's of unused pyre ignores no longer required.
Test Plan: existing unit tests
Differential Revision: D39842273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85781
Approved by: https://github.com/mrshenli
Summary:
This diff integrates UCC process group as a native component of Pytorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for UCC collective communication library.
The environment and cmake variables are named in mirroring to the existing process groups such as NCCL and Gloo. Specifically,
- USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME.
Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., requiring users to specify external libraries for UCX and UCC. In subsequent diffs, we will add UCX and UCC repos as third-party dependencies in pytorch/third-party.
Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:
$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded
Differential Revision: D36973688
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223
DETAIL debug mode didn't work with object-based collectives for NCCL backend, because we'd only check if backend is NCCL and then move tensors to CUDA.
Instead, check if it is a wrapped PG, and then check the pg that is wrapped to see if its nccl.
ghstack-source-id: 143242023
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D32366840
fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680
Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target for perf
optimization.
ghstack-source-id: 140875182
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D31679477
fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113
Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.
Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`
Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30615965
fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910
Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:
```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```
An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port.
For details see: https://github.com/pytorch/pytorch/issues/63874.
This change does a couple of things:
1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths
NOTE: the issue mentions that we should fail-fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic and I've validated that file is compatible with torchelastic. There is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future but this is out of scope for this issue.
Test Plan: Unittests.
Reviewed By: cbalioglu
Differential Revision: D30529984
fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods set_tensor(.) and get_tensor() in the python exposed API from the C++ logic with buffer() and set_buffer(.) to be a cleaner interface.
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592
Reland #62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last,
3) get_per_parameter_tensors -> gradients,
4) get_model_params_for_bucket -> parameters.
ghstack-source-id: 134848352
Test Plan: unit test
Reviewed By: andwgu
Differential Revision: D30049431
fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last,
3) get_per_parameter_tensors -> gradients,
4) get_model_params_for_bucket -> parameters.
Test Plan:
Local run comprehensive test with following results:
https://pxl.cl/1Ml8b
For two timeout failure test cases, most likely environment related and fail in my devserver.
Reviewed By: SciPioneer
Differential Revision: D30024161
fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62231
`compute_bucket_assignment_by_size` is responsible for setting per-bucket size limits, return this information from the function so that we are aware of size limits for each bucket.
This is currently not being consumed, but will be in the next diff when we log bucket size limits to DDP logging. This will help us run experiments under different bucket size configs and analyze the impact.
ghstack-source-id: 134480575
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D29919056
fbshipit-source-id: dd5a096fa23d22e5d9dc1602899270a110db4a19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61637
To support running optimizer as a communication hook, add API to
retrieve the model parameters.
The API returns a `dict[idx -> tensor]` where `idx` is the intra bucket index of gradient tensor and thus the same index of `perParameterTensors`. The API can be used as follows to retrieve the model parameters:
```
per_param_grad_tensors = bucket.get_per_parameter_tensors()
idx_to_model_params = bucket.get_grad_index_to_variable_mapping()
for grad_tensor_idx, model_param in idx_to_model_params.items():
self.assertEqual(model_param.grad, per_param_grad_tensors[grad_tensor_idx])
```
This provides a way for comm. hook developer to retrieve model parameters within a hook. In the next diffs, we will use this to run optimizer as a DDP comm. hook.
ghstack-source-id: 133768666
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29691418
fbshipit-source-id: 4bfa824768a5850f73ee330017e2bcc29ceb7edc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60301
`GradBucket` is not meant to be constructed by Python user, only
consumed as part of comm. hook
ghstack-source-id: 131860243
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29239320
fbshipit-source-id: f1631a16e7d66b7e4a9b4a44698e2319005d10b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59719
Added filestore functionality to the c10d backend. FileStore will create a temporary file in the /tmp directory to use if it is selected as the store type. Appropriate tests were added as well.
FileStore was modified to expose the path field for testing. It was also modified so that the numWorkers field in the constructor is optional (defaulting to -1). A negative value indicates there is not a fixed number of workers. In this case, the file is not attempted to be cleaned at the end.
Test Plan: Unit tests for creating a c10d backend with filestore and simple error handling.
Reviewed By: cbalioglu, H-Huang
Differential Revision: D28997436
fbshipit-source-id: 24c9b2c9b13ea6c947e8b1207beda892bdca2217
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59066
Per title
ghstack-source-id: 130338812
Test Plan: ci
Reviewed By: SciPioneer
Differential Revision: D28734666
fbshipit-source-id: 89ca7f8e625c4068ba0ed9800be2619e469ae515
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224
Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.
Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (i.e. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.
This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.
Once all of this passes we simply dispatch the collective to the underlying pg.
Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687
Test Plan: ci
Reviewed By: zhaojuanmao
Differential Revision: D28023981
fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64