Commit Graph

74 Commits

Author SHA1 Message Date
PyTorch MergeBot
314a502eb0 Revert "Reland "[C10] PG observability hooks. (#108815)" (#110907)"
This reverts commit 7678cd22af.

Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this 7678cd22af ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))
2023-10-11 00:23:42 +00:00
Will Constable
7678cd22af Reland "[C10] PG observability hooks. (#108815)" (#110907)
This reverts commit ff0358b038.

(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)

Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
2023-10-10 20:09:40 +00:00
PyTorch MergeBot
ff0358b038 Revert "[C10] PG observability hooks. (#108815)"
This reverts commit 0c7a877745.

Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))
2023-10-06 19:49:49 +00:00
Rodrigo Kumpera
0c7a877745 [C10] PG observability hooks. (#108815)
Expose a set of observability hooks into C10D such that our users can
detect collectives failure both faster and more easily.

The design is similar to NCCL desync debug that it minimized the
overhead by doing most of the work out of the main thread.

This PR introduces a new module torch.distributed.hooks that exposes the following set of methods:

    register_collective_start_hook
    register_collective_end_hook
    register_process_group_hook

The process group hook exposes PG creation on the member ranks and call them inline from the
the PG creation code. This is fine since this happens during initialization and a limited number of times.

The collective start/end hooks are fired from a single background thread. It reads
events from a C++ queue and dispatches over.

Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown
and have it as background thread. This is not possible with more reasonable choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-10-06 18:52:46 +00:00
Rodrigo Kumpera
b26af5d5ac [c10d] Add TCPSTore libuv backend support to c10d rendezvous. (#108284)
This enables libuv under env and tcp urls.

Under env either use the environment variable USE_LIBUV=1
or the url parameter use_lib=1.

Under tcp use the url parameter use_lib=1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108284
Approved by: https://github.com/H-Huang, https://github.com/XilunWu
2023-09-07 21:39:58 +00:00
Aaron Gokaslan
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
Howard Huang
9165d46b89 DDP + C10D sparse all_reduce changes (#103916) (#104256)
Summary:

reland of https://github.com/pytorch/pytorch/pull/103916

## Changes

prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function.

prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D47056695

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104256
Approved by: https://github.com/rohan-varma
2023-06-28 00:37:52 +00:00
PyTorch MergeBot
436d035dc7 Revert "DDP + C10D sparse all_reduce changes (#103916)"
This reverts commit fed5fba6e4.

Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))
2023-06-26 22:37:58 +00:00
Howard Huang
fed5fba6e4 DDP + C10D sparse all_reduce changes (#103916)
Summary:
## Changes

prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function.

prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D46724856

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916
Approved by: https://github.com/rohan-varma
2023-06-26 20:42:17 +00:00
Jon Maltiel Swenson
0da38409a0 [gloo] Make it possible for gloo TCPStore to take over an existing socket fd (#103478)
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.

Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).

Test Plan:
Added new unit test:
  buck2 test caffe2/test/distributed:store

Differential Revision: D46622317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
2023-06-16 17:15:56 +00:00
Pritam Damania
9a2df0a5af [RFC] Add method to DDP to check for backward finalization. (#100773)
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.

As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction.

Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
2023-05-31 20:43:06 +00:00
Xuehai Pan
05f6250815 Add missing torch.distributed.ReduceOp.AVG in type stubs (#101534)
Add missing `AVG` to `torch.distributed.ReduceOp` enum for type annotation.

Ref:

88b6a4577b/torch/csrc/distributed/c10d/Types.hpp (L35-L47)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101534
Approved by: https://github.com/Skylion007
2023-05-16 15:51:21 +00:00
Eddie Yan
c4f81cb6f4 [NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```

CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
2023-04-14 02:03:33 +00:00
PyTorch MergeBot
1149ba5553 Revert "[NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)"
This reverts commit a33eac3988.

Reverted https://github.com/pytorch/pytorch/pull/95715 on behalf of https://github.com/PaliC due to This pr has caused a regression on distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess causing it to timeout (https://hud.pytorch.org/failure/distributed%2Ftest_dynamo_distributed.py%3A%3ATestMultiProc%3A%3Atest_ddp_baseline_aot_eager_multiprocess)
2023-04-12 21:15:49 +00:00
Eddie Yan
a33eac3988 [NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```

CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
2023-04-12 18:33:10 +00:00
Pritam Damania
e20e5f5578 [RFC] Add an API to remove autograd hooks from DDP (#96490)
Summary:
When creating a new DDP instance for the same model when an old DDP instance existed, the autograd hooks from the old DDP instance might not be cleared. Also, relying on python gc to clear out old autograd hooks is fragile and may not work 100% of the time.

As a result, in this PR I'm adding a way to explicitly remove these hooks from DDP

Test Plan:
Unit test added

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96490
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
2023-03-21 02:56:16 +00:00
Xuehai Pan
1fd119948e [3/3] Update .pyi Python stub files and enable 'UFMT' linter (#95268)
Changes:

- #95200

1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.

- #95267

3. Use `Namedtuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:

    `namedtuple + __annotations__`:

    ```python
    PackedSequence_ = namedtuple('PackedSequence_',
                                 ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

    # type annotation for PackedSequence_ to make it compatible with TorchScript
    PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                       'sorted_indices': Optional[torch.Tensor],
                                       'unsorted_indices': Optional[torch.Tensor]}
    ```

    `Namedtuple`: Python 3.6+

    ```python
    class PackedSequence_(NamedTuple):
        data: torch.Tensor
        batch_sizes: torch.Tensor
        sorted_indices: Optional[torch.Tensor]
        unsorted_indices: Optional[torch.Tensor]
    ```

- => this PR: #95268

4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95268
Approved by: https://github.com/huydhn
2023-03-01 23:50:56 +00:00
Howard Huang
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc) collective implementations to corresponding `Backend` class. Update ProcessGroup to support multiple backends and use dispatcher to calls backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122.

#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` how returns a generic `ProcessGroup` object, it contains a list of backends (the ones stated above) which it will dispatch operations to.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
Masaki Kozuki
63e16216d8 [c10d] Implement __instancecheck__ for c10d::ReduceOp (#88275)
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- https://github.com/pybind/pybind11/issues/2696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88275
Approved by: https://github.com/wanchaol
2022-11-15 13:21:41 +00:00
Charlie Yan
62988e4fe6 Update _distributed_c10d.pyi (#88088)
Summary: `_distributed_c10d.pyi` is out of sync with the C++ binding. This change updates it.

Test Plan: TBD

Differential Revision: D40840836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88088
Approved by: https://github.com/wanchaol
2022-11-01 13:51:06 +00:00
PyTorch MergeBot
641d8e0e69 Revert "Enable mypy check for distributed.py, and fix type errors (#87543)"
This reverts commit 2cc624cd43.

Reverted https://github.com/pytorch/pytorch/pull/87543 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-10-28 02:20:25 +00:00
Charlie Yan
2cc624cd43 Enable mypy check for distributed.py, and fix type errors (#87543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87543
Approved by: https://github.com/fduwjj
2022-10-27 00:22:54 +00:00
Masaki Kozuki
aa8248cc9a Reenable isinstance with torch.distributed.ReduceOp (#87303)
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: https://github.com/pytorch/pytorch/issues/87191

cc @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87303
Approved by: https://github.com/wanchaol
2022-10-21 15:05:36 +00:00
PyTorch MergeBot
0e639ff45c Revert "Cleanup PT-D imports (#85781)"
This reverts commit 9a170b24f6.

Reverted https://github.com/pytorch/pytorch/pull/85781 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-10-07 14:55:44 +00:00
Dennis van der Staay
9a170b24f6 Cleanup PT-D imports (#85781)
Summary:
The flow logic around torch.dist imports results in large number of pyre errors (100's); would be preferable to just raise on importing as opposed to silently fail.

Con: Some percentage (MacOS?) of users may have notebooks that imports PT-D, although would think small, since any attempt to call parts of the library would just fail...

TODO: assuming ok, will remove the 10's-100's of unused pyre ignores no longer required.

Test Plan: existing unit tests

Differential Revision: D39842273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85781
Approved by: https://github.com/mrshenli
2022-10-07 00:29:32 +00:00
Howard Huang
586832ce65 Add underlying_store property for PrefixStore (#84640)
Add a property to `PrefixStore` to retrieve the underlying store it is wrapping around. Open for suggestions on property name. This change is based on discussion in [D39225101](https://www.internalfb.com/diff/D39225101) where we need to read properties of the store that PrefixStore is wrapping around.

Differential Revision: [D39311151](https://our.internmc.facebook.com/intern/diff/D39311151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84640
Approved by: https://github.com/xush6528
2022-09-07 23:02:47 +00:00
Howard Huang
e14f46f9dd Add host and port to TCPStore pyi definition (#84636)
`host` and `port` are already exposed in the `TCPStore` pybind definition, this is a small change adding it in the pyi stub

Differential Revision: [D39311153](https://our.internmc.facebook.com/intern/diff/D39311153)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84636
Approved by: https://github.com/wz337
2022-09-07 20:04:14 +00:00
Masaki Kozuki
ab6c57217a Add NCCL PreMul Sum to c10d redce ops (#84243)
This is based on #81272 but this conforms to TorchScript Compiler

## TODO
- [ ] Update abaf8112e6/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L64-L73) to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of `ReduceOp::RedOpType` operator. That being said, I want to make it more explicit.

cc @ptrblck @kwen2501 @aazzolini
cc @zasdfgbnm for visibility to the TODO above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243
Approved by: https://github.com/kwen2501
2022-09-02 21:57:45 +00:00
PyTorch MergeBot
1f61c39ac4 Revert "Support NCCL Premul Sum (#81272)"
This reverts commit 432c508e71.

Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-08-25 05:01:37 +00:00
Masaki Kozuki
432c508e71 Support NCCL Premul Sum (#81272)
This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is cb99ad6744 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501
2022-08-24 04:53:25 +00:00
Terry Lam
54bdaf76d6 [PFC] Native UCC process group for Pytorch (#79918)
Summary:
This diff integrates UCC process group as a native component of Pytorch Distributed core. It is based on the existing torch-ucc (https://github.com/facebookresearch/torch_ucc) as the wrapper for UCC collective communication library.
The environment and cmake variables are named in mirroring to the existing process groups such as NCCL and Gloo. Specifically,
- USE_UCC: enables UCC PG. This defaults to OFF, so there is no breakage of existing builds that do not have UCX/UCC external libraries.
- USE_SYSTEM_UCC: uses external UCX and UCC shared libraries that are set accordingly with UCX_HOME and UCC_HOME.

Currently, this diff only supports USE_SYSTEM_UCC=ON, i.e., requiring users to specify external libraries for UCX and UCC. In subsequent diffs, we will add UCX and UCC repos as third-party dependencies in pytorch/third-party.

Test Plan:
Passed Torch-UCC tests that invoke UCC process group. For example:

$ sh test/start_test.sh test/torch_allreduce_test.py --backend gloo --use-cuda
...
Test allreduce: succeeded

Differential Revision: D36973688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79918
Approved by: https://github.com/kwen2501, https://github.com/kingchc
2022-07-12 14:45:44 +00:00
Ryan Chun
96fb63bc75 [Bootcamp] Set default value of TCPStore world_size to None in pybind definition (#77277)
- Set the default value of TCPStore optional world_size arg to None
- Updated the corresponding test case

Fixes #74488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77277
Approved by: https://github.com/H-Huang
2022-05-12 18:48:48 +00:00
Rohan Varma
ffb0946504 Generalize param verification and broadcast
New PR for https://github.com/pytorch/pytorch/pull/75970 to be compatible with GHF.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76374
Approved by: https://github.com/awgu
2022-04-26 22:25:53 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Rohan Varma
cb14a258a2 [c10d] Fix object-based collectives for debug mode (#68223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68223

DETAIL debug mode didn't work with object-based collectives for NCCL backend, because we'd only check if backend is NCCL and then move tensors to CUDA.

Instead, check if it is a wrapped PG, and then check the pg that is wrapped to see if its nccl.
ghstack-source-id: 143242023

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D32366840

fbshipit-source-id: be0a2af6849f8f24446593f4a4fbea4a67586ee5
2021-11-13 04:18:31 -08:00
Rohan Varma
bff64e84cd [DDP] Track models with sync bn (#66680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680

Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target for perf
optimization.
ghstack-source-id: 140875182

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D31679477

fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
2021-10-18 22:31:52 -07:00
Jessica Choi
158b8bdc8a Cleaning up DDP SPMD in reducer.cpp (#64113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113

Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.

Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30615965

fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
2021-09-21 16:13:18 -07:00
Kiuk Chung
9d95d48567 (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port.

For details see: https://github.com/pytorch/pytorch/issues/63874.

This change does a couple of things:

1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths

NOTE: the issue mentions that we should fail-fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic and I've validated that file is compatible with torchelastic. There is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future but this is out of scope for this issue.

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
2021-08-25 22:57:43 -07:00
Sean Lawlor
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods set_tensor(.) and get_tensor() in the python exposed API from the C++ logic with buffer() and set_buffer(.) to be a cleaner interface.

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
Yi Wang
db071ef005 [Reland][DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592

Reland #62510

`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last,
3) get_per_parameter_tensors -> gradients,
4) get_model_params_for_bucket -> parameters.
ghstack-source-id: 134848352

Test Plan: unit test

Reviewed By: andwgu

Differential Revision: D30049431

fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
2021-08-02 16:38:09 -07:00
Eli Uriegas
6f95850127 Revert D30024161: [DDP Communication Hook] Rename 4 Methods of GradBucket Class
Test Plan: revert-hammer

Differential Revision:
D30024161 (29c8b1db57)

Original commit changeset: 07e6072a2f7b

fbshipit-source-id: d571c2caadaf7b71fe2aba3c0597bd8074d153de
2021-08-02 10:26:54 -07:00
Qing Hu
29c8b1db57 [DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510

`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last,
3) get_per_parameter_tensors -> gradients,
4) get_model_params_for_bucket -> parameters.

Test Plan:
Local run comprehensive test with following results:
https://pxl.cl/1Ml8b
For two timeout failure test cases, most likely environment related and fail in my devserver.

Reviewed By: SciPioneer

Differential Revision: D30024161

fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
2021-08-02 09:33:32 -07:00
Rohan Varma
1f2b96e7c4 [DDP] Make compute_bucket_assignment_by_size return per bucket sizes (#62231)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62231

`compute_bucket_assignment_by_size` is responsible for setting per-bucket size limits, return this information from the function so that we are aware of size limits for each bucket.

This is currently not being consumed, but will be in the next diff when we log bucket size limits to DDP logging. This will help us run experiments under different bucket size configs and analyze the impact.
ghstack-source-id: 134480575

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29919056

fbshipit-source-id: dd5a096fa23d22e5d9dc1602899270a110db4a19
2021-07-28 20:21:01 -07:00
Rohan Varma
04bd9d7577 [DDP] Add API to get model parameters in hook (#61637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61637

To support running optimizer as a communication hook, add API to
retrieve the model parameters.

The API returns a `dict[idx -> tensor]` where `idx` is the intra bucket index of gradient tensor and thus the same index of `perParameterTensors`. The API can be used as follows to retrieve the model parameters:

```
per_param_grad_tensors = bucket.get_per_parameter_tensors()
idx_to_model_params = bucket.get_grad_index_to_variable_mapping()
for grad_tensor_idx, model_param in idx_to_model_params.items():
    self.assertEqual(model_param.grad, per_param_grad_tensors[grad_tensor_idx])
```

This provides a way for comm. hook developer to retrieve model parameters within a hook. In the next diffs, we will use this to run optimizer as a DDP comm. hook.
ghstack-source-id: 133768666

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29691418

fbshipit-source-id: 4bfa824768a5850f73ee330017e2bcc29ceb7edc
2021-07-19 11:24:54 -07:00
Rohan Varma
8162439cbd [DDP] Remove python GradBucket construction (#60301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60301

`GradBucket` is not meant to be constructed by Python user, only
consumed as part of comm. hook
ghstack-source-id: 131860243

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D29239320

fbshipit-source-id: f1631a16e7d66b7e4a9b4a44698e2319005d10b2
2021-06-23 10:05:34 -07:00
Neel Pragnesh Gandhi
2c5db9a40a Add c10d filestore functionality to the current c10d_rendezvous_backend (#59719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59719

Added filestore functionality to the c10d backend. FileStore will create a temporary file in the /tmp directory to use if it is selected as the store type. Appropriate tests were added as well.
FileStore was modified to expose the path field for testing. It was also modified so that the numWorkers field in the constructor is optional (defaulting to -1). A negative value indicates there is not a fixed number of workers. In this case, the file is not attempted to be cleaned at the end.

Test Plan: Unit tests for creating a c10d backend with filestore and simple error handling.

Reviewed By: cbalioglu, H-Huang

Differential Revision: D28997436

fbshipit-source-id: 24c9b2c9b13ea6c947e8b1207beda892bdca2217
2021-06-16 12:13:36 -07:00
Liang Luo
77de640f4b [torch distributed] Implementing reduce_scatter_base (#57567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57567

Support flattened reduce_scatter.

Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/torch/lib/c10d:ProcessGroupNCCLTest
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:c10d

Reviewed By: zhaojuanmao

Differential Revision: D27876281

fbshipit-source-id: 58e2edfb1baff5cdc083dbaaba9f19502ef0b298
2021-06-03 17:17:53 -07:00
Rohan Varma
c03cae49fc [DDP] Remove unused initialize_buckets (#59066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59066

Per title
ghstack-source-id: 130338812

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D28734666

fbshipit-source-id: 89ca7f8e625c4068ba0ed9800be2619e469ae515
2021-06-02 17:20:22 -07:00
Rohan Varma
cf395c0718 [c10d] Introduce ProcessGroupWrapper (#58224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224

Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.

Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (i.e. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.

This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.

Once all of this passes we simply dispatch the collective to the underlying pg.

Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28023981

fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
2021-05-24 20:09:51 -07:00
Howard Huang
149000c3f0 Update compare_set docs (#57203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57203

Update documentation to remove warning. Refactored arguments from `old_value` -> `expected_value` and `new_value` -> `desired_value`

Test Plan: Imported from OSS

Reviewed By: gchanan, cbalioglu

Differential Revision: D28076556

Pulled By: H-Huang

fbshipit-source-id: 5fcc5bcfff89cad51d8dc0b74a234964f1af20ed
2021-04-29 13:58:57 -07:00