Commit Graph

84 Commits

Author SHA1 Message Date
Howard Huang
bc66ddb5cb Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134)
Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro which throws a generic `RuntimeError`. This change introduced a new error type `DistBackendError` which derives from `RuntimeError` to signify there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8

Changes:
- introduce new error type
- Update `C10D_NCCL_CHECK`

Sample script to demonstrate new error type

```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py

import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```
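
To illustrate why deriving the new type from `RuntimeError` keeps existing handlers working, here is a minimal stdlib-only sketch. `BackendErrorSketch` and `failing_collective` are hypothetical stand-ins for the new exception type and a failing backend call; no PyTorch API is invoked, so this is not the actual implementation.

```python
class BackendErrorSketch(RuntimeError):
    """Hypothetical stand-in for an error raised by the backend library."""


def failing_collective():
    # Hypothetical stand-in for a collective whose backend call fails.
    raise BackendErrorSketch("backend communication failed")


def classify(fn):
    """Report which handler caught the error raised by fn."""
    try:
        fn()
    except BackendErrorSketch:
        return "backend"  # new, more specific handler
    except RuntimeError:
        return "generic"  # any other RuntimeError
    return "ok"


def legacy_handler(fn):
    """Code written before the new type existed only knows RuntimeError."""
    try:
        fn()
    except RuntimeError:
        return "caught"
    return "ok"
```

Because the subclass is still a `RuntimeError`, `legacy_handler` keeps catching it, while newer code can branch on the more specific type first.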

Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134
Approved by: https://github.com/rohan-varma
2022-11-08 13:26:42 +00:00
Charlie Yan
ca2dc8b4e7 [1/n] Thread PG: fix pyre error of class ProcessGroup (#88281)
Summary: Fix the typing stub of `ProcessGroup` in "torch/distributed/__init__.py", so that it won't confuse pyre, and we can remove a lot of pyre suppression comments.

Test Plan: pyre check

Differential Revision: D40921667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88281
Approved by: https://github.com/wanchaol
2022-11-02 23:02:08 +00:00
Iris Zhang
0cf572ff6c [C10D][BE] Add exception handlers to c10d collectives function (#87643) (#87988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643

1. Add a decorator function `exception_handlers` to the c10d collectives.
2. Update the test (torch/distributed/distributed_c10d.py) to include multiprocess tests for `exception_handler`.

```
python3 test/distributed/test_c10d_error_logger.py
```
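
As a rough illustration of the decorator pattern described in this commit, the sketch below wraps a function so that any exception is logged and then re-raised. The names `exception_handler_sketch` and `fake_broadcast` are hypothetical, and the logging behavior is an assumption; the real decorator may record different details.

```python
import functools
import logging

logger = logging.getLogger("c10d_sketch")


def exception_handler_sketch(func):
    """Log any exception raised by func, then re-raise it unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as error:
            logger.error("%s failed: %s", func.__name__, error)
            raise
    return wrapper


@exception_handler_sketch
def fake_broadcast(values, src):
    # Hypothetical collective: rejects an out-of-range source rank.
    if not 0 <= src < len(values):
        raise ValueError(f"invalid src rank {src}")
    return [values[src]] * len(values)
```

Using `functools.wraps` keeps the wrapped function's name and docstring intact, which matters when many public functions share one decorator.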

Test Plan: Test in OSS.

Reviewed By: H-Huang

Differential Revision: D40281632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988
Approved by: https://github.com/H-Huang
2022-10-29 04:38:34 +00:00
PyTorch MergeBot
f451e824f3 Revert " C10D extension to enable per-thread PG (#86348)"
This reverts commit 97abc21f2b.

Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests 97abc21f2b
2022-10-14 01:26:46 +00:00
Rodrigo Kumpera
97abc21f2b C10D extension to enable per-thread PG (#86348)
Move a bunch of globals to instance methods and replace all uses of them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enable per-thread PGs.

It almost gets DDP working, but the PG is missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and having the default _World wrap them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
2022-10-13 22:23:28 +00:00
anjali411
e2a4dfa468 Add correct __all__ for torch.distributed and torch.cuda submodules (#85702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85702
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/rohan-varma
2022-10-10 19:15:24 +00:00
PyTorch MergeBot
0e639ff45c Revert "Cleanup PT-D imports (#85781)"
This reverts commit 9a170b24f6.

Reverted https://github.com/pytorch/pytorch/pull/85781 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-10-07 14:55:44 +00:00
Dennis van der Staay
9a170b24f6 Cleanup PT-D imports (#85781)
Summary:
The flow logic around torch.dist imports results in a large number of pyre errors (hundreds); it would be preferable to raise on import rather than fail silently.

Con: Some percentage of users (on macOS?) may have notebooks that import PT-D, although that number is probably small, since any attempt to call parts of the library would fail anyway.

TODO: assuming this is OK, remove the tens to hundreds of unused pyre ignores that are no longer required.

Test Plan: existing unit tests

Differential Revision: D39842273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85781
Approved by: https://github.com/mrshenli
2022-10-07 00:29:32 +00:00
Edward Z. Yang
61b4e8a7bf More SymFloat support (#85411)
- Support storing SymFloat in IValue
- Add SymFloat to JIT type system (erases to float)
- Printing support for SymFloat
- add/sub/mul/truediv operator support for SymFloat
- Support truediv on integers, it returns a SymFloat
- Support parsing SymFloat from Python object

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85411
Approved by: https://github.com/albanD
2022-09-22 08:07:22 +00:00
Rodrigo Kumpera
38192f63cd Add __all__ for a few distributed modules plus a little typing (reland) (#84872)
This handles distributed_c10d, which is massive and ddp_comm_hooks.

This relands #84119 with the required fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84872
Approved by: https://github.com/rohan-varma
2022-09-13 21:57:49 +00:00
PyTorch MergeBot
219ff26172 Revert "Add __all__ for a few distributed modules plus a little typing (#84119)"
This reverts commit 6f21680563.

Reverted https://github.com/pytorch/pytorch/pull/84119 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D39386448
2022-09-09 20:01:07 +00:00
Rodrigo Kumpera
6f21680563 Add __all__ for a few distributed modules plus a little typing (#84119)
This handles distributed_c10d, which is massive and ddp_comm_hooks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84119
Approved by: https://github.com/rohan-varma
2022-09-08 23:28:31 +00:00
Masaki Kozuki
ab6c57217a Add NCCL PreMul Sum to c10d reduce ops (#84243)
This is based on #81272, but it conforms to the TorchScript compiler.

## TODO
- [ ] Update abaf8112e6/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L64-L73) to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of `ReduceOp::RedOpType` operator. That being said, I want to make it more explicit.

cc @ptrblck @kwen2501 @aazzolini
cc @zasdfgbnm for visibility to the TODO above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243
Approved by: https://github.com/kwen2501
2022-09-02 21:57:45 +00:00
PyTorch MergeBot
1f61c39ac4 Revert "Support NCCL Premul Sum (#81272)"
This reverts commit 432c508e71.

Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-08-25 05:01:37 +00:00
Masaki Kozuki
432c508e71 Support NCCL Premul Sum (#81272)
This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum.

The major changes include
- convert enum ReduceOp to struct
- add premul sum specific paths to init.cpp and Ops.cpp.

note:
- For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed

The commit titled "add nccl premul" whose current hash is cb99ad6744 was authored by @mcarilli and @ptrblck.

cc @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272
Approved by: https://github.com/kwen2501
2022-08-24 04:53:25 +00:00
Howard Huang
f76d1c022e [Dynamic RPC] Allow for optional world_size argument in init_rpc (#73372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73372

This PR allows for an optional `world_size` argument in init_rpc. It changes rendezvous to allow `NoneType` for world_size and creates a new code path when initializing the TensorPipe agent for init_rpc. The TensorPipe agent is protected by a critical section enforced using the store, so that only one node can create a TPAgent at a time.
This PR does not yet enable RPC commands between ranks.

Previously:
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", world_size=1, rank=0)
```

Now (only rank is needed):
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", rank=0)
```

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34621651

Pulled By: H-Huang

fbshipit-source-id: 09dbb511d5a00c219a6ce0a35501ff2e388998b0
(cherry picked from commit 834aedc3256167399c323169ef2f0c9b3cf98dff)
2022-03-24 16:19:28 +00:00
Can Balioglu
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566
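
The following sketch mimics the shape of the three APIs using only the standard library. The `_sketch` names are hypothetical, and the case-insensitive parsing and error message are assumptions rather than the documented behavior of the real functions.

```python
import enum
import os


class DebugLevelSketch(enum.Enum):
    OFF = 0
    INFO = 1
    DETAIL = 2


_level = DebugLevelSketch.OFF


def get_debug_level_sketch():
    return _level


def set_debug_level_sketch(level):
    global _level
    _level = level


def set_debug_level_from_env_sketch(environ=None):
    """Read TORCH_DISTRIBUTED_DEBUG and update the module-level setting."""
    environ = os.environ if environ is None else environ
    raw = environ.get("TORCH_DISTRIBUTED_DEBUG", "OFF").upper()
    try:
        set_debug_level_sketch(DebugLevelSketch[raw])
    except KeyError:
        raise ValueError(f"invalid TORCH_DISTRIBUTED_DEBUG value: {raw!r}") from None
```

Passing the environment as a parameter keeps the sketch testable without mutating `os.environ`.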

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
Shen Li
58fefa6516 Add pybind trampoline for ProcessGroup and Work (#66338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66338

This commit exposes the c10d extension API to Python land. Users can
now override c10d communication behaviors in pure Python, and no
longer need to go through the cpp extension steps.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D31514351

Pulled By: mrshenli

fbshipit-source-id: a8b94af0af7960c078e1006c29b25f7f3bd86c81
2021-10-11 06:41:06 -07:00
Jessica Choi
158b8bdc8a Cleaning up DDP SPMD in reducer.cpp (#64113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113

Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.

Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30615965

fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
2021-09-21 16:13:18 -07:00
Pritam Damania
b8e6144e0a Add a _RemoteDevice structure for ShardedTensor/ShardingSpec. (#62927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62927

As part of the ShardedTensor work, we realized we do need some sort of
_RemoteDevice structure that deals with our format of "workername/device" so
that users don't have to worry about parsing this string directly.
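
A minimal sketch of such a container, assuming the string always has the exact form "workername/device". `RemoteDeviceSketch` and its `parse` method are hypothetical and do not reflect the real class's accessors or accepted formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteDeviceSketch:
    worker_name: str
    device: str

    @classmethod
    def parse(cls, spec: str) -> "RemoteDeviceSketch":
        # Split on the first "/" only, so device strings may contain ":".
        worker, sep, device = spec.partition("/")
        if not sep or not worker or not device:
            raise ValueError(f"expected 'workername/device', got {spec!r}")
        return cls(worker, device)
```

For example, `RemoteDeviceSketch.parse("trainer0/cuda:0")` yields a value whose `worker_name` is `"trainer0"` and whose `device` is `"cuda:0"`.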

Right now this structure is just the bare minimum and is mostly a container for
describing a remote device. It is currently only used in ShardedTensor,
ShardingSpec and RemoteModule.

Once we actually have a consolidated remote device proposal, this class can be
extended appropriately if needed.
ghstack-source-id: 135534086

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D30170689

fbshipit-source-id: 1ac2e81c7a597dc40bf3fbf2c1168c382c66649f
2021-08-11 11:27:32 -07:00
Yi Wang
48ea7c808d [C10d] Support subgroups (#59111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111

Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.
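
The partitioning logic can be sketched without any process groups. `partition_ranks` and `subgroup_of` are hypothetical helpers; the sketch assumes that consecutive ranks share a machine and that invalid sizes raise `ValueError`, which the real utility may handle differently.

```python
def partition_ranks(world_size, group_size):
    """Split ranks 0..world_size-1 into consecutive groups of group_size."""
    if group_size > world_size:
        raise ValueError("group_size exceeds world_size")
    if world_size % group_size != 0:
        raise ValueError("world_size is not divisible by group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


def subgroup_of(rank, world_size, group_size):
    """Return the list of ranks in the subgroup that contains rank."""
    return partition_ranks(world_size, group_size)[rank // group_size]
```

With 8 ranks and 4 GPUs per machine, rank 5 would land in the second subgroup together with ranks 4, 6, and 7.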

Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.

Note that this API does not accept another overall main group. Like the APEX API `create_syncbn_process_group` [here](https://nvidia.github.io/apex/_modules/apex/parallel.html), it always uses the global world size and should only be applied when CUDA is available.

#Closes: https://github.com/pytorch/pytorch/issues/53962
ghstack-source-id: 130975027

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed

Reviewed By: rohan-varma

Differential Revision: D28495672

fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
2021-06-09 22:35:11 -07:00
Liang Luo
77de640f4b [torch distributed] Implementing reduce_scatter_base (#57567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57567

Support flattened reduce_scatter.
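
To make the semantics of a flattened reduce-scatter concrete, the sketch below simulates it on plain Python lists with a sum reduction. `reduce_scatter_flat` is a hypothetical single-process helper, not the distributed API: `inputs[r]` plays the role of rank `r`'s flat input buffer, and the return value lists each rank's output chunk.

```python
def reduce_scatter_flat(inputs):
    """Sum the flat buffers elementwise, then give rank r the r-th chunk."""
    world_size = len(inputs)
    length = len(inputs[0])
    if any(len(buf) != length for buf in inputs):
        raise ValueError("all ranks must provide buffers of equal length")
    if length % world_size != 0:
        raise ValueError("buffer length must be divisible by world size")
    chunk = length // world_size
    summed = [sum(column) for column in zip(*inputs)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]
```

Working on one flat buffer per rank avoids building a list of per-rank tensors, which is the point of the "flattened" variant.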

Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/torch/lib/c10d:ProcessGroupNCCLTest
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:c10d

Reviewed By: zhaojuanmao

Differential Revision: D27876281

fbshipit-source-id: 58e2edfb1baff5cdc083dbaaba9f19502ef0b298
2021-06-03 17:17:53 -07:00
Rohan Varma
cf395c0718 [c10d] Introduce ProcessGroupWrapper (#58224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224

Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.

Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (i.e. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.

This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.

Once all of this passes we simply dispatch the collective to the underlying pg.
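
The verification step can be sketched in a single process by treating a list as the result of the allgather. `FingerprintSketch` and `verify_fingerprints` are hypothetical names, and comparing every rank against rank 0 is an assumption about how equivalence is checked.

```python
from typing import List, NamedTuple, Tuple


class FingerprintSketch(NamedTuple):
    op_type: str
    shapes: Tuple[Tuple[int, ...], ...]


def verify_fingerprints(per_rank: List[FingerprintSketch]) -> None:
    """per_rank[i] is what rank i would contribute to the allgather."""
    reference = per_rank[0]
    for rank, fingerprint in enumerate(per_rank):
        if fingerprint.op_type != reference.op_type:
            raise RuntimeError(
                f"rank {rank} ran {fingerprint.op_type}, "
                f"but rank 0 ran {reference.op_type}"
            )
        if fingerprint.shapes != reference.shapes:
            raise RuntimeError(
                f"rank {rank} has shapes {fingerprint.shapes}, "
                f"but rank 0 has {reference.shapes}"
            )
```

Raising an error here turns what would otherwise be a silent hang into a message that names the mismatching rank.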

Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28023981

fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
2021-05-24 20:09:51 -07:00
Liang Luo
c37095760d [torch distributed] Implementing all_gather_base (#56315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56315

This diff implements the all_gather_base in pytorch distributed.

Test Plan: dist.all_gather_base(output, input)...

Reviewed By: agolynski, amylittleyang

Differential Revision: D27488999

fbshipit-source-id: 937ec8bddf9527fa4d114f984d1d0f6a5b8c3936
2021-04-23 14:16:47 -07:00
Aliaksandr Ivanou
8f663170bd [17/n][torch/elastic] Make torchelastic launcher compatible with the caffe2.distributed.launch (#55687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687

The diff makes sure that users can transfer the following parameters:
* master_addr
* master_port
* node_rank
* use_env

The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank #0.

The diff modifies caffe2/rendezvous: if the worker processes are launched with the torchelastic agent, they will create a PrefixStore("worker/") from a TCPStore without a listener.
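
The effect of a prefix store can be sketched over an in-memory dictionary. `DictStoreSketch` and `PrefixStoreSketch` are hypothetical classes, and plain string concatenation of prefix and key is an assumption; the real store classes may join keys differently.

```python
class DictStoreSketch:
    """In-memory stand-in for a shared key-value store."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class PrefixStoreSketch:
    """Namespaces every key of an underlying store with a fixed prefix."""

    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store

    def set(self, key, value):
        self._store.set(self._prefix + key, value)

    def get(self, key):
        return self._store.get(self._prefix + key)
```

Namespacing lets worker processes share the agent's store without their keys colliding with the agent's own keys.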

The diff adds macros functionality to torch/distributed/elastic/utils that helps to resolve the local_rank parameter.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test

Reviewed By: cbalioglu, wilson100hong

Differential Revision: D27643206

fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
2021-04-14 19:33:26 -07:00
Sam Estep
4753100a3b Un-ignore F403 in .flake8 (#55838)
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html

This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).

This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838

Test Plan: CI. You can also run `flake8` locally.

Reviewed By: jbschlosser

Differential Revision: D27724232

Pulled By: samestep

fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
2021-04-13 09:24:07 -07:00
Yi Wang
3e9cbe5ef7 [SPMD] Remove the code branches only used in SPMD mode from distributed.py (#55353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353

Remove all the code branches that will only be executed when `device_ids > 1`.

Some helper functions are also removed:
1.  `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`

The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27552201

fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
2021-04-09 17:27:56 -07:00
Richard Barnes
e5634f5f25 More types for torch (#54037)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54037

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27054755

fbshipit-source-id: f21985e201b35bdb83269595cdcf5e1e64837e52
2021-03-27 08:57:15 -07:00
Rohan Varma
bdbfc2582d [Dist Debugality] Log key DDP metrics to stderr under debug mode. (#52957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52957

This diff:
1. Under TORCH_DISTRIBUTED_DEBUG=INFO or DETAIL, logs DDP information during init time (all stats in ddp_logging_data_)
2. Under TORCH_DISTRIBUTED_DEBUG=DETAIL, logs runtime stats when they are collected (first 10 iterations and then once every 100 iterations). Avoiding logging every iteration to not spam logs.
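
The sampling cadence described in item 2 can be sketched as a small predicate. `should_log_runtime_stats` is a hypothetical helper that assumes 1-based iteration numbers; the actual check in the logger may count differently.

```python
def should_log_runtime_stats(iteration, warmup=10, interval=100):
    """Log the first `warmup` iterations, then every `interval`-th one."""
    return iteration <= warmup or iteration % interval == 0


def logged_iterations(total):
    """List the 1-based iterations that would be logged out of `total`."""
    return [i for i in range(1, total + 1) if should_log_runtime_stats(i)]
```

Over 300 iterations this logs iterations 1 through 10 and then 100, 200, and 300, instead of all 300.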

Verified by inspecting logs:

```
I0226 19:12:47.109243 2818475 logger.cpp:140] [Rank 1]: DDP Initialized with:
world_size: 2 module_name: Linear device_ids: 1 output_device: 1 backend_name: nccl parameter_dtype: float total_parameter_size_in_bytes: 40 num_parameter_tensors: 2 bucket_sizes: 40 CUDA_VISIBLE_DEVICES: N/Abroadcast_buffers: 1 bucket_cap_mb: 25 find_unused_parameters: 0 gradient_as_bucket_view: 0
 Backend Info: nccl_socket_ifname: N/A nccl_blocking_wait: N/A nccl_debug: WARN nccl_nthreads: N/A nccl_ib_timeout: N/A
I0226 19:12:47.109252 2818473 logger.cpp:140] [Rank 0]: DDP Initialized with:
world_size: 2 module_name: Linear device_ids: 0 output_device: 0 backend_name: nccl parameter_dtype: float total_parameter_size_in_bytes: 40 num_parameter_tensors: 2 bucket_sizes: 40 CUDA_VISIBLE_DEVICES: N/Abroadcast_buffers: 1 bucket_cap_mb: 25 find_unused_parameters: 0 gradient_as_bucket_view: 0
 Backend Info: nccl_socket_ifname: N/A nccl_blocking_wait: N/A nccl_debug: WARN nccl_nthreads: N/A nccl_ib_timeout: N/A
```

```
I0226 19:12:48.117936 2818473 logger.cpp:286] [Rank 0 / 2] Training Linear unused_parameter_size=0
 Avg forward compute time: 568944
 Avg backward compute time: 885504
Avg backward comm. time: 692496
 Avg backward comm/comp overlap time: 113536
I0226 19:12:48.118517 2818475 logger.cpp:286] [Rank 1 / 2] Training Linear unused_parameter_size=0
 Avg forward compute time: 565584
 Avg backward compute time: 876992
Avg backward comm. time: 201872
 Avg backward comm/comp overlap time: 128624
```
ghstack-source-id: 123171875

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26708184

fbshipit-source-id: 16defd5610d28bc4cf3fc2a0cc564e84efcfa791
2021-03-05 11:23:18 -08:00
Rohan Varma
68134374cb Refactor/fix DDP model check during init (#52887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887

This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.

There were a few things that could be improved with the way we verify model across processes in DDP initialization:

1. We should do this check before syncing module states in DDP init, otherwise with Gloo backend this will throw but we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run model consistency check before creating reducer
- add helper functions to build params to pass into reducer
- add helper function to call `_verify_model_across_ranks`
- move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP

In follow-up changes we will add the ability to detect which rank had an inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which rank(s) had errors).
ghstack-source-id: 123171877

Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks

Reviewed By: zhaojuanmao

Differential Revision: D26565290

fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
2021-03-05 11:21:45 -08:00
Yi Wang
68b62493b8 [Gradient Compression] Make GradBucket class public (#53099)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53099

Publish GradBucket APIs for publishing DDP communication hooks.

s/_GradBucket/GradBucket
ghstack-source-id: 123030921

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D26721121

fbshipit-source-id: ee5f68e33095b9965b51937b86cdeb331fd2419a
2021-03-03 19:22:15 -08:00
Rohan Varma
7cfe140705 Add distributed debug mode func to python (#52481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52481

Adds an API `get_debug_mode` that can be used by distributed package and users to retrieve debug mode. Currently no functionality changes, but wanted to get the bare bones function out and add relevant debug mode logging in follow up diffs.
ghstack-source-id: 122471216

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D26508972

fbshipit-source-id: d1153774f8697bc925a05db177d71c0566d25344
2021-02-25 22:35:55 -08:00
Yanli Zhao
18e0a61388 add more logging fields that can be set in construction time (#51260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51260

add more logging fields to DDPLoggingData, including param stats, bucket stats, environment variables, nccl version, data type
ghstack-source-id: 121260224

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D26118245

fbshipit-source-id: ba48b7a11340bda1f5f3b24c8603545d346361e9
2021-02-09 21:58:58 -08:00
Yanli Zhao
250c71121b Create a DDPLoggingData and expose it to python interface (#50622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622

1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose get_ddp_logging_data() method in python so that users can get the logging data and dump in their applications
4. Add a unit test verifying that the logging data can be set and retrieved as expected
5. Follow-ups will add more logging fields such as perf stats, internal states, environment variables, etc.
ghstack-source-id: 120275870

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D25930527

fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
2021-01-25 15:23:07 -08:00
Joe Zhu
92f376147c Enable TCPStore on Windows (#47749)
Summary:
Enable TCPStore for DDP on the Windows platform, in order to improve the performance of running DDP across machines.

Related RFC is https://github.com/pytorch/pytorch/issues/47659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47749

Reviewed By: bdhirsh

Differential Revision: D25220401

Pulled By: mrshenli

fbshipit-source-id: da4b46b42296e666fa7d8ec8040093de7443a529
2020-12-03 08:32:01 -08:00
Xu Zhao
73a3e70b24 Add type annotations for torch._C._distributed_c10d module. (#46623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46623

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D24761606

Pulled By: xuzhao9

fbshipit-source-id: 827eaf2502e381ee24d36741c1613b4c08208569
2020-11-06 01:28:48 -08:00
Mingzhe Li
281463ba0b [NCCL] Enable send/recv tests (#45994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994

Send/Recv tests were disabled because of the https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569

Test Plan: waitforsandcastle

Reviewed By: jiayisuse

Differential Revision: D24172484

fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
2020-10-09 15:00:39 -07:00
Mingzhe Li
59083d6176 [NCCL] Support NCCL Send/Recv (#44921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
ghstack-source-id: 113592785

Test Plan: unittest

Reviewed By: jiayisuse

Differential Revision: D23709848

fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
2020-10-05 18:27:57 -07:00
gunandrose4u
47debdca42 Document change for DDP enabled on Windows platform (#45392)
Summary:
Document change for DDP enabled on Windows platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45392

Reviewed By: gchanan

Differential Revision: D23962344

Pulled By: mrshenli

fbshipit-source-id: 8924c6ca36d68699871d8add3e0aab6542ea269c
2020-09-28 13:22:42 -07:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Shen Li
b982a6a247 Expose torch.distributed.is_available() API (#37021)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37021

Test Plan: Imported from OSS

Differential Revision: D21164318

Pulled By: mrshenli

fbshipit-source-id: 08a446af342cbe54f3eb4994956ffa7ef4922bcf
2020-04-21 18:38:46 -07:00
Pieter Noordhuis
b4df413712 Scope pybind11 functions to torch.distributed.{autograd,rpc}
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27529

Test Plan: Imported from OSS

Differential Revision: D17808209

Pulled By: pietern

fbshipit-source-id: 1e3e086085167320c3fc369467f5d75ce39fa4ea
2019-11-05 06:25:22 -08:00
なるみ
d83389d327 Ignore F401 in all __init__.py without putting noqa (#25823)
Summary:
By adding `per-file-ignores = __init__.py: F401` into `.flake8` with `flake8>=3.7`, we can ignore F401 in all `__init__.py` without putting `# noqa: F401` line by line.

http://flake8.pycqa.org/en/latest/user/options.html?highlight=per-file-ignores#cmdoption-flake8-per-file-ignores
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25823

Differential Revision: D17252182

Pulled By: soumith

fbshipit-source-id: 87b174075b79e4078953a7521bd1a8f82405646b
2019-10-23 15:28:13 -07:00
Pieter Noordhuis
b4ce922b58 Move RPC API to torch.distributed.rpc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27290

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D17808212

Pulled By: pietern

fbshipit-source-id: c79907940fe4888b2ceaaa1cda0078e39c89b454
2019-10-08 11:31:25 -07:00
Shen Li
2486b0ba82 Add Python RRef as args and return value (#25499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499

See #23110 for model parallel design details, and #26759 for the RRef
protocol. This commit adds support for using RRef as Python UDF arguments
and return value. RRefs can now be shared from owner to user, from user to
owner, or from user to user.

Limitations:
1. No implicit type conversion yet. (#27099)
2. No failure handling and retry. (#26116)
3. UDF is not yet blocked until all RRefs are confirmed. (#27098)
4. Internal RRef control messages are not idempotent yet. (#26116)
5. Cannot delete RRefs correctly when there are circular dependencies. (#27096)

Main changes:

1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations.
2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages.
3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`.
4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure.
5. Update `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs.
6. RRef context (reference count, etc.) is tracked in `rref_context.h` and `rref_context.cpp`.

Test Plan:
Imported from OSS

buck test mode/dev-nosan //caffe2/test:rpc_fork

Differential Revision: D17184146

Pulled By: mrshenli

fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265
2019-10-03 17:47:12 -07:00
Shihao Xu
00e588290b Add test case for init_rpc_backend (#26997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26997

Reverting accidental change in https://github.com/pytorch/pytorch/pull/26919
ghstack-source-id: 91126906

Reviewed By: zhaojuanmao

Differential Revision: D17637468

fbshipit-source-id: 9ffcf4b15b37effe6b5d5f82338ff89298c82a52
2019-10-01 15:44:34 -07:00
Shihao Xu
b0b0f2c65f Make ProcessGroupAgent take num_send_recv_threads as constructor argument (#26313)
Summary:
# Problem

If there are not enough threads in the RPC agent thread pool, circularly dependent work items can cause a deadlock.

The current way to get around this deadlock is to provide an abundant number of threads.
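
The starvation pattern can be reproduced with the standard library's thread pool. `run_nested` is a hypothetical helper: a task submits a second task to the same pool and waits for it. The inner wait uses a timeout so the sketch always terminates; with a single thread the wait times out where an unbounded wait would deadlock.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_nested(num_threads, timeout_s=0.5):
    """Return 42 if the nested task completes, or None if it is starved."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:

        def inner():
            return 42

        def outer():
            # Blocks a pool thread while waiting on work queued to the same pool.
            return pool.submit(inner).result(timeout=timeout_s)

        try:
            return pool.submit(outer).result()
        except FutureTimeout:
            return None
```

With `num_threads=2` the nested task finds a free thread and the call returns 42; with `num_threads=1` the only thread is occupied by `outer`, so the call returns `None`.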

# Solution

as titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26313

Differential Revision: D17405491

Pulled By: xush6528

fbshipit-source-id: a1d9b6a84db0371cd4b63328fa00f651c0808485
2019-09-18 10:36:29 -07:00
Satendra Gera
e09c5e69f4 Dynamic registration of RPC backends (#25734)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25734

[pytorch] Dynamic registration of RPC backends
Allow non-pg rpc backends to be plugged in as a backend.
ghstack-source-id: 89938296

Differential Revision: D17183789

fbshipit-source-id: 885fed12d80b82b60f9a125f78302a161e708089
2019-09-11 21:48:44 -07:00
Shen Li
c881136215 Move worker name collection code from Python to C++ (#24260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24260

This also simplifies ProcessGroupAgent constructor signature.

Test Plan: Imported from OSS

Differential Revision: D16789219

Pulled By: mrshenli

fbshipit-source-id: bbb69022435467fbb1c28da21dd03d3ab52fc521
2019-08-31 19:02:45 -07:00
Shen Li
1294e55c15 Assign each RpcAgent a unique ID, and use ID for sending RPC messages. (#24195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24195

It is not efficient to use a string destination name in every
send. Moreover, when we add RRef later, RpcAgent will frequently check
RRef ownership. It will be slow as well if we have to go through string
comparison every time. This commit assigns each RpcAgent a unique
integer ID. In the Python send API, applications can provide either
destination name or id. If it is a string name, it will be converted to
id by calling the get_id(workerName) API.
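
The name-to-ID resolution can be sketched as follows. `WorkerRegistrySketch` and its `resolve` method are hypothetical; only the `get_id(workerName)` call is named in the commit message, and assigning IDs by list position is an assumption.

```python
class WorkerRegistrySketch:
    """Maps worker names to integer IDs assigned by position."""

    def __init__(self, names):
        self._ids = {name: index for index, name in enumerate(names)}

    def get_id(self, worker_name):
        return self._ids[worker_name]

    def resolve(self, dst):
        """Accept either a worker name or an ID and return the ID."""
        if isinstance(dst, str):
            return self.get_id(dst)
        if isinstance(dst, int) and 0 <= dst < len(self._ids):
            return dst
        raise ValueError(f"unknown destination: {dst!r}")
```

Resolving the name once at the API boundary means every later comparison works on integers rather than strings.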

Test Plan: Imported from OSS

Differential Revision: D16770241

Pulled By: mrshenli

fbshipit-source-id: fa56128a77a02a402dc6682474bc301dc1b7f43d
2019-08-29 19:19:11 -07:00