Commit Graph

1403 Commits

Author SHA1 Message Date
Yi Wang
459270ac01 [Gradient Compression] Apply division first to avoid overflow (#59522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59522

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
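
A minimal Python sketch of the idea (illustrative only, assuming an already-initialized process group; the real fix lives in the C++ and Python comm hooks):

```
import torch
import torch.distributed as dist

def fp16_compress_allreduce(grad: torch.Tensor) -> torch.Tensor:
    # Divide first: each rank contributes grad / world_size, so the summed
    # FP16 values stay in range, whereas sum-then-divide can overflow.
    compressed = grad.to(torch.float16).div_(dist.get_world_size())
    dist.all_reduce(compressed)
    return compressed.to(grad.dtype)
```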
ghstack-source-id: 130686229

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28922548

fbshipit-source-id: 442bd3cc7a35a8b948f626062fa7ad2e3704c5be
2021-06-07 01:43:10 -07:00
Yi Wang
3137bbeb1a [Reland][DDP] Merge work and future_work in reducer (#59520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59520

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.

Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, this reland updates `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply the division first and hence avoid FP16 overflow.

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130685351

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view

Reviewed By: walterddr

Differential Revision: D28922305

fbshipit-source-id: 6388a96eda7a06f292873afed6d1362096c13e1c
2021-06-06 09:49:08 -07:00
Can Balioglu
1d9c1cc00a [4/n] [c10d] Introduce the multi-tenancy feature in TCPStore (#58331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58331

This PR is the final part of a stack that addresses the GitHub issue #41614; it introduces the multi-tenancy feature to the `TCPStore` class, allowing two server stores to be instantiated with the same host:port pair.
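
A rough sketch of what multi-tenancy permits; the Python keyword name `multi_tenant` used below is an assumption (the C++ option is `multiTenant`), and the host/port values are illustrative:

```
from datetime import timedelta
from torch.distributed import TCPStore

# Two server stores bound to the same host:port pair; without the
# multi-tenancy feature the second construction would fail with
# "address already in use".
opts = dict(multi_tenant=True)  # assumed Python spelling of the C++ multiTenant option
store_a = TCPStore("localhost", 29500, 1, True, timedelta(seconds=30), **opts)
store_b = TCPStore("localhost", 29500, 1, True, timedelta(seconds=30), **opts)
```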
ghstack-source-id: 130676394

Test Plan:
- Run the existing and newly-introduced tests.
- Run several smoke tests including the short code snippet referred in GitHub issue #41614.

Reviewed By: H-Huang

Differential Revision: D28453850

fbshipit-source-id: f9066b164305de0f8c257e9d5736e93fd7e21ec6
2021-06-05 07:50:07 -07:00
Can Balioglu
844a98758a [3/n] [c10d] Revise the implementation of TCPStore (#58330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58330

This PR is part of a stack that addresses the GitHub issue #41614; it introduces a major refactoring of the `TCPStore` class in preparation for the multi-tenancy feature.

- All TCP sockets are wrapped with a new `TCPSocket` RAII type.
- `BackgroundThread` and daemon types are moved from header to cpp file.
- Server, client, and callback sockets are refactored into their own internal types `TCPServer`, `TCPClient` and `TCPCallbackClient`.
- Calls to `tcputil::send*` and `tcputil::recv*` are wrapped in `TCPClient` for easier readability and maintenance purposes.
- Two `TODO` statements are put to reference future improvements. Based on feedback, I will either create separate GitHub issues for them or address them as part of this stack.
ghstack-source-id: 130676392

Test Plan: Run the existing tests since there are no user-facing behavioral changes.

Reviewed By: H-Huang

Differential Revision: D28448981

fbshipit-source-id: 415b21e74b3cd51d673c1d5c349c6a2cb21dd667
2021-06-05 07:50:06 -07:00
Can Balioglu
4ee761c2c5 [2/n] [c10d] Introduce the 'multiTenant' constructor parameter in TCPStore (#58329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329

This PR is part of a stack that addresses the GitHub issue #41614; it introduces:

- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.

- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.

Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` results only in a warning output.
ghstack-source-id: 130676389

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: rohan-varma

Differential Revision: D28424978

fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
2021-06-05 07:50:04 -07:00
Can Balioglu
cf408c3743 [1/n] [c10d] Introduce a new TCPStore constructor (#58328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58328

This PR is part of a stack that addresses the GitHub issue #41614; it introduces a new `TCPStore` constructor that takes its optional parameters via a newly introduced `TCPStoreOptions` structure. This gives the API callers the flexibility to specify only the desired options while skipping the rest.

The main motivation behind this change is the introduction of the `multiTenant` constructor option in the second PR of this stack.
ghstack-source-id: 130676384

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: H-Huang

Differential Revision: D28417742

fbshipit-source-id: e6ac2a057f7ad1908581176ee6d2c2554c3c74a9
2021-06-05 07:50:02 -07:00
Rong Rong (AI Infra)
c88a0b55b3 Revert D28677383: [DDP] Merge work and future_work in reducer
Test Plan: revert-hammer

Differential Revision:
D28677383 (f8bebade47)

Original commit changeset: 85e0620378b7

fbshipit-source-id: ef3c65b88c375aa9a6befe2ab004ec37ae7eb587
2021-06-05 07:25:44 -07:00
Yi Wang
f8bebade47 [DDP] Merge work and future_work in reducer (#58937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58937

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130673249

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs

Reviewed By: agolynski

Differential Revision: D28677383

fbshipit-source-id: 85e0620378b7e9d837e436e94b9d807631d7d752
2021-06-05 01:18:30 -07:00
Alexander Golynski
1183fa3817 Switch PG::Work to Future in default_comm_hooks.cpp (#59398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59398

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28876182

Pulled By: agolynski

fbshipit-source-id: 9d8f09ffa2f40bb0fb25c626b52678a1597a797e
2021-06-04 15:27:13 -07:00
Liang Luo
77de640f4b [torch distributed] Implementing reduce_scatter_base (#57567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57567

Support flattened reduce_scatter.
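
A hedged usage sketch, assuming the op is exposed to Python as `torch.distributed._reduce_scatter_base` and an NCCL process group is already initialized; the flattened variant takes one contiguous input instead of a list of per-rank tensors:

```
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
chunk = 4
# One flat input of world_size * chunk elements; each rank receives one chunk.
inp = torch.ones(world_size * chunk, device="cuda")
out = torch.empty(chunk, device="cuda")
dist._reduce_scatter_base(out, inp)  # every element of out equals world_size
```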

Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/torch/lib/c10d:ProcessGroupNCCLTest
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:c10d

Reviewed By: zhaojuanmao

Differential Revision: D27876281

fbshipit-source-id: 58e2edfb1baff5cdc083dbaaba9f19502ef0b298
2021-06-03 17:17:53 -07:00
Rohan Varma
332b01e93f [DDP] log usage of torch_distributed_debug (#59351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59351

Logging PT distributed debug level to track usage internally.
ghstack-source-id: 130443122

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28854914

fbshipit-source-id: a8e85ca4a3c9ac2f18d13190e87c0ebc4a8e7ea2
2021-06-03 11:49:23 -07:00
Richard Barnes
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
Rohan Varma
79aeca0b00 [DDP] Log when errors happen (#59281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281

Adds the ability to log when the reducer/DDP encounters an error. We add the fields "has_error" and "error" to indicate that an error has
occurred in this iteration; the other fields (performance stats) are not
guaranteed to be updated.

Errors encountered in python-side DDP will be added in the next diff.
ghstack-source-id: 130412974

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28652717

fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
2021-06-02 19:48:26 -07:00
Rohan Varma
1968efa2dd [c10d] Remove verbose log (#59070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59070

This log is too verbose, especially in the case where we call monitored
barrier before every collective, as we do in ProcessGroupWrapper.
ghstack-source-id: 130052822

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28738189

fbshipit-source-id: f2899537caa4c13508da31134d5dd0f4fd6a1f3a
2021-06-02 13:50:11 -07:00
Michael Suo
b977a3b66d [c10d] Split custom class bindings out of python binding code (#58992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58992

Currently, we define Torchbind custom classes in the same place that we define Python bindings.

This is nice from a code location perspective, but has two downsides:
1. These custom classes are not available in a C++-only build.
2. These break when included in torch::deploy.

Some explanation on the second issue: torch::deploy creates many Python
interpreters, and creates a full copy of all the bindings for each one. This
will run the static initialization code once for each copy of the bindings,
leading to multiple registration of the custom classes (and therefore an
error).

This PR splits out the relevant custom class binding code into its own source
file to be included in libc10d, which can be compiled and statically
initialized a single time and linked against from the c10d python bindings.
ghstack-source-id: 130168942

Test Plan: CI

Reviewed By: wconstab

Differential Revision: D28690832

fbshipit-source-id: 3c5e3fff28abb8bcdb4a952794c07de1ee2ae5a8
2021-05-28 15:35:23 -07:00
Nikita Shulga
0e9a295b41 Refactor GlooDeviceFactory::makeDeviceFor... (#58996)
Summary:
`makeDeviceForHostname` and `makeDeviceForInterface` are almost duplicates, differing only in their default argument values.

Create a generic `makeGlooDevice` anonymous function that takes both the host name and the interface name, and call it from both makeDeviceFor[Hostname|Interface].

Also solve two other minor issues:
 - do not call `getenv("GLOO_DEVICE_TRANSPORT")` at library load time
 - raise an exception rather than crash if GLOO_DEVICE_TRANSPORT is set to an unknown value

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58996

Reviewed By: pbelevich

Differential Revision: D28713324

Pulled By: malfet

fbshipit-source-id: cb33b438078d163e3ec6f047f2e5247b07d94f8d
2021-05-26 20:33:11 -07:00
Rohan Varma
cf395c0718 [c10d] Introduce ProcessGroupWrapper (#58224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224

Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.

Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (e.g. allreduce on one rank and allgather on another).
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shape verification for allgather/allreduce_coalesced is omitted because those ops legitimately accept tensors of different shapes and don't error out.

This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.

Once all of this passes we simply dispatch the collective to the underlying pg.
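
A conceptual Python sketch of the fingerprint check described above (not the actual C++ implementation; the fingerprint contents here are simplified to a fixed-size tensor):

```
import torch
import torch.distributed as dist

def verify_collective(op_id: int, tensor: torch.Tensor, group=None):
    # Fixed-size fingerprint: the collective type plus the tensor element count.
    fingerprint = torch.tensor([op_id, tensor.numel()], dtype=torch.long)
    gathered = [torch.empty_like(fingerprint)
                for _ in range(dist.get_world_size(group))]
    dist.all_gather(gathered, fingerprint, group=group)
    for rank, other in enumerate(gathered):
        if not torch.equal(other, fingerprint):
            raise RuntimeError(
                f"Detected mismatched collective on rank {rank}: "
                f"{other.tolist()} vs {fingerprint.tolist()}")
```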

Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28023981

fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
2021-05-24 20:09:51 -07:00
Rohan Varma
76ce925257 [c10d] Fix monitored_barrier with wait_all_ranks (#58702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58702

Off-by-one error when determining whether some ranks failed with
`wait_all_ranks=True`. This wasn't caught by tests because the tests only
covered failure scenarios, not success scenarios with `wait_all_ranks=True`.
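
For context, the call whose success path was affected (Gloo backend; assumes an initialized process group):

```
import torch.distributed as dist

# With wait_all_ranks=True, rank 0 waits for every rank and reports all failed
# ranks at once instead of failing fast on the first missing rank.
dist.monitored_barrier(wait_all_ranks=True)
```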
ghstack-source-id: 129559840

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28583235

fbshipit-source-id: a8f376efb13a3f36c788667acab86543c80aff59
2021-05-21 09:40:50 -07:00
Rohan Varma
b301558410 [Reducer] Remove replica size == 1 checks (#58603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58603

No longer need these checks
ghstack-source-id: 129498227

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28549893

fbshipit-source-id: a89bf8c3fc3aba311a70fd37e5a6aa5dc14b41b9
2021-05-20 22:34:23 -07:00
Rohan Varma
88c76b43fb [Reducer] move comment to the right place (#58594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58594

This comment was misplaced after some changes, move it to the right
place.
ghstack-source-id: 129498228

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28548100

fbshipit-source-id: a9163fc3b25a9d9b8b6d4bfa2a77af290108fc09
2021-05-20 22:34:17 -07:00
Rohan Varma
d83c5a5c7f Format reducer.cpp, hpp (#58593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58593

Per title
ghstack-source-id: 129498230

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28528465

fbshipit-source-id: 89e4bfcb4a0275dc17090a934d4c0a41a3c54046
2021-05-20 22:32:30 -07:00
Rohan Varma
62adf9e1c9 [Reducer] Completely remove VariableIndex (#58592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58592

Completely removes VariableIndex from the reducer code, as it is no longer
needed. replica_index is always 0, so simplify the code to use only the
parameter index. Next, we should also remove all of the nested data structures
that were needed when num_replicas > 1 was possible.
ghstack-source-id: 129498226

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28528440

fbshipit-source-id: e0568399264ab4f86de3b7a379a4f0831f8f42e9
2021-05-20 19:47:50 -07:00
Rohan Varma
faa7d3793d [DDP] Support not all outputs used in loss calculation (#57081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57081

Changes in this diff:

1. Enable the passthrough autograd function when find_unused_parameters=True.
2. With the above, move prepare_for_backward, which performs the unused-parameter checking logic, to the beginning of the backward pass, only when find_unused_parameters=True.
3. Enhance the unused-parameter checking to account for outputs not being used in the loss.

The way (3) is implemented is by triggering the autograd hook corresponding to parameters that did not participate in the loss computation. Since they did not participate, the autograd hook is triggered with a gradient of None, and the reducer handles this appropriately to ensure that the gradient is not touched.

Tested by ensuring that when a model output is not used in the loss, the corresponding grad is not modified. Also verified that the grads are the same in the local vs. DDP training case, and that gradients are not touched in this case, i.e. if a grad is originally None, it stays None (not zero) afterwards.

Note that in this diff we are not enabling the passthrough autograd function for the regular find_unused_parameters=False case, because that has a much bigger blast radius and needs additional careful analysis, especially with regard to performance.
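
The use case this enables, sketched in Python (the model and shapes are illustrative; assumes a process group is already initialized):

```
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TwoHeadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(10, 10)
        self.b = nn.Linear(10, 10)

    def forward(self, x):
        return self.a(x), self.b(x)

model = DDP(TwoHeadModel(), find_unused_parameters=True)
out_a, out_b = model(torch.randn(8, 10))
loss = out_a.sum()  # out_b never participates in the loss
loss.backward()     # grads of self.b stay untouched (None remains None)
```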
ghstack-source-id: 129425139

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28048628

fbshipit-source-id: 71d7b6af8626804710017a4edd753787aa9bba61
2021-05-20 08:34:33 -07:00
Ching-Hsiang Chu
b9b8522e00 [profile] fix recorded data type (#58531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58531

Fix the data type of alltoall(v) when recording communication metadata via DebugInfo in the NCCL PG.

Reviewed By: chaekit

Differential Revision: D28529372

fbshipit-source-id: 2917653f73f5fe4f6dc901803235994ca042bba2
2021-05-19 14:14:54 -07:00
Rohan Varma
1ba05efd26 [Reducer] Remove some unused variables (#58524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58524

Per title
ghstack-source-id: 129311600

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28528223

fbshipit-source-id: 239a15de4b602e35ed9b15b8a4bea3c28b61de12
2021-05-19 09:55:04 -07:00
Yanli Zhao
ea0f7c4720 move unused parameters to end of bucket orders when rebuild buckets for static graph (#58097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58097

move unused parameters to end of bucket orders when rebuild buckets for static graph

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D28366689

fbshipit-source-id: fbd224aeb761d5aa3bab35a00d64974eb4455b2e
2021-05-18 16:36:40 -07:00
zhouzhuojie
eab59bae15 Fix cmake_minimum_require in libshm (#58306)
Summary:
Deprecation warning reported by cmake:

```
CMake Deprecation Warning at CMakeLists.txt (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of CMake.
  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.
```

This is the only place that requires bumping the minimum version. There are two others, but they are only in the `third_party` folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58306

Reviewed By: bdhirsh

Differential Revision: D28446097

Pulled By: zhouzhuojie

fbshipit-source-id: af5ef50e61bd57dc36089ebe62db70ba0081864c
2021-05-17 09:55:07 -07:00
Yi Wang
581bf01074 [Gradient Compression] Remove unnecessary warning on the rst file and the check on C++ version (#58170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58170

Now that comm hooks are supported on the MPI and Gloo backends besides NCCL, these warnings and this check are no longer needed.
ghstack-source-id: 128799123

Test Plan: N/A

Reviewed By: agolynski

Differential Revision: D28388861

fbshipit-source-id: f56a7b9f42bfae1e904f58cdeccf7ceefcbb0850
2021-05-12 14:15:10 -07:00
Alexander Golynski
4ef94265e9 Add Futures to ProcessGroupGloo (#57818)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57818

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28304171

Pulled By: agolynski

fbshipit-source-id: dbf7f5538890d138582831aa0279ede89619ea1e
2021-05-11 14:47:09 -07:00
Erjia Guan
d49f6d556b [DataLoader] Fix tempfile binding and removing for torch_shm_manager (#57566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57566

Fix the problem that the `tempfile` was never deleted, even after `torch_shm_manager` was destroyed.
- The previous implementation used the wrong path length for the Linux socket. It caused us to lose the last character of the `tempfile` name when binding the pathname to the socket, so in the end we could not delete this file due to the unexpected file name.
- After solving the race condition by introducing a temporary directory, this becomes more dangerous, since it prevents `torch_shm_manager` from deleting the directory as long as the tempfile persists in the temporary directory.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28202866

Pulled By: ejguan

fbshipit-source-id: 912cfd8fec0cc309d47df223b2b0faa599c60799
2021-05-11 14:14:58 -07:00
Yanli Zhao
ea421fb249 enable static graph training in DDP (#55248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248

This PR enables static graph training when users call _set_static_graph(). This can help support more use cases in DDP without performance regression, and can potentially improve performance when there are unused parameters in the graph.
1. The first iteration records graph states such as how many times a grad is calculated and whether the grad is used or not; the first iteration then queues a delay_all_reduce callback to all-reduce the grads.
2. Since an autograd callback is associated with the current target graph task, the delay_all_reduce callback should be associated with the outermost backward graph task. A DDP sink layer is added in the DDP forward loop so that we can queue the delay_all_reduce callback in the sink layer.
3. After the first iteration, DDP uses the saved graph states to determine whether a grad is used and whether a grad is ready for communication.
4. Bucket rebuilding happens in the second iteration, after graph states are recorded in the first iteration.
5. If the graph states change, DDP will throw errors.
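
A minimal usage sketch of the private API mentioned above (the model, data loader, loss function, and optimizer are placeholders; assumes an initialized process group):

```
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(my_model)      # my_model: an nn.Module whose graph does not change
model._set_static_graph()  # tell DDP the autograd graph is static across iterations

for inputs, target in loader:
    loss = loss_fn(model(inputs), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```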
ghstack-source-id: 128599464

Test Plan: unit tests. adding more tests

Reviewed By: rohan-varma

Differential Revision: D27539964

fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
2021-05-11 10:23:25 -07:00
Rohan Varma
5840c8cfd8 [nccl] log rank when communicator is aborted (#57974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57974

We see this error quite a bit in internal workflows, so it would be useful
to have this additional logging information here.
ghstack-source-id: 128602199

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28331693

fbshipit-source-id: 25398c6a3420a2b594d79aa8f46936cd0addd426
2021-05-10 21:23:31 -07:00
Alexander Golynski
db412a6885 Avoid 2 extra copies when reducing sparse tensors and fix result() vs inplace output discrepancy (#57822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57822

* `AsyncSparseAllreduceWork` can avoid copying output tensors, since we keep all the results alive by modifying the input vector directly.
* `AsyncSparseAllreduceWork` now returns the inputs back to the user instead of the former behavior where it returned copies of the inputs. This is consistent with other operations and process group implementations.
* `AsyncSparseAllreduceCUDAWork` now copies tensors directly from the CPU to the input tensors, avoiding the extra copies `output` -> `outputs` -> `inputs`. The inputs are returned back to the user. This is consistent with other operations and process group implementations.

Overall, AsyncSparseAllreduceCUDAWork now avoids 2 extra copies (as AsyncSparseAllreduceCUDAWork uses AsyncSparseAllreduceWork's impl).

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28298325

Pulled By: agolynski

fbshipit-source-id: 18e2104413cdf5e73a01aad464e2613807779297
2021-05-07 15:12:58 -07:00
Pavel Belevich
96e1a83fb2 Add Gloo TCP_TLS transport (#56442)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56442

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27896285

Pulled By: pbelevich

fbshipit-source-id: 589af59ca4c7c9bab2329f079382c09b71cfcf9e
2021-05-07 13:36:11 -07:00
Luca Wehrstedt
36e47af58b Pass reference to parent future in callbacks (#57635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635

Note: this PR looks massive, but it's just one simple change, codemodded many times.

In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (leads to reference cycle and thus memory leak) and must be done carefully (spoiler: sometimes we weren't).
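
For reference, the Python behavior the C++ callbacks now mirror, sketched with `torch.futures`:

```
import torch.futures

fut = torch.futures.Future()
# The callback receives the completed parent future as its argument, so it can
# read the value without capturing `fut` itself and creating a reference cycle.
chained = fut.then(lambda parent: parent.value() * 2)
fut.set_result(21)
print(chained.wait())  # 42
```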
ghstack-source-id: 128296580

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D28178783

fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
2021-05-07 03:59:18 -07:00
Jay Chae
1101a5f6e9 [paramcomms] support for in and out split sizes (#57709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57709

NOTE: initial commit got reverted D28247764

Adding way to accept in and out split sizes.

Test Plan:
{F613245151}
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620153506%2F127.0.0.1%2Flibkineto_activities_1112677.json.gz&bucket=gpu_traces
NOTE: ignore the GPU user showing up in CPU - the issue is fixed in the diff above the stack D28196723 (fc657b547a)

UPDATED: now the sizes are encoded as arrays in .json
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620259313%2F127.0.0.1%2Flibkineto_activities_3944235.json.gz&bucket=gpu_traces

Reviewed By: kingchc

Differential Revision: D28248333

fbshipit-source-id: cee523612667cb37170c94e3c40dab5fba432225
2021-05-06 12:04:34 -07:00
Alexander Golynski
dc06f52480 Add result() to ProcessGroupGloo::AsyncWork's (#57565)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57565

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28255120

Pulled By: agolynski

fbshipit-source-id: 1e904d4fe024d5b99cb642f8689ca32be0581e82
2021-05-06 08:48:48 -07:00
Horace He
ccbbb2d6f8 Revert D28052211: [paramcomms] support for in and out split sizes
Test Plan: revert-hammer

Differential Revision:
D28052211 (866b19e95d)

Original commit changeset: 4ab7d425fc72

fbshipit-source-id: 80c001ddcb3730f0487adddf66d9166f53c45a8c
2021-05-05 21:10:31 -07:00
Jay Chae
866b19e95d [paramcomms] support for in and out split sizes
Summary: Adding way to accept in and out split sizes.

Test Plan:
{F613245151}
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620153506%2F127.0.0.1%2Flibkineto_activities_1112677.json.gz&bucket=gpu_traces
NOTE: ignore the GPU user showing up in CPU - the issue is fixed in the diff above the stack D28196723

UPDATED: now the sizes are encoded as arrays in .json
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620259313%2F127.0.0.1%2Flibkineto_activities_3944235.json.gz&bucket=gpu_traces

Reviewed By: kingchc

Differential Revision: D28052211

fbshipit-source-id: 4ab7d425fc722907d9bbcfad7e364d031ff69b29
2021-05-05 20:46:11 -07:00
Rohan Varma
7115a4b870 Clang format ProcessGroupNCCL.cpp (#56840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56840

Per comments in https://github.com/pytorch/pytorch/pull/56427/files
ghstack-source-id: 128142665

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27980768

fbshipit-source-id: 0158ae1cfd892ff3385ffa0084dd7ef9de014f8c
2021-05-05 10:17:09 -07:00
Rohan Varma
a948e279ac [c10d] Profiler support for nccl p2p collectives (#56427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56427

This PR enables support for nccl send/recv profiling similar to how we have it for MPI and Gloo.

The process to do so is similar to the NCCL collectives where we create the `recordingFunction` in `initWork` and then add a callback that runs the profiler end callbacks. Tests are added similar to send/recv tests with gloo/MPI.

We also test with both autograd profiler and torch.profiler.
ghstack-source-id: 128142666

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27866600

fbshipit-source-id: f29d9103e22b22f658632fece0df9ba36911fc62
2021-05-05 10:14:56 -07:00
Rohan Varma
7175d49122 [Dist profiling] Add is_async field (#57253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (https://github.com/pytorch/pytorch/pull/56963 tried to do so as well but this is a better approach).
ghstack-source-id: 128021158

Test Plan: CI

Reviewed By: walterddr, ilia-cher

Differential Revision: D28086719

fbshipit-source-id: 4473db4aed939a71fbe9db5d6655f3008347cb29
2021-05-04 17:44:28 -07:00
Alexander Golynski
2b6c09c11e Add futures to ProcessGroupMPI work (but not including Send/Recv) and python DDP comm hook testing (#57214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57214

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28200791

Pulled By: agolynski

fbshipit-source-id: 83f814abd4f2eea70e383ed373b04aae8291be55
2021-05-04 16:04:45 -07:00
Rohan Varma
375c8a81dc [DDP] Profile search_unused_parameters (#57376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57376

Having this in profiler/trace outputs will be useful when
investigating performance overhead of find_unused_parameters for certain
workloads, to determine whether it is a bottleneck or not.
ghstack-source-id: 127942159

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28126233

fbshipit-source-id: 93082ae5b84e64351d59447a29f97eaf9b0bbd64
2021-05-03 09:41:18 -07:00
Alexander Golynski
f332a8bdff Implement result() function in MPI Work classes (#57168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57168

Implement result() for MPI, which wasn't previously supported.

Some users rely on output args; however, in future use cases (e.g. the DDP comm hook) we need to return the result explicitly.
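
A small sketch of the difference, assuming an initialized process group; `result()` returns the output tensors explicitly instead of relying only on the mutated input:

```
import torch
import torch.distributed as dist

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait()
outputs = work.result()  # explicit outputs, e.g. for building a Future-based comm hook
```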

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28129125

Pulled By: agolynski

fbshipit-source-id: d6abcd2114163471c045043534a0a3377f2579b4
2021-05-03 07:12:46 -07:00
Brad Fish
e68c46bb3a Propagate information on torch_shm_manager execl failure to parent process (#57310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57310

If we fail to exec `torch_shm_manager`, write an appropriate error message to stdout so that the parent process can have some context on the failure.

Reviewed By: ejguan

Differential Revision: D28047917

fbshipit-source-id: 68bf357df7a6b318c036f4f62cbb428a62cb139e
2021-04-30 11:11:09 -07:00
Brad Fish
2c2aa9e030 Address temp file/bind race condition in torch_shm_manager (#57309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57309

Addressing a race condition that can occur in `torch_shm_manager` between the time its temporary file is unlinked and when it `bind()`s the manager server socket to that same name. In that time window, other threads/processes can re-create another temporary file with the same name, causing `bind()` to fail with `EADDRINUSE`.

This diff introduces `c10::TempDir` and associated helper functions that mirror those of `c10::TempFile` and generates the manager socket name using a combination of a temporary directory, which will be valid for the lifetime of `torch_shm_manager`, and a well-known file name within that directory that will never be used outside of `bind()`.

Reviewed By: ejguan

Differential Revision: D28047914

fbshipit-source-id: 148d54818add44159881d3afc2ffb31bd73bcabf
2021-04-30 11:11:07 -07:00
Brad Fish
7eed5410cd Make c10::TempFile non-copyable but movable (#57308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57308

This diff makes `c10::TempFile` non-copyable but movable. `torch_shm_manager` was previously dependent upon some hidden behavior that was a result of copying `TempFile`s, which is also being made more explicit now that they can be moved but not copied.

Context:

`c10::TempFile` is currently copyable, which leads to surprising behavior. A seemingly valid `TempFile` may in fact be invalid if the original it was copied from has already been destroyed, resulting in the file descriptor being closed and the filename being unlinked without the user knowing about it.

**In fact, both `c10::try_make_tempfile` and `c10::make_tempfile` cause copies of `TempFile` to be made**, which can easily be verified by explicitly deleting the copy constructor of `TempFile` and attempting to compile. This means that in practice, users of these functions are getting temporary files that have already been closed and unlinked.

This copying of `TempFile` is particularly interesting in the case of `torch_shm_manager`, which uses `try_make_tempfile` to generate the name of a Unix domain socket to communicate with clients. In order for `bind()` on the socket name to be successful, a file with that same name must not be linked in the filesystem, or `EADDRINUSE` will result. Happily, because `try_make_tempfile` previously created a copy of the `TempFile` while destroying the original, `torch_shm_manager` did not encounter this. With this change, however, `torch_shm_manager` must now explicitly destroy the `TempFile` before attempting to `bind()`. Unfortunately, this exposes a race condition: **other code can re-generate the same-named temporary file after the one created by `torch_shm_manager` is explicitly unlinked but before `torch_shm_manager` binds it to the server socket.** To be clear: this race condition already existed before this diff, but this makes things more explicit. The real fix will be in a follow-up change.

Reviewed By: ejguan

Differential Revision: D28047915

fbshipit-source-id: e8a1b6bb50419fe65620cfecdb67c566a4cf9056
2021-04-30 11:11:06 -07:00
Brad Fish
788aefd7cc Propagate information on torch_shm_manager failures to parent process (#57307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57307

Extend the `"ERROR"` message that `torch_shm_manager` writes to the pipe when it encounters a fatal error with some extra context (specifically, the `what()` on a caught `std::exception`), allowing the parent process to gain some insight into the cause of the failure.

Also, simply return from `main()` with an error exit code when a fatal exception is caught rather than re-throwing, because re-throwing leads to premature process termination that may prevent standard output from being flushed (and therefore the parent process from being able to read the error context from the pipe).

Reviewed By: ejguan

Differential Revision: D28047916

fbshipit-source-id: d423ee8ed1b2bf7831db877e8f8515ec6d6aa169
2021-04-30 11:09:47 -07:00
Yanli Zhao
3f81912885 static graph api skeleton (#54995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995

Provide a private DDP API to explicitly set that the training graph is static, and also set this flag in the logger.
ghstack-source-id: 127755713

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27444965

fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
2021-04-30 11:07:26 -07:00
Yanli Zhao
5f2b9b1df9 refactor autograd_hook (#54981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54981

Move parts of the code in autograd_hook into functions, so that they can be reused in static graph training later on.
ghstack-source-id: 127755405

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27439508

fbshipit-source-id: a02a4b029841f5e7f11cfc5496bb7972ef53d878
2021-04-30 11:06:04 -07:00
davidriazati@fb.com
c44cbc63cc Ignore more compiler warnings, unify WERROR options (#56630)
Summary:
This adds some more compiler warning ignores for everything that happens on a standard CPU build (CUDA builds still have a bunch of warnings, so we can't turn on `-Werror` everywhere yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56630

Pulled By: driazati

Reviewed By: malfet

Differential Revision: D28005063

fbshipit-source-id: 541ed415eb0470ddf7e08c22c5eb6da9db26e9a0
2021-04-29 21:20:29 -07:00
Howard Huang
149000c3f0 Update compare_set docs (#57203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57203

Update the documentation to remove the warning. Refactored the arguments from `old_value` -> `expected_value` and `new_value` -> `desired_value`.
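
For reference, a small sketch of the renamed arguments in use (the store choice is illustrative; the arguments are positional: key, expected_value, desired_value):

```
from torch.distributed import HashStore

store = HashStore()
store.set("key", "current")
# Only swaps in the desired value when the stored value matches the expected one.
store.compare_set("key", "current", "updated")
print(store.get("key"))  # b'updated'
```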

Test Plan: Imported from OSS

Reviewed By: gchanan, cbalioglu

Differential Revision: D28076556

Pulled By: H-Huang

fbshipit-source-id: 5fcc5bcfff89cad51d8dc0b74a234964f1af20ed
2021-04-29 13:58:57 -07:00
Howard Huang
95f393f212 Add compare_set to trampoline class, add typing and formatting (#57191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57191

Changed Store::compareSet() to a pure virtual function and added compareSet definition to PythonStore. Rest of changes are from clang-format.

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D28076557

Pulled By: H-Huang

fbshipit-source-id: 379636cf8b031088341a032250ba410d84ccf692
2021-04-29 13:29:11 -07:00
Howard Huang
ee71584236 Update compare_set implementation for FileStore and HashStore (#57175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57175

Update other Store implementations to add the value when current value is empty to match the amendment made to TCPStore (#55636). Added test to cover this case.

Test:
`pytest -vs test/distributed/test_c10d_common.py -k compare_set`

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D28069380

Pulled By: H-Huang

fbshipit-source-id: eac703edb41faee32a4e7cda61107e2a0e726326
2021-04-29 10:48:11 -07:00
Luca Wehrstedt
311ad5e3af Merge CUDAFuture into ivalue::Future (#57052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052

This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.

This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see it behaves exactly like before. However, we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28036829

fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
2021-04-29 09:31:52 -07:00
Luca Wehrstedt
71c2f88b90 Make CUDAFuture handle any kind of device type (#57051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57051

Make CUDAFuture autodetect the device type from its arguments (which thus change from DeviceIndices to full Devices). This in fact transforms CUDAFuture into an AnythingFuture, since it's not tied to CUDA in any way anymore. Having made it fully device-agnostic, we'll merge it into ivalue::Future in the next PR.
ghstack-source-id: 127713134

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28032711

fbshipit-source-id: 8ba23b1b0d97f61db8693cd5f3c7bae7989a9bcd
2021-04-29 09:31:50 -07:00
Luca Wehrstedt
682476022f Introduce generic MultiStreamGuard (#57049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57049

There was a comment above CUDAMultiStreamGuard which said "TODO: Implement this generically in c10". This is what I'm doing here.

The new generic MultiStreamGuard class takes a vector of device-agnostic c10::Streams and supports any device type (CUDA, but also ROCm and others) by using a VirtualGuardImpl. A class called CUDAMultiStreamGuard is still kept around, for convenience and slightly for performance, as it avoids a vtable lookup.
ghstack-source-id: 127713139

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28029158

fbshipit-source-id: 2f3181371f8cb0d77a3b2e6aa510f1dd74e8f69b
2021-04-29 09:31:47 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Howard Huang
5a10ee71d6 [Reland] TCPStore add watchKey method and new listener thread (#56217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56217

Reland of https://github.com/pytorch/pytorch/pull/54264

Changes:
- Update socket send() to use the MSG_NOSIGNAL flag to prevent SIGPIPE, because the error return value is already captured
- Update watchKey to block until the callback has been registered on the master.
- Fix race condition in testWatchKeyCallback which caused flaky test failures.

Test:
Ran TCPStoreTest 100 times locally with no errors, running [ci-all tests](https://github.com/pytorch/pytorch/pull/56219)

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27824802

Pulled By: H-Huang

fbshipit-source-id: c32230ce726d7d848b9896a63aa52b8eb04a0a2d
2021-04-28 13:46:02 -07:00
Rohan Varma
fe09d54120 [c10d] Add debug level field in ProcessGroup (#56530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56530

For upcoming diffs, ProcessGroup will need to know about the debug level,
e.g. for logging collective operations.
ghstack-source-id: 127535775

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27849839

fbshipit-source-id: a9f016a27d30a242eced19929b3824ae68fe430f
2021-04-28 10:01:21 -07:00
Alexander Golynski
4638bd0f0f Fix ProcessGroupMPITest.cpp Gather, Scatter and SendRecv. Enable ProcessGroupMPITest (#56709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56709

Right now, ProcessGroupMPITest testGather() fails with

 ```
what():  Gather: number of output tensors should be 0 for non-root
[devgpu025:429730] *** Process received signal ***

```

there is a similar issue with testScatter() where number of input/output tensors on source/destination respectively should be 0.

In addition testSendRecv(true); fails with

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  src rank is wrong for recvAnysource

```

since we never populate `srcRanks`

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28001963

Pulled By: agolynski

fbshipit-source-id: c381dfc6f417ee78fbbaf884e567b0485076dfc8
2021-04-28 08:39:08 -07:00
Yanli Zhao
1e77ba36db change ddpLoggingData struct to map or dict (#56641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641

Currently ddpLoggingData is a flat struct, which requires internal DDP developers and external users to know the struct field names. This is not flexible for deleting or adding fields in the future, and it also makes ddpLoggingData hard to access.

With maps/dicts, developers and users can easily access the fields without knowing the field names, and it is easier to add or remove fields.

Since C++ does not support map values of different types, ddpLoggingData currently contains two types of maps.
ghstack-source-id: 127482694

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923723

fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
2021-04-28 06:43:25 -07:00
Yanli Zhao
28a9483e36 fix ddp logging test (#56640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56640

Reset performance stats for the current iteration, and also fix DDP logging verification for sampled iterations.
ghstack-source-id: 127327708

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923414

fbshipit-source-id: aaa1b10f64a0c952ba345c789c864bcef5cf1ab0
2021-04-26 10:12:05 -07:00
Rohan Varma
2d2370bb61 [Dist profiling] Fix ProcessGroupNCCL collective profiling (#55204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55204

Implements a fix discussed offline with pritamdamia87 to run end callbacks after `CUDAFuture`'s wrapCallback has ensured appropriate synchronization. Also enables the relevant distributed profiling tests that were previously disabled for ProcessGroupNCCL.

Note that the profiling infrastructure has moved to primarily encourage the use of torch.profiler and CUPTI to trace CUDA kernels, support for distributed collectives for that will require further discussion with ilia-cher. However, this PR improves the usability of torch.autograd.profiler with respect to distributed collectives.

ghstack-source-id: 127357995

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27491711

fbshipit-source-id: cec7703a4c5d59b5023b0aa8fef4c2e3fb8d37d0
2021-04-25 19:40:19 -07:00
Liang Luo
c37095760d [torch distributed] Implementing all_gather_base (#56315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56315

This diff implements the all_gather_base in pytorch distributed.
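
A hedged usage sketch, assuming the op is exposed to Python as `torch.distributed._all_gather_base` and an NCCL process group is already initialized; the output is a single flat tensor rather than a list of per-rank tensors:

```
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
inp = torch.full((4,), float(dist.get_rank()), device="cuda")
out = torch.empty(world_size * 4, device="cuda")
dist._all_gather_base(out, inp)  # out holds every rank's chunk, concatenated
```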

Test Plan: dist.all_gather_base(output, input)...

Reviewed By: agolynski, amylittleyang

Differential Revision: D27488999

fbshipit-source-id: 937ec8bddf9527fa4d114f984d1d0f6a5b8c3936
2021-04-23 14:16:47 -07:00
Rohan Varma
7ff1990caf [c10d] Increment sequence numbers on collectives. (#55718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55718

Increments sequence numbers when ProcessGroupGloo::enqueue or
ProcessGroupNCCL::collective is run, which is a common call all collectives
make. The next step will be to log these along with other collective info in
debug mode as well as integrating them with the process group wrapper.
ghstack-source-id: 127215077

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27690690

fbshipit-source-id: cb284b7c760763b7c0f814a41f06656fabf806d6
2021-04-23 10:06:56 -07:00
Luca Wehrstedt
58d12eb75e Allow to specify a set of device for CUDAFuture (#56515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56515

In https://github.com/pytorch/pytorch/pull/56405 we finally found a solution to support RPC remote user functions that created/used CUDA tensors on devices that were not used by their arguments, by defining a "bounding set" of devices when constructing the agent and allowing all functions to freely use any of those devices.

We had the same exact problem with the callbacks of CUDAFuture, and in this PR I'm adopting the same exact solution: I allow to specify a set of devices when constructing a CUDAFuture, and then every callback is allowed to use any of those devices. (These devices will also be propagated to child futures).

I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent, until #56405 lands.
ghstack-source-id: 127261552

Test Plan: Added a test for this later in the stack.

Reviewed By: mrshenli

Differential Revision: D27861067

fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09
2021-04-23 08:12:41 -07:00
Pavel Belevich
5cc75e46fa Split test_c10d.py to test_c10d_common.py, test_c10d_gloo.py, test_c10d_nccl.py (#56598)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56598

Test Plan: NA

Reviewed By: SciPioneer

Differential Revision: D27913170

fbshipit-source-id: 3439d18141131b02d55f2ca399a4c795cba2b04b
2021-04-21 22:10:41 -07:00
Wanchao Liang
43ad172c54 make ProcessGroupDefaultTimeout the same as python (#56549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56549

This makes `kProcessGroupDefaultTimeout` the same as on the Python
side, and the Python side now directly uses the pybind value instead.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27899190

Pulled By: wanchaol

fbshipit-source-id: 388a7f42358b0abed75cf4934fb7b311fd33fee6
2021-04-21 17:56:05 -07:00
Wanchao Liang
a970e525fd make ProcessGroup.Options.timeout argument private in python (#56531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531

Per discussions in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API does not confuse users by accepting a timeout both as an
argument and inside ProcessGroup.Options. This PR makes
`ProcessGroup.Options.timeout` a private field that is only used in
our test utils; for both `init_process_group` and `new_group`, we still
allow users to pass `timeout` as a separate argument. Since
`ProcessGroupGloo.Options` only has a `timeout` config, both functions
will not allow passing in options for the GLOO backend.

This way we still preserve the single `timeout` API, and only allow users
to use `ProcessGroupNCCL.Options` when needed.

cc pritamdamania87 rohan-varma

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27893395

Pulled By: wanchaol

fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
2021-04-21 17:55:10 -07:00
Ailing Zhang
27a0d6f1df AutoDispatchBelowAutograd takes no arguments. (#56424)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56424

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27866607

Pulled By: ailzhang

fbshipit-source-id: b82cfb90af5bc7b4129266083fe31f8b335a5b41
2021-04-21 14:44:12 -07:00
Rohan Varma
b7d5a0cf10 [c10d] sequence number in process group (#55319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319

Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debugability.

The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0, and broadcasted to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.

This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
ghstack-source-id: 127011277

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27562769

fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
2021-04-21 10:59:24 -07:00
Ailing Zhang
3d904b56ec s/AutoNonVariableTypeMode/AutoDispatchBelowAutograd/ (#56423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56423

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866606

Pulled By: ailzhang

fbshipit-source-id: e3942356dc3133d1c5722de40ec0d45e6a60f2f1
2021-04-20 17:17:46 -07:00
marksaroufim
48aaea3359 unified GlooStore and c10d store API (#56222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55719

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27785267

Pulled By: msaroufim

fbshipit-source-id: ce247f9226ecc971af8e1f08adeb835f64973e12
2021-04-19 10:57:18 -07:00
Jay Chae
400398006f [PARAM] Param comms debug info (#55976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55976

- Define a concrete `DebugInfo` to collect Param comms.
- Add a macro to easily log `DebugInfo`

Test Plan:
Tested on `ads:simplified_launcher` with `dyno gputrace`
locally tested in libkinetoObserver that it can collect the debug Infobase

Reviewed By: kingchc, ilia-cher

Differential Revision: D26773447

fbshipit-source-id: a8eeede2d6dbf34d7a1b3614843b4a1baba94448
2021-04-15 16:22:01 -07:00
Rohan Varma
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs and passes a mapping of parameter names into Reducer, so that the error messages about unused parameters / not all parameters getting gradients can name the offending parameters.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
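
A sketch of how to turn this on; the mapping only takes effect when the distributed debug level is INFO or DETAIL:

```
import os

# Must be set before the process group / DDP are initialized.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # or "INFO"

# ... dist.init_process_group(...), wrap the model in DDP as usual ...
# If some parameters never receive gradients, the resulting error now lists
# their fully qualified names, e.g. "module_name:param_name".
```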
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
Brian Hirsh
e8faf69739 fix torch.pow type promotion issue (#54085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54085

Fixes https://github.com/pytorch/pytorch/issues/50121.

This fixes two similar issues with the dtype in which `torch.pow` performs its computation. Thanks ngimel for spotting the issues originally (comments [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594624355) and [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594719704))!

Before:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(131072)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([131072], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(131072, device='cuda:0')
```

After:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(0)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([0], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(0, device='cuda:0')
```

In all four cases above, `tensor(0, ...)` is the correct value because the computed "common dtype" among the inputs is expected to be `uint8`. Computing `2 ** 17` in uint8 will then overflow to zero. Finally, we cast the computed output to the output tensor's dtype, which is `int64`.
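
The overflow itself is easy to sanity-check:

```
# 2 ** 17 == 131072; uint8 arithmetic wraps modulo 256, so the result is 0.
assert (2 ** 17) % 256 == 0
```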

There were two separate issues fixed in this PR: one for cpu and one for cuda:
* For CPU, The `pow(Scalar, Tensor)` overload wasn't calling `set_wrapped_number(true)` after wrapping the scalar in a Tensor, which caused the "promoted" scalar to incorrectly participate in type promotion (see the documented behavior [here](aa8714dfed/c10/core/TensorImpl.h (L590)))
* For CUDA, the cuda kernels defined in `PowKernel.cu` were using the output's dtype to run the computation, instead of the common dtype.

As an aside: the CPU and CUDA kernels actually both use `iter.dtype()` instead of `iter.common_dtype()` to run the computation, which I fixed. The reason that only manifested here for CUDA is because TensorIterator has cpu-specific logic to create temporary outputs with the intermediate dtype (shown [here](aa8714dfed/aten/src/ATen/TensorIterator.cpp (L349))). I'm not sure what the end state is there; I can imagine that being something we're more comfortable doing for CPU than for CUDA, but it also leads to hard-to-track-down inconsistencies between the two, as in this case.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27096330

Pulled By: bdhirsh

fbshipit-source-id: a7e2909243851625cb3056d1e7abb2383bfe95f2
2021-04-15 08:55:53 -07:00
Howard Huang
5cab3b9cf6 Revert D27709912: TCPStore add watchKey method and new listener thread
Test Plan: revert-hammer

Differential Revision:
D27709912 (f8f756efb2)

Original commit changeset: 619aa3b2a8eb

fbshipit-source-id: 3ef96ccaa76c702d7e5427dfc263531fb1c274ab
2021-04-15 07:43:48 -07:00
Howard Huang
f8f756efb2 TCPStore add watchKey method and new listener thread (#54264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54264

**Changes**

- Creates a new listener thread on each client to run the callback.
- Creates a new class that both the listener thread and the master thread derive from; it handles shutdown and cleanup of the thread on Windows and Linux.
- Adds a watchKey method and updates any functions that change the key value.

**Background**
This PR adds functionality to TCPStore to allow users to watch a key and execute a callback on key change.

It introduces a new watchKey() API:
`TCPStore::watchKey(const std::string& key, std::function<void(std::string, std::string)> callback)`, which takes a `key` and a `callback(old_key, new_key)` to run on key change. Since the current methods are blocking (for example, in `TCPStore::get()` a worker sends a "get key" request to the master, waits for the response, and then returns the value to the user), we need a non-blocking, asynchronous way to execute the callback whenever a key changes. This is done by creating a new listener thread on each client that the master can communicate with.

Right now the API is C++-only and specific to TCPStore; the internal use case is elastic RPC. We will have an internal key such as `_NumNodes` that all nodes in the elastic RPC group watch. When a node leaves, this key is updated and each node executes a callback to clean up its autograd and RRef contexts.
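
The API itself is C++-only, but the client-side mechanism is essentially a listener thread that dispatches registered callbacks when the master reports a key change. A rough Python analogy (purely illustrative, not TCPStore code):

```
import queue
import threading

class KeyWatcher:
    """Illustrative analogy of the listener-thread mechanism; not TCPStore code."""

    def __init__(self):
        self._callbacks = {}           # key -> callback(old_value, new_value)
        self._events = queue.Queue()   # (key, old_value, new_value) notifications
        self._listener = threading.Thread(target=self._run, daemon=True)
        self._listener.start()

    def watch_key(self, key, callback):
        self._callbacks[key] = callback

    def notify(self, key, old_value, new_value):
        # In TCPStore this notification arrives over the socket from the master.
        self._events.put((key, old_value, new_value))

    def _run(self):
        while True:
            key, old, new = self._events.get()
            cb = self._callbacks.get(key)
            if cb is not None:
                cb(old, new)   # runs asynchronously, off the blocking request path
```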

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27709912

Pulled By: H-Huang

fbshipit-source-id: 619aa3b2a8eb23f4be5f5736efdcca6c175aadf3
2021-04-14 13:23:12 -07:00
Rohan Varma
bbc4c775bb [reland][c10d] monitored_barrier: ensure all ranks pass or none do (#55990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.

Disabled these tests on Windows, similar to how they are disabled on macOS. The reason for disabling them is that they use the libuv transport, which does not have error handling as robust as TCP on Linux. As a result, healthy non-zero ranks don't throw immediately (as they do on Linux) but instead throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758424

fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
2021-04-14 12:26:54 -07:00
Rohan Varma
752f5b1030 [reland][c10d] Log API usage of monitored barrier (#55989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55989

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.
ghstack-source-id: 126477554

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758425

fbshipit-source-id: ebca8b6baf0019879bc4b16639d6cccf27dc6b1c
2021-04-14 12:25:35 -07:00
Rohan Varma
48c73d24b8 Revert D27523060: [c10d] monitored_barrier: ensure all ranks pass or none do
Test Plan: revert-hammer

Differential Revision:
D27523060 (a5290adea5)

Original commit changeset: fa05e4f8ad8a

fbshipit-source-id: aa59c1c3ab0ed5b124583a52aed0f93c3b93a05a
2021-04-13 21:33:09 -07:00
Rohan Varma
c7aa1026a8 Revert D27548433: [c10d] Log API usage of monitored barrier
Test Plan: revert-hammer

Differential Revision:
D27548433 (09231b5db1)

Original commit changeset: 7520ad0948b8

fbshipit-source-id: aa946d8d27472d19c0fe855952ec58d1266ee35a
2021-04-13 21:31:49 -07:00
Rohan Varma
09231b5db1 [c10d] Log API usage of monitored barrier (#55265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55265

Logs API usage of monitored barrier for better tracking and use case
understanding.
ghstack-source-id: 126413087

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27548433

fbshipit-source-id: 7520ad0948b8dc9d44fa3118d5ea953d52f9f1c5
2021-04-13 19:02:52 -07:00
Rohan Varma
a5290adea5 [c10d] monitored_barrier: ensure all ranks pass or none do (#55197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197

Based on initial user feedback, one unexpected difference between the monitored_barrier implementation and barrier is around the "all or nothing" semantics.

In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.

This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier also fail. It does so with the following acknowledgement-style process (see the sketch below):

1) Nonzero ranks call send()
2) Nonzero ranks call recv()
3) Rank 0 calls recv(); if this succeeds, rank 0 has acknowledged rank N as healthy
4) Once all ranks are acknowledged as healthy, rank 0 calls send() to all nonzero ranks to unblock them

Modified unit tests to ensure the all-or-nothing failure behavior.
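
A simplified sketch of that acknowledgement flow in terms of point-to-point send/recv (illustrative only; the real implementation is the C++ monitoredBarrier, and this omits timeouts and error collection):

```
import torch
import torch.distributed as dist

def monitored_barrier_sketch():
    rank, world = dist.get_rank(), dist.get_world_size()
    token = torch.zeros(1)
    if rank == 0:
        # First acknowledge every nonzero rank as healthy...
        for r in range(1, world):
            dist.recv(token, src=r)
        # ...and only then release them, so either all ranks pass or none do.
        for r in range(1, world):
            dist.send(token, dst=r)
    else:
        dist.send(token, dst=0)   # report in to rank 0
        dist.recv(token, src=0)   # block until rank 0 has heard from everyone
```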
ghstack-source-id: 126413088

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27523060

fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
2021-04-13 19:01:25 -07:00
Yi Wang
132f5c1f36 Clang-format ProcessGroupMPI.cpp (#55969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55969

Per title
ghstack-source-id: 126453717

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27752173

fbshipit-source-id: e5069b91d699b9d02b12e5dab5e62007dbcee9f0
2021-04-13 17:11:19 -07:00
Yi Wang
de5e3b5eb0 Fix OSS flaky test_destroy_full_group on MPI backend in pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (#55921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55921

Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.
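
The shape of the fix, sketched in Python for readability (the actual change is in the C++ ProcessGroupMPI setup path; the helper names are illustrative):

```
def create_comm_with_retry(create_fn, barrier_fn, attempts=3):
    # Ensure all ranks reach communicator creation together before trying.
    barrier_fn()
    last_error = None
    for _ in range(attempts):
        try:
            return create_fn()           # e.g. the MPI_Comm_create call
        except RuntimeError as err:      # occasionally fails for unknown reasons
            last_error = err
    raise last_error
```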

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it just creates a subgroup communicator, which mainly involves invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all. Cannot dig further into `MPI_Comm_create`, which is in the MPI codebase.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was found within a few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.

#Closes: https://github.com/pytorch/pytorch/issues/53899
ghstack-source-id: 126414937

Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```

```
#!/bin/bash
for i in {1..100}
do
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```

The CI tests triggered by a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi

Reviewed By: mrshenli

Differential Revision: D27245421

fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27
2021-04-13 15:28:51 -07:00
Rohan Varma
c218ac3bc0 [NCCL] Join work clean up thread before aborting communicators (#55444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55444

Changes ~ProcessGroupNCCL so that we join the work cleanup thread before aborting NCCL communicators. If we abort NCCL communicators first on destruction, outstanding work objects in workMetaList can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling due to the terminated check, but it seems a bit cleaner to just join this thread first.

The main motivation is to reduce log spam: we added logging when an exception is set on WorkNCCL, but this unexpectedly resulted in a lot of false-positive errors being logged even after process group shutdown. An example is below:

I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
With this change, we no longer see these false positive logs.
ghstack-source-id: 126145284

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D27613035

fbshipit-source-id: abf924630128b50e7f66ae41ac83403e7a0aac96
2021-04-13 15:25:22 -07:00
Yanli Zhao
5ffc4e3b0f refactor prepare_for_backward (#54977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54977

Move part of the code in prepare_for_backward into helper functions, so that those functions can be reused later for static graph training and delayed allreduce.
ghstack-source-id: 126366714

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27439195

fbshipit-source-id: 8899eda621260232d774cb145f9c6d683c47e188
2021-04-13 14:25:29 -07:00
Rohan Varma
657b66e87d [NCCL] Log when barrier guesses device to use (#54991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54991

The actual proposed fix is in https://github.com/pytorch/pytorch/pull/53934. In the meantime, it is useful to include this LOG when barrier does not know which devices to use, and to suggest the workaround of passing device_ids into barrier().
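
For reference, the suggested workaround looks roughly like this on the Python side with the NCCL backend (assuming one GPU per process):

```
import torch
import torch.distributed as dist

local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
# Passing device_ids tells the NCCL barrier which device this rank owns,
# so it does not have to guess (and nothing is logged about guessing).
dist.barrier(device_ids=[local_rank])
```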
ghstack-source-id: 126351889

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27444917

fbshipit-source-id: 0f269c5a7732e5be6e51adfca7ef70d04ffd71d3
2021-04-13 11:53:55 -07:00
Can Balioglu
339d3bf394 [2/n] [torch/elastic] Introduce C10dRendezvousBackend. (#55636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636

This diff introduces:

- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys (see the sketch below).
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
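
As a small illustration of the `compare_set()` behavior this enables (assuming the Python binding mirrors the C++ call; the host, port, and key names here are made up):

```
from datetime import timedelta
import torch.distributed as dist

# Stand up a single-node server store: (host, port, world_size, is_master).
store = dist.TCPStore("localhost", 29500, 1, True, timeout=timedelta(seconds=30))

# With the fix, compare_set can also create a key that does not exist yet when
# the expected value is empty; otherwise it only swaps on an exact match.
store.compare_set("rdzv_state", "", "round-0")
print(store.get("rdzv_state"))  # b'round-0'
```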
ghstack-source-id: 126312162

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654492

fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
2021-04-12 22:20:27 -07:00
Yi Wang
3e9cbe5ef7 [SPMD] Remove the code branches only used in SPMD mode from distributed.py (#55353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353

Remove all the code branches that are only executed when `device_ids` contains more than one device.

Some helper functions are also removed:
1.  `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`

The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27552201

fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
2021-04-09 17:27:56 -07:00
Rohan Varma
0e03a2978a [DDP] Call ensure_prior_reduction_finished within lock (#55074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55074

This function accesses member variables that can be modified by
different threads (i.e. autograd engine threads), so call it within lock scope.
ghstack-source-id: 125707513

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27474526

fbshipit-source-id: 8d43faedd6e6eeeb69e21ce3262337ab83d7ba07
2021-04-05 22:16:13 -07:00
Yi Wang
6a2f046504 [SPMD] Restrict DDP communication hooks to SPSD mode (#55253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253

Previously, DDP communication hooks took a tensor list as input. Now they take only a single tensor, in preparation for retiring SPMD and providing only a single model replica to DDP communication hooks.

The next step is limiting the Reducer to a single model replica.
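
With that change, a Python comm hook operates on one flattened tensor per bucket rather than a list; roughly as below (a sketch only, and the GradBucket accessor name may differ at this point in the history):

```
import torch.distributed as dist

def allreduce_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    tensor = bucket.buffer()          # one flattened tensor, no longer a list
    tensor.div_(dist.get_world_size(group))
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])

# ddp_model.register_comm_hook(state=None, hook=allreduce_hook)
```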
ghstack-source-id: 125677637

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27533898

fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae
2021-04-05 16:46:47 -07:00
Rohan Varma
19a0eb4cdb [c10d] Monitored barrier: option to collect all failed ranks (#55010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010

Follow-up change that adds a flag giving monitored barrier the option to collect all failed ranks and then throw, instead of throwing on the first one. This is useful because monitored barrier can now report all hanging ranks instead of just one.

This is done by passing in a flag `wait_all_ranks=True`.
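
From the Python side this looks roughly like the following (assuming the flag is surfaced on `monitored_barrier` as described):

```
from datetime import timedelta
import torch.distributed as dist

# Raise only after collecting every rank that failed to reach the barrier,
# instead of throwing on the first missing acknowledgement.
dist.monitored_barrier(timeout=timedelta(seconds=30), wait_all_ranks=True)
```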
ghstack-source-id: 125699839

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27447787

fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
2021-04-04 21:39:54 -07:00
Rohan Varma
0ec1af4b7e [c10d] Enforce order of waited ranks in monitored barrier. (#55009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55009

Changes monitoredBarrier so that we await acknowledgement from ranks
in a consistent order (from least to greatest). This will reduce confusion
around the order in which ranks are awaited. We are still planning to add support
for awaiting all ranks in follow-up changes.
ghstack-source-id: 125699838

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27405417

fbshipit-source-id: b9a3e72742cbffdd9bf890ab2c94103b768a7b71
2021-04-04 21:38:25 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Yi Wang
322854d2f0 [SPMD] Error out SPMD in C++ Reducer (#55212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55212

Error out SPMD in C++ Reducer.

Added a new test, `test_reducer_no_multi_replicas`, which checks that the Reducer constructor does not allow multiple replicas.

Removed 2 tests relevant to reducer in SPMD mode:
`test_ddp_comm_hook_multiple_replica_check`
`test_forward_backward_multi_replica`

ghstack-source-id: 125602472

Test Plan: waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D27497747

fbshipit-source-id: 17ef1bc4d889cbe8076bcb3d504aed4c1aea1562
2021-04-02 22:59:25 -07:00