Commit Graph

1403 Commits

Author SHA1 Message Date
Yi Wang
459270ac01 [Gradient Compression] Apply division first to avoid overflow (#59522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59522

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
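
A minimal Python sketch of the idea (illustrative only, assuming an already-initialized process group; the real fix lives in the C++ and Python comm hooks):

```
import torch
import torch.distributed as dist

def fp16_compress_allreduce(grad: torch.Tensor) -> torch.Tensor:
    # Divide first: each rank contributes grad / world_size, so the summed
    # FP16 values stay in range, whereas sum-then-divide can overflow.
    compressed = grad.to(torch.float16).div_(dist.get_world_size())
    dist.all_reduce(compressed)
    return compressed.to(grad.dtype)
```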
ghstack-source-id: 130686229

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28922548

fbshipit-source-id: 442bd3cc7a35a8b948f626062fa7ad2e3704c5be
2021-06-07 01:43:10 -07:00
Yi Wang
3137bbeb1a [Reland][DDP] Merge work and future_work in reducer (#59520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59520

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.

Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, this reland updates `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply the division first and hence avoid FP16 overflow.

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130685351

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view

Reviewed By: walterddr

Differential Revision: D28922305

fbshipit-source-id: 6388a96eda7a06f292873afed6d1362096c13e1c
2021-06-06 09:49:08 -07:00
Can Balioglu
1d9c1cc00a [4/n] [c10d] Introduce the multi-tenancy feature in TCPStore (#58331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58331

This PR is the final part of a stack that addresses the GitHub issue #41614; it introduces the multi-tenancy feature to the `TCPStore` class, allowing two server stores to be instantiated with the same host:port pair.
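
A rough sketch of what multi-tenancy permits; the Python keyword name `multi_tenant` used below is an assumption (the C++ option is `multiTenant`), and the host/port values are illustrative:

```
from datetime import timedelta
from torch.distributed import TCPStore

# Two server stores bound to the same host:port pair; without the
# multi-tenancy feature the second construction would fail with
# "address already in use".
opts = dict(multi_tenant=True)  # assumed Python spelling of the C++ multiTenant option
store_a = TCPStore("localhost", 29500, 1, True, timedelta(seconds=30), **opts)
store_b = TCPStore("localhost", 29500, 1, True, timedelta(seconds=30), **opts)
```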
ghstack-source-id: 130676394

Test Plan:
- Run the existing and newly-introduced tests.
- Run several smoke tests including the short code snippet referred in GitHub issue #41614.

Reviewed By: H-Huang

Differential Revision: D28453850

fbshipit-source-id: f9066b164305de0f8c257e9d5736e93fd7e21ec6
2021-06-05 07:50:07 -07:00
Can Balioglu
844a98758a [3/n] [c10d] Revise the implementation of TCPStore (#58330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58330

This PR is part of a stack that addresses the GitHub issue #41614; it introduces a major refactoring of the `TCPStore` class in preparation for the multi-tenancy feature.

- All TCP sockets are wrapped with a new `TCPSocket` RAII type.
- `BackgroundThread` and daemon types are moved from header to cpp file.
- Server, client, and callback sockets are refactored into their own internal types `TCPServer`, `TCPClient` and `TCPCallbackClient`.
- Calls to `tcputil::send*` and `tcputil::recv*` are wrapped in `TCPClient` for easier readability and maintenance purposes.
- Two `TODO` statements are put to reference future improvements. Based on feedback, I will either create separate GitHub issues for them or address them as part of this stack.
ghstack-source-id: 130676392

Test Plan: Run the existing tests since there are no user-facing behavioral changes.

Reviewed By: H-Huang

Differential Revision: D28448981

fbshipit-source-id: 415b21e74b3cd51d673c1d5c349c6a2cb21dd667
2021-06-05 07:50:06 -07:00
Can Balioglu
4ee761c2c5 [2/n] [c10d] Introduce the 'multiTenant' constructor parameter in TCPStore (#58329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329

This PR is part of a stack that addresses the GitHub issue #41614; it introduces:

- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.

- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.

Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` to `multiTenant` results only in a warning output.
ghstack-source-id: 130676389

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: rohan-varma

Differential Revision: D28424978

fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
2021-06-05 07:50:04 -07:00
Can Balioglu
cf408c3743 [1/n] [c10d] Introduce a new TCPStore constructor (#58328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58328

This PR is part of a stack that addresses the GitHub issue #41614; it introduces a new `TCPStore` constructor that takes its optional parameters via a newly introduced `TCPStoreOptions` structure. This gives the API callers the flexibility to specify only the desired options while skipping the rest.

The main motivation behind this change is the introduction of the `multiTenant` constructor option in the second PR of this stack.
ghstack-source-id: 130676384

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: H-Huang

Differential Revision: D28417742

fbshipit-source-id: e6ac2a057f7ad1908581176ee6d2c2554c3c74a9
2021-06-05 07:50:02 -07:00
Rong Rong (AI Infra)
c88a0b55b3 Revert D28677383: [DDP] Merge work and future_work in reducer
Test Plan: revert-hammer

Differential Revision:
D28677383 (f8bebade47)

Original commit changeset: 85e0620378b7

fbshipit-source-id: ef3c65b88c375aa9a6befe2ab004ec37ae7eb587
2021-06-05 07:25:44 -07:00
Yi Wang
f8bebade47 [DDP] Merge work and future_work in reducer (#58937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58937

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method, since it is now a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130673249

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs

Reviewed By: agolynski

Differential Revision: D28677383

fbshipit-source-id: 85e0620378b7e9d837e436e94b9d807631d7d752
2021-06-05 01:18:30 -07:00
Alexander Golynski
1183fa3817 Switch PG::Work to Future in default_comm_hooks.cpp (#59398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59398

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28876182

Pulled By: agolynski

fbshipit-source-id: 9d8f09ffa2f40bb0fb25c626b52678a1597a797e
2021-06-04 15:27:13 -07:00
Liang Luo
77de640f4b [torch distributed] Implementing reduce_scatter_base (#57567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57567

Support flattened reduce_scatter.
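
A hedged usage sketch, assuming the op is exposed to Python as `torch.distributed._reduce_scatter_base` and an NCCL process group is already initialized; the flattened variant takes one contiguous input instead of a list of per-rank tensors:

```
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
chunk = 4
# One flat input of world_size * chunk elements; each rank receives one chunk.
inp = torch.ones(world_size * chunk, device="cuda")
out = torch.empty(chunk, device="cuda")
dist._reduce_scatter_base(out, inp)  # every element of out equals world_size
```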

Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/torch/lib/c10d:ProcessGroupNCCLTest
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:c10d

Reviewed By: zhaojuanmao

Differential Revision: D27876281

fbshipit-source-id: 58e2edfb1baff5cdc083dbaaba9f19502ef0b298
2021-06-03 17:17:53 -07:00
Rohan Varma
332b01e93f [DDP] log usage of torch_distributed_debug (#59351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59351

Logging PT distributed debug level to track usage internally.
ghstack-source-id: 130443122

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28854914

fbshipit-source-id: a8e85ca4a3c9ac2f18d13190e87c0ebc4a8e7ea2
2021-06-03 11:49:23 -07:00
Richard Barnes
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
Rohan Varma
79aeca0b00 [DDP] Log when errors happen (#59281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281

Adds the ability to log when the reducer/DDP encounters an error. We add the fields "has_error" and "error" to indicate that an error has
occurred in this iteration; the other fields (performance stats) are not
guaranteed to be updated.

Errors encountered in python-side DDP will be added in the next diff.
ghstack-source-id: 130412974

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28652717

fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
2021-06-02 19:48:26 -07:00
Rohan Varma
1968efa2dd [c10d] Remove verbose log (#59070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59070

This log is too verbose, especially in the case where we call monitored
barrier before every collective, as we do in ProcessGroupWrapper.
ghstack-source-id: 130052822

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28738189

fbshipit-source-id: f2899537caa4c13508da31134d5dd0f4fd6a1f3a
2021-06-02 13:50:11 -07:00
Michael Suo
b977a3b66d [c10d] Split custom class bindings out of python binding code (#58992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58992

Currently, we define Torchbind custom classes in the same place that we define Python bindings.

This is nice from a code location perspective, but has two downsides:
1. These custom classes are not available in a C++-only build.
2. These break when included in torch::deploy.

Some explanation on the second issue: torch::deploy creates many Python
interpreters, and creates a full copy of all the bindings for each one. This
will run the static initialization code once for each copy of the bindings,
leading to multiple registration of the custom classes (and therefore an
error).

This PR splits out the relevant custom class binding code into its own source
file to be included in libc10d, which can be compiled and statically
initialized a single time and linked against from the c10d python bindings.
ghstack-source-id: 130168942

Test Plan: CI

Reviewed By: wconstab

Differential Revision: D28690832

fbshipit-source-id: 3c5e3fff28abb8bcdb4a952794c07de1ee2ae5a8
2021-05-28 15:35:23 -07:00
Nikita Shulga
0e9a295b41 Refactor GlooDeviceFactory::makeDeviceFor... (#58996)
Summary:
`makeDeviceForHostname` and `makeDeviceForInterface` are almost duplicates, differing only in their default argument values.

Create a generic `makeGlooDevice` anonymous function that takes both the host name and the interface name, and call it from both makeDeviceFor[Hostname|Interface].

Also solve two other minor issues:
 - do not call `getenv("GLOO_DEVICE_TRANSPORT")` at library load time
 - raise an exception rather than crash if GLOO_DEVICE_TRANSPORT is set to an unknown value

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58996

Reviewed By: pbelevich

Differential Revision: D28713324

Pulled By: malfet

fbshipit-source-id: cb33b438078d163e3ec6f047f2e5247b07d94f8d
2021-05-26 20:33:11 -07:00
Rohan Varma
cf395c0718 [c10d] Introduce ProcessGroupWrapper (#58224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224

Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.

Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (e.g. allreduce on one rank and allgather on another).
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shape verification for allgather/allreduce_coalesced is omitted because those ops legitimately accept tensors of different shapes and don't error out.

This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.

Once all of this passes we simply dispatch the collective to the underlying pg.
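
A conceptual Python sketch of the fingerprint check described above (not the actual C++ implementation; the fingerprint contents here are simplified to a fixed-size tensor):

```
import torch
import torch.distributed as dist

def verify_collective(op_id: int, tensor: torch.Tensor, group=None):
    # Fixed-size fingerprint: the collective type plus the tensor element count.
    fingerprint = torch.tensor([op_id, tensor.numel()], dtype=torch.long)
    gathered = [torch.empty_like(fingerprint)
                for _ in range(dist.get_world_size(group))]
    dist.all_gather(gathered, fingerprint, group=group)
    for rank, other in enumerate(gathered):
        if not torch.equal(other, fingerprint):
            raise RuntimeError(
                f"Detected mismatched collective on rank {rank}: "
                f"{other.tolist()} vs {fingerprint.tolist()}")
```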

Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28023981

fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
2021-05-24 20:09:51 -07:00
Rohan Varma
76ce925257 [c10d] Fix monitored_barrier with wait_all_ranks (#58702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58702

Off-by-one error when determining whether some ranks failed with
`wait_all_ranks=True`. This wasn't caught by tests because the tests only
covered failure scenarios, not success scenarios with `wait_all_ranks=True`.
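
For context, the call whose success path was affected (Gloo backend; assumes an initialized process group):

```
import torch.distributed as dist

# With wait_all_ranks=True, rank 0 waits for every rank and reports all failed
# ranks at once instead of failing fast on the first missing rank.
dist.monitored_barrier(wait_all_ranks=True)
```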
ghstack-source-id: 129559840

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28583235

fbshipit-source-id: a8f376efb13a3f36c788667acab86543c80aff59
2021-05-21 09:40:50 -07:00
Rohan Varma
b301558410 [Reducer] Remove replica size == 1 checks (#58603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58603

No longer need these checks
ghstack-source-id: 129498227

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28549893

fbshipit-source-id: a89bf8c3fc3aba311a70fd37e5a6aa5dc14b41b9
2021-05-20 22:34:23 -07:00
Rohan Varma
88c76b43fb [Reducer] move comment to the right place (#58594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58594

This comment was misplaced after some changes, move it to the right
place.
ghstack-source-id: 129498228

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28548100

fbshipit-source-id: a9163fc3b25a9d9b8b6d4bfa2a77af290108fc09
2021-05-20 22:34:17 -07:00
Rohan Varma
d83c5a5c7f Format reducer.cpp, hpp (#58593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58593

Per title
ghstack-source-id: 129498230

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28528465

fbshipit-source-id: 89e4bfcb4a0275dc17090a934d4c0a41a3c54046
2021-05-20 22:32:30 -07:00
Rohan Varma
62adf9e1c9 [Reducer] Completely remove VariableIndex (#58592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58592

Completely removes VariableIndex from the reducer code, as it is no longer
needed. replica_index is always 0, so simplify the code to use only the
parameter index. Next, we should also remove all of the nested data structures
that were needed when num_replicas > 1 was possible.
ghstack-source-id: 129498226

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28528440

fbshipit-source-id: e0568399264ab4f86de3b7a379a4f0831f8f42e9
2021-05-20 19:47:50 -07:00
Rohan Varma
faa7d3793d [DDP] Support not all outputs used in loss calculation (#57081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57081

Changes in this diff:

1. Enable the passthrough autograd function when find_unused_parameters=True.
2. With the above, move prepare_for_backward, which performs the unused-parameter checking logic, to the beginning of the backward pass, only when find_unused_parameters=True.
3. Enhance the unused-parameter checking to account for outputs not being used in the loss.

The way (3) is implemented is by triggering the autograd hook corresponding to parameters that did not participate in the loss computation. Since they did not participate, the autograd hook is triggered with a gradient of None, and the reducer handles this appropriately to ensure that the gradient is not touched.

Tested by ensuring that when a model output is not used in the loss, the corresponding grad is not modified. Also verified that the grads are the same in the local vs. DDP training case, and that gradients are not touched in this case, i.e. if a grad is originally None, it stays None (not zero) afterwards.

Note that in this diff we are not enabling the passthrough autograd function for the regular find_unused_parameters=False case, because that has a much bigger blast radius and needs additional careful analysis, especially with regard to performance.
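
The use case this enables, sketched in Python (the model and shapes are illustrative; assumes a process group is already initialized):

```
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TwoHeadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(10, 10)
        self.b = nn.Linear(10, 10)

    def forward(self, x):
        return self.a(x), self.b(x)

model = DDP(TwoHeadModel(), find_unused_parameters=True)
out_a, out_b = model(torch.randn(8, 10))
loss = out_a.sum()  # out_b never participates in the loss
loss.backward()     # grads of self.b stay untouched (None remains None)
```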
ghstack-source-id: 129425139

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28048628

fbshipit-source-id: 71d7b6af8626804710017a4edd753787aa9bba61
2021-05-20 08:34:33 -07:00
Ching-Hsiang Chu
b9b8522e00 [profile] fix recorded data type (#58531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58531

Fix the data type of alltoall(v) when recording communication metadata via DebugInfo in the NCCL PG.

Reviewed By: chaekit

Differential Revision: D28529372

fbshipit-source-id: 2917653f73f5fe4f6dc901803235994ca042bba2
2021-05-19 14:14:54 -07:00
Rohan Varma
1ba05efd26 [Reducer] Remove some unused variables (#58524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58524

Per title
ghstack-source-id: 129311600

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28528223

fbshipit-source-id: 239a15de4b602e35ed9b15b8a4bea3c28b61de12
2021-05-19 09:55:04 -07:00
Yanli Zhao
ea0f7c4720 move unused parameters to end of bucket orders when rebuild buckets for static graph (#58097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58097

move unused parameters to end of bucket orders when rebuild buckets for static graph

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D28366689

fbshipit-source-id: fbd224aeb761d5aa3bab35a00d64974eb4455b2e
2021-05-18 16:36:40 -07:00
zhouzhuojie
eab59bae15 Fix cmake_minimum_require in libshm (#58306)
Summary:
Deprecation warning reported by cmake:

```
CMake Deprecation Warning at CMakeLists.txt (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of CMake.
  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.
```

This is the only place that requires bumping the minimum version. There are two others, but they are only in the `third_party` folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58306

Reviewed By: bdhirsh

Differential Revision: D28446097

Pulled By: zhouzhuojie

fbshipit-source-id: af5ef50e61bd57dc36089ebe62db70ba0081864c
2021-05-17 09:55:07 -07:00
Yi Wang
581bf01074 [Gradient Compression] Remove unnecessary warning on the rst file and the check on C++ version (#58170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58170

Now that comm hooks are supported on the MPI and Gloo backends besides NCCL, these warnings and this check are no longer needed.
ghstack-source-id: 128799123

Test Plan: N/A

Reviewed By: agolynski

Differential Revision: D28388861

fbshipit-source-id: f56a7b9f42bfae1e904f58cdeccf7ceefcbb0850
2021-05-12 14:15:10 -07:00
Alexander Golynski
4ef94265e9 Add Futures to ProcessGroupGloo (#57818)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57818

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28304171

Pulled By: agolynski

fbshipit-source-id: dbf7f5538890d138582831aa0279ede89619ea1e
2021-05-11 14:47:09 -07:00
Erjia Guan
d49f6d556b [DataLoader] Fix tempfile binding and removing for torch_shm_manager (#57566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57566

Fix the problem that the `tempfile` was never deleted, even after `torch_shm_manager` was destroyed.
- The previous implementation used the wrong path length for the Linux socket. It caused us to lose the last character of the `tempfile` name when binding the pathname to the socket, so in the end we could not delete this file due to the unexpected file name.
- After solving the race condition by introducing a temporary directory, this becomes more dangerous, since it prevents `torch_shm_manager` from deleting the directory as long as the tempfile persists in the temporary directory.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28202866

Pulled By: ejguan

fbshipit-source-id: 912cfd8fec0cc309d47df223b2b0faa599c60799
2021-05-11 14:14:58 -07:00
Yanli Zhao
ea421fb249 enable static graph training in DDP (#55248)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248

This PR enables static graph training when users call _set_static_graph(). This can help support more use cases in DDP without performance regression, and can potentially improve performance when there are unused parameters in the graph.
1. The first iteration records graph states such as how many times a grad is calculated and whether the grad is used or not; the first iteration then queues a delay_all_reduce callback to all-reduce the grads.
2. Since an autograd callback is associated with the current target graph task, the delay_all_reduce callback should be associated with the outermost backward graph task. A DDP sink layer is added in the DDP forward loop so that we can queue the delay_all_reduce callback in the sink layer.
3. After the first iteration, DDP uses the saved graph states to determine whether a grad is used and whether a grad is ready for communication.
4. Bucket rebuilding happens in the second iteration, after graph states are recorded in the first iteration.
5. If the graph states change, DDP will throw errors.
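
A minimal usage sketch of the private API mentioned above (the model, data loader, loss function, and optimizer are placeholders; assumes an initialized process group):

```
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(my_model)      # my_model: an nn.Module whose graph does not change
model._set_static_graph()  # tell DDP the autograd graph is static across iterations

for inputs, target in loader:
    loss = loss_fn(model(inputs), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```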
ghstack-source-id: 128599464

Test Plan: unit tests. adding more tests

Reviewed By: rohan-varma

Differential Revision: D27539964

fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
2021-05-11 10:23:25 -07:00
Rohan Varma
5840c8cfd8 [nccl] log rank when communicator is aborted (#57974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57974

We see this error quite a bit in internal workflows, so it would be useful
to have this additional logging information here.
ghstack-source-id: 128602199

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28331693

fbshipit-source-id: 25398c6a3420a2b594d79aa8f46936cd0addd426
2021-05-10 21:23:31 -07:00
Alexander Golynski
db412a6885 Avoid 2 extra copies when reducing sparse tensors and fix result() vs inplace output discrepancy (#57822)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57822

* `AsyncSparseAllreduceWork` can avoid copying output tensors, since we keep all the results alive by modifying the input vector directly.
* `AsyncSparseAllreduceWork` now returns the inputs back to the user instead of the former behavior where it returned copies of the inputs. This is consistent with other operations and process group implementations.
* `AsyncSparseAllreduceCUDAWork` now copies tensors directly from the CPU to the input tensors, avoiding the extra copies `output` -> `outputs` -> `inputs`. The inputs are returned back to the user. This is consistent with other operations and process group implementations.

Overall, AsyncSparseAllreduceCUDAWork now avoids 2 extra copies (as AsyncSparseAllreduceCUDAWork uses AsyncSparseAllreduceWork's impl).

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28298325

Pulled By: agolynski

fbshipit-source-id: 18e2104413cdf5e73a01aad464e2613807779297
2021-05-07 15:12:58 -07:00
Pavel Belevich
96e1a83fb2 Add Gloo TCP_TLS transport (#56442)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56442

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D27896285

Pulled By: pbelevich

fbshipit-source-id: 589af59ca4c7c9bab2329f079382c09b71cfcf9e
2021-05-07 13:36:11 -07:00
Luca Wehrstedt
36e47af58b Pass reference to parent future in callbacks (#57635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635

Note: this PR looks massive, but it's just one simple change, codemodded many times.

In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (leads to reference cycle and thus memory leak) and must be done carefully (spoiler: sometimes we weren't).
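
For reference, the Python behavior the C++ callbacks now mirror, sketched with `torch.futures`:

```
import torch.futures

fut = torch.futures.Future()
# The callback receives the completed parent future as its argument, so it can
# read the value without capturing `fut` itself and creating a reference cycle.
chained = fut.then(lambda parent: parent.value() * 2)
fut.set_result(21)
print(chained.wait())  # 42
```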
ghstack-source-id: 128296580

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D28178783

fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
2021-05-07 03:59:18 -07:00
Jay Chae
1101a5f6e9 [paramcomms] support for in and out split sizes (#57709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57709

NOTE: initial commit got reverted D28247764

Adding way to accept in and out split sizes.

Test Plan:
{F613245151}
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620153506%2F127.0.0.1%2Flibkineto_activities_1112677.json.gz&bucket=gpu_traces
NOTE: ignore the GPU user showing up in CPU - the issue is fixed in the diff above the stack D28196723 (fc657b547a)

UPDATED: now the sizes are encoded as arrays in .json
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620259313%2F127.0.0.1%2Flibkineto_activities_3944235.json.gz&bucket=gpu_traces

Reviewed By: kingchc

Differential Revision: D28248333

fbshipit-source-id: cee523612667cb37170c94e3c40dab5fba432225
2021-05-06 12:04:34 -07:00
Alexander Golynski
dc06f52480 Add result() to ProcessGroupGloo::AsyncWork's (#57565)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57565

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28255120

Pulled By: agolynski

fbshipit-source-id: 1e904d4fe024d5b99cb642f8689ca32be0581e82
2021-05-06 08:48:48 -07:00
Horace He
ccbbb2d6f8 Revert D28052211: [paramcomms] support for in and out split sizes
Test Plan: revert-hammer

Differential Revision:
D28052211 (866b19e95d)

Original commit changeset: 4ab7d425fc72

fbshipit-source-id: 80c001ddcb3730f0487adddf66d9166f53c45a8c
2021-05-05 21:10:31 -07:00
Jay Chae
866b19e95d [paramcomms] support for in and out split sizes
Summary: Adding way to accept in and out split sizes.

Test Plan:
{F613245151}
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620153506%2F127.0.0.1%2Flibkineto_activities_1112677.json.gz&bucket=gpu_traces
NOTE: ignore the GPU user showing up in CPU - the issue is fixed in the diff above the stack D28196723

UPDATED: now the sizes are encoded as arrays in .json
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1620259313%2F127.0.0.1%2Flibkineto_activities_3944235.json.gz&bucket=gpu_traces

Reviewed By: kingchc

Differential Revision: D28052211

fbshipit-source-id: 4ab7d425fc722907d9bbcfad7e364d031ff69b29
2021-05-05 20:46:11 -07:00
Rohan Varma
7115a4b870 Clang format ProcessGroupNCCL.cpp (#56840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56840

Per comments in https://github.com/pytorch/pytorch/pull/56427/files
ghstack-source-id: 128142665

Test Plan: Ci

Reviewed By: SciPioneer

Differential Revision: D27980768

fbshipit-source-id: 0158ae1cfd892ff3385ffa0084dd7ef9de014f8c
2021-05-05 10:17:09 -07:00
Rohan Varma
a948e279ac [c10d] Profiler support for nccl p2p collectives (#56427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56427

This PR enables support for nccl send/recv profiling similar to how we have it for MPI and Gloo.

The process to do so is similar to the NCCL collectives where we create the `recordingFunction` in `initWork` and then add a callback that runs the profiler end callbacks. Tests are added similar to send/recv tests with gloo/MPI.

We also test with both autograd profiler and torch.profiler.
ghstack-source-id: 128142666

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27866600

fbshipit-source-id: f29d9103e22b22f658632fece0df9ba36911fc62
2021-05-05 10:14:56 -07:00
Rohan Varma
7175d49122 [Dist profiling] Add is_async field (#57253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (https://github.com/pytorch/pytorch/pull/56963 tried to do so as well but this is a better approach).
ghstack-source-id: 128021158

Test Plan: CI

Reviewed By: walterddr, ilia-cher

Differential Revision: D28086719

fbshipit-source-id: 4473db4aed939a71fbe9db5d6655f3008347cb29
2021-05-04 17:44:28 -07:00
Alexander Golynski
2b6c09c11e Add futures to ProcessGroupMPI work (but not including Send/Recv) and python DDP comm hook testing (#57214)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57214

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28200791

Pulled By: agolynski

fbshipit-source-id: 83f814abd4f2eea70e383ed373b04aae8291be55
2021-05-04 16:04:45 -07:00
Rohan Varma
375c8a81dc [DDP] Profile search_unused_parameters (#57376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57376

Having this in profiler/trace outputs will be useful when
investigating performance overhead of find_unused_parameters for certain
workloads, to determine whether it is a bottleneck or not.
ghstack-source-id: 127942159

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28126233

fbshipit-source-id: 93082ae5b84e64351d59447a29f97eaf9b0bbd64
2021-05-03 09:41:18 -07:00
Alexander Golynski
f332a8bdff Implement result() function in MPI Work classes (#57168)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57168

Implement result() for MPI, which wasn't previously supported.

Some users rely on output args; however, in future use cases (e.g. the DDP comm hook) we need to return the result explicitly.
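
A small sketch of the difference, assuming an initialized process group; `result()` returns the output tensors explicitly instead of relying only on the mutated input:

```
import torch
import torch.distributed as dist

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait()
outputs = work.result()  # explicit outputs, e.g. for building a Future-based comm hook
```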

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28129125

Pulled By: agolynski

fbshipit-source-id: d6abcd2114163471c045043534a0a3377f2579b4
2021-05-03 07:12:46 -07:00
Brad Fish
e68c46bb3a Propagate information on torch_shm_manager execl failure to parent process (#57310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57310

If we fail to exec `torch_shm_manager`, write an appropriate error message to stdout so that the parent process can have some context on the failure.

Reviewed By: ejguan

Differential Revision: D28047917

fbshipit-source-id: 68bf357df7a6b318c036f4f62cbb428a62cb139e
2021-04-30 11:11:09 -07:00
Brad Fish
2c2aa9e030 Address temp file/bind race condition in torch_shm_manager (#57309)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57309

Addressing a race condition that can occur in `torch_shm_manager` between the time its temporary file is unlinked and when it `bind()`s the manager server socket to that same name. In that time window, other threads/processes can re-create another temporary file with the same name, causing `bind()` to fail with `EADDRINUSE`.

This diff introduces `c10::TempDir` and associated helper functions that mirror those of `c10::TempFile` and generates the manager socket name using a combination of a temporary directory, which will be valid for the lifetime of `torch_shm_manager`, and a well-known file name within that directory that will never be used outside of `bind()`.

Reviewed By: ejguan

Differential Revision: D28047914

fbshipit-source-id: 148d54818add44159881d3afc2ffb31bd73bcabf
2021-04-30 11:11:07 -07:00
Brad Fish
7eed5410cd Make c10::TempFile non-copyable but movable (#57308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57308

This diff makes `c10::TempFile` non-copyable but movable. `torch_shm_manager` was previously dependent upon some hidden behavior that was a result of copying `TempFile`s, which is also being made more explicit now that they can be moved but not copied.

Context:

`c10::TempFile` is currently copyable, which leads to surprising behavior. A seemingly valid `TempFile` may in fact be invalid if the original it was copied from has already been destroyed, resulting in the file descriptor being closed and the filename being unlinked without the user knowing about it.

**In fact, both `c10::try_make_tempfile` and `c10::make_tempfile` cause copies of `TempFile` to be made**, which can easily be verified by explicitly deleting the copy constructor of `TempFile` and attempting to compile. This means that in practice, users of these functions are getting temporary files that have already been closed and unlinked.

This copying of `TempFile` is particularly interesting in the case of `torch_shm_manager`, which uses `try_make_tempfile` to generate the name of a Unix domain socket to communicate with clients. In order for `bind()` on the socket name to be successful, a file with that same name must not be linked in the filesystem, or `EADDRINUSE` will result. Happily, because `try_make_tempfile` previously created a copy of the `TempFile` while destroying the original, `torch_shm_manager` did not encounter this. With this change, however, `torch_shm_manager` must now explicitly destroy the `TempFile` before attempting to `bind()`. Unfortunately, this exposes a race condition: **other code can re-generate the same-named temporary file after the one created by `torch_shm_manager` is explicitly unlinked but before `torch_shm_manager` binds it to the server socket.** To be clear: this race condition already existed before this diff, but this makes things more explicit. The real fix will be in a follow-up change.

Reviewed By: ejguan

Differential Revision: D28047915

fbshipit-source-id: e8a1b6bb50419fe65620cfecdb67c566a4cf9056
2021-04-30 11:11:06 -07:00
Brad Fish
788aefd7cc Propagate information on torch_shm_manager failures to parent process (#57307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57307

Extend the `"ERROR"` message that `torch_shm_manager` writes to the pipe when it encounters a fatal error with some extra context (specifically, the `what()` on a caught `std::exception`), allowing the parent process to gain some insight into the cause of the failure.

Also, simply return from `main()` with an error exit code when a fatal exception is caught rather than re-throwing, because re-throwing leads to premature process termination that may prevent standard output from being flushed (and therefore the parent process from being able to read the error context from the pipe).

Reviewed By: ejguan

Differential Revision: D28047916

fbshipit-source-id: d423ee8ed1b2bf7831db877e8f8515ec6d6aa169
2021-04-30 11:09:47 -07:00
Yanli Zhao
3f81912885 static graph api skeleton (#54995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995

Provide a private DDP API to explicitly set that the training graph is static, and also set this flag in the logger.
ghstack-source-id: 127755713

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27444965

fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
2021-04-30 11:07:26 -07:00
Yanli Zhao
5f2b9b1df9 refactor autograd_hook (#54981)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54981

Move parts of the code in autograd_hook into functions, so that they can be reused in static graph training later on.
ghstack-source-id: 127755405

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27439508

fbshipit-source-id: a02a4b029841f5e7f11cfc5496bb7972ef53d878
2021-04-30 11:06:04 -07:00
davidriazati@fb.com
c44cbc63cc Ignore more compiler warnings, unify WERROR options (#56630)
Summary:
This adds some more compiler warning ignores for everything that happens on a standard CPU build (CUDA builds still have a bunch of warnings, so we can't turn on `-Werror` everywhere yet).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56630

Pulled By: driazati

Reviewed By: malfet

Differential Revision: D28005063

fbshipit-source-id: 541ed415eb0470ddf7e08c22c5eb6da9db26e9a0
2021-04-29 21:20:29 -07:00
Howard Huang
149000c3f0 Update compare_set docs (#57203)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57203

Update the documentation to remove the warning. Refactored the arguments from `old_value` -> `expected_value` and `new_value` -> `desired_value`.
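
For reference, a small sketch of the renamed arguments in use (the store choice is illustrative; the arguments are positional: key, expected_value, desired_value):

```
from torch.distributed import HashStore

store = HashStore()
store.set("key", "current")
# Only swaps in the desired value when the stored value matches the expected one.
store.compare_set("key", "current", "updated")
print(store.get("key"))  # b'updated'
```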

Test Plan: Imported from OSS

Reviewed By: gchanan, cbalioglu

Differential Revision: D28076556

Pulled By: H-Huang

fbshipit-source-id: 5fcc5bcfff89cad51d8dc0b74a234964f1af20ed
2021-04-29 13:58:57 -07:00
Howard Huang
95f393f212 Add compare_set to trampoline class, add typing and formatting (#57191)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57191

Changed Store::compareSet() to a pure virtual function and added compareSet definition to PythonStore. Rest of changes are from clang-format.

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D28076557

Pulled By: H-Huang

fbshipit-source-id: 379636cf8b031088341a032250ba410d84ccf692
2021-04-29 13:29:11 -07:00
Howard Huang
ee71584236 Update compare_set implementation for FileStore and HashStore (#57175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57175

Update other Store implementations to add the value when current value is empty to match the amendment made to TCPStore (#55636). Added test to cover this case.

Test:
`pytest -vs test/distributed/test_c10d_common.py -k compare_set`

Test Plan: Imported from OSS

Reviewed By: cbalioglu

Differential Revision: D28069380

Pulled By: H-Huang

fbshipit-source-id: eac703edb41faee32a4e7cda61107e2a0e726326
2021-04-29 10:48:11 -07:00
Luca Wehrstedt
311ad5e3af Merge CUDAFuture into ivalue::Future (#57052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052

This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.

This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see it behaves exactly like before. However, we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28036829

fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
2021-04-29 09:31:52 -07:00
Luca Wehrstedt
71c2f88b90 Make CUDAFuture handle any kind of device type (#57051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57051

Make CUDAFuture autodetect the device type from its arguments (which thus change from DeviceIndices to full Devices). This in fact transforms CUDAFuture into an AnythingFuture, since it's not tied to CUDA in any way anymore. Having made it fully device-agnostic, we'll merge it into ivalue::Future in the next PR.
ghstack-source-id: 127713134

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28032711

fbshipit-source-id: 8ba23b1b0d97f61db8693cd5f3c7bae7989a9bcd
2021-04-29 09:31:50 -07:00
Luca Wehrstedt
682476022f Introduce generic MultiStreamGuard (#57049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57049

There was a comment above CUDAMultiStreamGuard which said "TODO: Implement this generically in c10". This is what I'm doing here.

The new generic MultiStreamGuard class takes a vector of device-agnostic c10::Streams and supports any device type (CUDA, but also ROCm and others) by using a VirtualGuardImpl. A class called CUDAMultiStreamGuard is still kept around, for convenience and slightly for performance, as it avoids a vtable lookup.
ghstack-source-id: 127713139

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28029158

fbshipit-source-id: 2f3181371f8cb0d77a3b2e6aa510f1dd74e8f69b
2021-04-29 09:31:47 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Howard Huang
5a10ee71d6 [Reland] TCPStore add watchKey method and new listener thread (#56217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56217

Reland of https://github.com/pytorch/pytorch/pull/54264

Changes:
- Update socket send() to use the MSG_NOSIGNAL flag to prevent SIGPIPE, because the error return value is already captured
- Update watchKey to block until the callback has been registered on the master.
- Fix race condition in testWatchKeyCallback which caused flaky test failures.

Test:
Ran TCPStoreTest 100 times locally with no errors, running [ci-all tests](https://github.com/pytorch/pytorch/pull/56219)

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27824802

Pulled By: H-Huang

fbshipit-source-id: c32230ce726d7d848b9896a63aa52b8eb04a0a2d
2021-04-28 13:46:02 -07:00
Rohan Varma
fe09d54120 [c10d] Add debug level field in ProcessGroup (#56530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56530

For upcoming diffs, ProcessGroup will need to know about the debug level,
e.g. for logging collective operations.
ghstack-source-id: 127535775

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27849839

fbshipit-source-id: a9f016a27d30a242eced19929b3824ae68fe430f
2021-04-28 10:01:21 -07:00
Alexander Golynski
4638bd0f0f Fix ProcessGroupMPITest.cpp Gather, Scatter and SendRecv. Enable ProcessGroupMPITest (#56709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56709

Right now, ProcessGroupMPITest testGather() fails with

 ```
what():  Gather: number of output tensors should be 0 for non-root
[devgpu025:429730] *** Process received signal ***

```

there is a similar issue with testScatter() where number of input/output tensors on source/destination respectively should be 0.

In addition testSendRecv(true); fails with

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  src rank is wrong for recvAnysource

```

since we never populate `srcRanks`

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28001963

Pulled By: agolynski

fbshipit-source-id: c381dfc6f417ee78fbbaf884e567b0485076dfc8
2021-04-28 08:39:08 -07:00
Yanli Zhao
1e77ba36db change ddpLoggingData struct to map or dict (#56641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641

Currently ddpLoggingData is a flat struct, which requires internal DDP developers and external users to know the struct field names. This is not flexible for deleting or adding fields in the future, and it also makes ddpLoggingData hard to access.

With maps/dicts, developers and users can easily access the fields without knowing the field names, and it is easier to add or remove fields.

Since C++ does not support map values of different types, ddpLoggingData currently contains two types of maps.
ghstack-source-id: 127482694

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923723

fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
2021-04-28 06:43:25 -07:00
Yanli Zhao
28a9483e36 fix ddp logging test (#56640)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56640

Reset performance stats for the current iteration, and also fix DDP logging verification for sampled iterations.
ghstack-source-id: 127327708

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D27923414

fbshipit-source-id: aaa1b10f64a0c952ba345c789c864bcef5cf1ab0
2021-04-26 10:12:05 -07:00
Rohan Varma
2d2370bb61 [Dist profiling] Fix ProcessGroupNCCL collective profiling (#55204)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55204

Implements a fix discussed offline with pritamdamia87 to run end callbacks after `CUDAFuture`'s wrapCallback has ensured appropriate synchronization. Also enables the relevant distributed profiling tests that were previously disabled for ProcessGroupNCCL.

Note that the profiling infrastructure has moved to primarily encourage the use of torch.profiler and CUPTI to trace CUDA kernels, support for distributed collectives for that will require further discussion with ilia-cher. However, this PR improves the usability of torch.autograd.profiler with respect to distributed collectives.

ghstack-source-id: 127357995

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27491711

fbshipit-source-id: cec7703a4c5d59b5023b0aa8fef4c2e3fb8d37d0
2021-04-25 19:40:19 -07:00
Liang Luo
c37095760d [torch distributed] Implementing all_gather_base (#56315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56315

This diff implements the all_gather_base in pytorch distributed.
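
A hedged usage sketch, assuming the op is exposed to Python as `torch.distributed._all_gather_base` and an NCCL process group is already initialized; the output is a single flat tensor rather than a list of per-rank tensors:

```
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
inp = torch.full((4,), float(dist.get_rank()), device="cuda")
out = torch.empty(world_size * 4, device="cuda")
dist._all_gather_base(out, inp)  # out holds every rank's chunk, concatenated
```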

Test Plan: dist.all_gather_base(output, input)...

Reviewed By: agolynski, amylittleyang

Differential Revision: D27488999

fbshipit-source-id: 937ec8bddf9527fa4d114f984d1d0f6a5b8c3936
2021-04-23 14:16:47 -07:00
Rohan Varma
7ff1990caf [c10d] Increment sequence numbers on collectives. (#55718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55718

Increments sequence numbers when ProcessGroupGloo::enqueue or
ProcessGroupNCCL::collective is run, which is a common call all collectives
make. The next step will be to log these along with other collective info in
debug mode as well as integrating them with the process group wrapper.
ghstack-source-id: 127215077

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27690690

fbshipit-source-id: cb284b7c760763b7c0f814a41f06656fabf806d6
2021-04-23 10:06:56 -07:00
Luca Wehrstedt
58d12eb75e Allow to specify a set of device for CUDAFuture (#56515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56515

In https://github.com/pytorch/pytorch/pull/56405 we finally found a solution to support RPC remote user functions that created/used CUDA tensors on devices that were not used by their arguments, by defining a "bounding set" of devices when constructing the agent and allowing all functions to freely use any of those devices.

We had the same exact problem with the callbacks of CUDAFuture, and in this PR I'm adopting the same exact solution: I allow to specify a set of devices when constructing a CUDAFuture, and then every callback is allowed to use any of those devices. (These devices will also be propagated to child futures).

I'm also making ProcessGroupNCCL pass these devices. I can't yet do it for TensorPipeAgent, until #56405 lands.
ghstack-source-id: 127261552

Test Plan: Added a test for this later in the stack.

Reviewed By: mrshenli

Differential Revision: D27861067

fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09
2021-04-23 08:12:41 -07:00
Pavel Belevich
5cc75e46fa Split test_c10d.py to test_c10d_common.py, test_c10d_gloo.py, test_c10d_nccl.py (#56598)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56598

Test Plan: NA

Reviewed By: SciPioneer

Differential Revision: D27913170

fbshipit-source-id: 3439d18141131b02d55f2ca399a4c795cba2b04b
2021-04-21 22:10:41 -07:00
Wanchao Liang
43ad172c54 make ProcessGroupDefaultTimeout the same as python (#56549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56549

This makes `kProcessGroupDefaultTimeout` the same as on the Python
side, and the Python side now directly uses the pybind value instead.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27899190

Pulled By: wanchaol

fbshipit-source-id: 388a7f42358b0abed75cf4934fb7b311fd33fee6
2021-04-21 17:56:05 -07:00
Wanchao Liang
a970e525fd make ProcessGroup.Options.timeout argument private in python (#56531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56531

Per discussions in
https://github.com/pytorch/pytorch/pull/53663/files#r593409009, we need
to make sure our API does not confuse users by accepting a timeout both as an
argument and inside ProcessGroup.Options. This PR makes
`ProcessGroup.Options.timeout` a private field that is only used in
our test utils; for both `init_process_group` and `new_group`, we still
allow users to pass `timeout` as a separate argument. Since
`ProcessGroupGloo.Options` only has a `timeout` config, both functions
will not allow passing in options for the GLOO backend.

This way we still preserve the single `timeout` API, and only allow users
to use `ProcessGroupNCCL.Options` when needed.

cc pritamdamania87 rohan-varma

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27893395

Pulled By: wanchaol

fbshipit-source-id: cdd29c84648002226ef3d9f9f3ea67b795e64bc5
2021-04-21 17:55:10 -07:00
Ailing Zhang
27a0d6f1df AutoDispatchBelowAutograd takes no arguments. (#56424)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56424

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27866607

Pulled By: ailzhang

fbshipit-source-id: b82cfb90af5bc7b4129266083fe31f8b335a5b41
2021-04-21 14:44:12 -07:00
Rohan Varma
b7d5a0cf10 [c10d] sequence number in process group (#55319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55319

Adds a sequence number class as well as integration with ProcessGroup (nccl and gloo) as part of better debugability.

The main use case is that each ProcessGroup instantiated will have a sequence number initially set by rank 0, and broadcasted to all others. We will increment the number on each collective, thus allowing us to match the numbers appropriately when checking for desynchronization.

This PR just adds the bare-bones integration and verifies sequence numbers are set appropriately at the beginning.
ghstack-source-id: 127011277

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27562769

fbshipit-source-id: d4a4de7529ce07a0c86fcf6beb06f317f359d89b
2021-04-21 10:59:24 -07:00
Ailing Zhang
3d904b56ec s/AutoNonVariableTypeMode/AutoDispatchBelowAutograd/ (#56423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56423

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27866606

Pulled By: ailzhang

fbshipit-source-id: e3942356dc3133d1c5722de40ec0d45e6a60f2f1
2021-04-20 17:17:46 -07:00
marksaroufim
48aaea3359 unified GlooStore and c10d store API (#56222)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55719

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27785267

Pulled By: msaroufim

fbshipit-source-id: ce247f9226ecc971af8e1f08adeb835f64973e12
2021-04-19 10:57:18 -07:00
Jay Chae
400398006f [PARAM] Param comms debug info (#55976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55976

- Define a concrete `DebugInfo` to collect Param comms.
- Add a macro to easily log `DebugInfo`

Test Plan:
Tested on `ads:simplified_launcher` with `dyno gputrace`
locally tested in libkinetoObserver that it can collect the debug Infobase

Reviewed By: kingchc, ilia-cher

Differential Revision: D26773447

fbshipit-source-id: a8eeede2d6dbf34d7a1b3614843b4a1baba94448
2021-04-15 16:22:01 -07:00
Rohan Varma
51e7a371f5 [DDP] Param to name mapping in Reducer (#55075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075

Constructs and passes a mapping of parameter names into Reducer, so that the error messages about unused parameters / not all parameters getting gradients can name the offending parameters.

Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.

Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs param index -> param name mapping. The name is the fully qualified name: f"{module_name}:{param_name}" and passes it into Reducer
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
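
A sketch of how to turn this on; the mapping only takes effect when the distributed debug level is INFO or DETAIL:

```
import os

# Must be set before the process group / DDP are initialized.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # or "INFO"

# ... dist.init_process_group(...), wrap the model in DDP as usual ...
# If some parameters never receive gradients, the resulting error now lists
# their fully qualified names, e.g. "module_name:param_name".
```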
ghstack-source-id: 126581051

Test Plan: CI, UT

Reviewed By: zhaojuanmao

Differential Revision: D27356394

fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
2021-04-15 09:19:50 -07:00
Brian Hirsh
e8faf69739 fix torch.pow type promotion issue (#54085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54085

Fixes https://github.com/pytorch/pytorch/issues/50121.

This fixes two similar issues with the dtype in which `torch.pow` performs its computation. Thanks ngimel for spotting the issues originally (comments [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594624355) and [here](https://github.com/pytorch/pytorch/pull/53669#discussion_r594719704))!

Before:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(131072)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([131072], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(131072, device='cuda:0')
```

After:
```
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8), out=torch.tensor([0]))
tensor([0])
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8), out=torch.tensor(0))
tensor(0)
>>> torch.pow(2, torch.tensor([17], dtype=torch.uint8, device='cuda'), out=torch.tensor([0], device='cuda'))
tensor([0], device='cuda:0')
>>> torch.pow(2, torch.tensor(17, dtype=torch.uint8, device='cuda'), out=torch.tensor(0, device='cuda'))
tensor(0, device='cuda:0')
```

In all four cases above, `tensor(0, ...)` is the correct value because the computed "common dtype" among the inputs is expected to be `uint8`. Computing `2 ** 17` in uint8 will then overflow to zero. Finally, we cast the computed output to the output tensor's dtype, which is `int64`.
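
The overflow itself is easy to sanity-check:

```
# 2 ** 17 == 131072; uint8 arithmetic wraps modulo 256, so the result is 0.
assert (2 ** 17) % 256 == 0
```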

There were two separate issues fixed in this PR: one for cpu and one for cuda:
* For CPU, The `pow(Scalar, Tensor)` overload wasn't calling `set_wrapped_number(true)` after wrapping the scalar in a Tensor, which caused the "promoted" scalar to incorrectly participate in type promotion (see the documented behavior [here](aa8714dfed/c10/core/TensorImpl.h (L590)))
* For CUDA, the cuda kernels defined in `PowKernel.cu` were using the output's dtype to run the computation, instead of the common dtype.

As an aside: the CPU and CUDA kernels actually both use `iter.dtype()` instead of `iter.common_dtype()` to run the computation, which I fixed. The reason that only manifested here for CUDA is because TensorIterator has cpu-specific logic to create temporary outputs with the intermediate dtype (shown [here](aa8714dfed/aten/src/ATen/TensorIterator.cpp (L349))). I'm not sure what the end state is there; I can imagine that being something we're more comfortable doing for CPU than for CUDA, but it also leads to hard-to-track-down inconsistencies between the two, as in this case.

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27096330

Pulled By: bdhirsh

fbshipit-source-id: a7e2909243851625cb3056d1e7abb2383bfe95f2
2021-04-15 08:55:53 -07:00
Howard Huang
5cab3b9cf6 Revert D27709912: TCPStore add watchKey method and new listener thread
Test Plan: revert-hammer

Differential Revision:
D27709912 (f8f756efb2)

Original commit changeset: 619aa3b2a8eb

fbshipit-source-id: 3ef96ccaa76c702d7e5427dfc263531fb1c274ab
2021-04-15 07:43:48 -07:00
Howard Huang
f8f756efb2 TCPStore add watchKey method and new listener thread (#54264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54264

**Changes**

- Creates a new listener thread on each client to run the callback.
- Creates a new class that both the listener thread and the master thread derive from; it handles shutdown and cleanup of the thread on Windows and Linux.
- Adds a watchKey method and updates any functions that change the key value.

**Background**
This PR adds functionality to TCPStore to allow users to watch a key and execute a callback on key change.

It introduces a new watchKey() API:
`TCPStore::watchKey(const std::string& key, std::function<void(std::string, std::string)> callback)`, which takes a `key` and a `callback(old_key, new_key)` to run on key change. Since the current methods are blocking (for example, in `TCPStore::get()` a worker sends a "get key" request to the master, waits for the response, and then returns the value to the user), we need a non-blocking, asynchronous way to execute the callback whenever a key changes. This is done by creating a new listener thread on each client that the master can communicate with.

Right now the API is C++-only and specific to TCPStore; the internal use case is elastic RPC. We will have an internal key such as `_NumNodes` that all nodes in the elastic RPC group watch. When a node leaves, this key is updated and each node executes a callback to clean up its autograd and RRef contexts.
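
The API itself is C++-only, but the client-side mechanism is essentially a listener thread that dispatches registered callbacks when the master reports a key change. A rough Python analogy (purely illustrative, not TCPStore code):

```
import queue
import threading

class KeyWatcher:
    """Illustrative analogy of the listener-thread mechanism; not TCPStore code."""

    def __init__(self):
        self._callbacks = {}           # key -> callback(old_value, new_value)
        self._events = queue.Queue()   # (key, old_value, new_value) notifications
        self._listener = threading.Thread(target=self._run, daemon=True)
        self._listener.start()

    def watch_key(self, key, callback):
        self._callbacks[key] = callback

    def notify(self, key, old_value, new_value):
        # In TCPStore this notification arrives over the socket from the master.
        self._events.put((key, old_value, new_value))

    def _run(self):
        while True:
            key, old, new = self._events.get()
            cb = self._callbacks.get(key)
            if cb is not None:
                cb(old, new)   # runs asynchronously, off the blocking request path
```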

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27709912

Pulled By: H-Huang

fbshipit-source-id: 619aa3b2a8eb23f4be5f5736efdcca6c175aadf3
2021-04-14 13:23:12 -07:00
Rohan Varma
bbc4c775bb [reland][c10d] monitored_barrier: ensure all ranks pass or none do (#55990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55990

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.

Disabled these tests on Windows, similar to how they are disabled on macOS. The reason for disabling them is that they use the libuv transport, which does not have error handling as robust as TCP on Linux. As a result, healthy non-zero ranks don't throw immediately (as they do on Linux) but instead throw on timeout. The error handling still occurs as expected on rank 0 for all platforms.
ghstack-source-id: 126478371

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758424

fbshipit-source-id: d30841c8dda77f51b09a58161e638657ef758e63
2021-04-14 12:26:54 -07:00
Rohan Varma
752f5b1030 [reland][c10d] Log API usage of monitored barrier (#55989)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55989

Reland of https://github.com/pytorch/pytorch/pull/55197, which fails windows test that was only run on master.
ghstack-source-id: 126477554

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27758425

fbshipit-source-id: ebca8b6baf0019879bc4b16639d6cccf27dc6b1c
2021-04-14 12:25:35 -07:00
Rohan Varma
48c73d24b8 Revert D27523060: [c10d] monitored_barrier: ensure all ranks pass or none do
Test Plan: revert-hammer

Differential Revision:
D27523060 (a5290adea5)

Original commit changeset: fa05e4f8ad8a

fbshipit-source-id: aa59c1c3ab0ed5b124583a52aed0f93c3b93a05a
2021-04-13 21:33:09 -07:00
Rohan Varma
c7aa1026a8 Revert D27548433: [c10d] Log API usage of monitored barrier
Test Plan: revert-hammer

Differential Revision:
D27548433 (09231b5db1)

Original commit changeset: 7520ad0948b8

fbshipit-source-id: aa946d8d27472d19c0fe855952ec58d1266ee35a
2021-04-13 21:31:49 -07:00
Rohan Varma
09231b5db1 [c10d] Log API usage of monitored barrier (#55265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55265

Logs API usage of monitored barrier for better tracking and use case
understanding.
ghstack-source-id: 126413087

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27548433

fbshipit-source-id: 7520ad0948b8dc9d44fa3118d5ea953d52f9f1c5
2021-04-13 19:02:52 -07:00
Rohan Varma
a5290adea5 [c10d] monitored_barrier: ensure all ranks pass or none do (#55197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55197

Based on initial user feedback, one unexpected difference between the monitored_barrier implementation and barrier is around the "all or nothing" semantics.

In barrier, all ranks pass or they all fail. With monitored barrier however, if rank 1 is healthy, it will respond to both send and recv from rank 0, but rank 0 can later fail because rank 2 is stuck. In this case, rank 1 will move forward out of the barrier.

This change makes it so that if a rank fails in monitored barrier, all other ranks in monitored barrier also fail. It does so with the following acknowledgement-style process (see the sketch below):

1) Nonzero ranks call send()
2) Nonzero ranks call recv()
3) Rank 0 calls recv(); if this succeeds, rank 0 has acknowledged rank N as healthy
4) Once all ranks are acknowledged as healthy, rank 0 calls send() to all nonzero ranks to unblock them

Modified unit tests to ensure the all-or-nothing failure behavior.
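
A simplified sketch of that acknowledgement flow in terms of point-to-point send/recv (illustrative only; the real implementation is the C++ monitoredBarrier, and this omits timeouts and error collection):

```
import torch
import torch.distributed as dist

def monitored_barrier_sketch():
    rank, world = dist.get_rank(), dist.get_world_size()
    token = torch.zeros(1)
    if rank == 0:
        # First acknowledge every nonzero rank as healthy...
        for r in range(1, world):
            dist.recv(token, src=r)
        # ...and only then release them, so either all ranks pass or none do.
        for r in range(1, world):
            dist.send(token, dst=r)
    else:
        dist.send(token, dst=0)   # report in to rank 0
        dist.recv(token, src=0)   # block until rank 0 has heard from everyone
```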
ghstack-source-id: 126413088

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27523060

fbshipit-source-id: fa05e4f8ad8ae97fd6cb20da5c3a7ef76fd31de6
2021-04-13 19:01:25 -07:00
Yi Wang
132f5c1f36 Clang-format ProcessGroupMPI.cpp (#55969)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55969

Per title
ghstack-source-id: 126453717

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D27752173

fbshipit-source-id: e5069b91d699b9d02b12e5dab5e62007dbcee9f0
2021-04-13 17:11:19 -07:00
Yi Wang
de5e3b5eb0 Fix OSS flaky test_destroy_full_group on MPI backend in pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test environment by adding a barrier and retrying MPI_Comm_create 3 times (#55921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55921

Fix this flaky test by adding a barrier and retrying the flaky function call `MPI_Comm_create` 3 times.
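
The shape of the fix, sketched in Python for readability (the actual change is in the C++ ProcessGroupMPI setup path; the helper names are illustrative):

```
def create_comm_with_retry(create_fn, barrier_fn, attempts=3):
    # Ensure all ranks reach communicator creation together before trying.
    barrier_fn()
    last_error = None
    for _ in range(attempts):
        try:
            return create_fn()           # e.g. the MPI_Comm_create call
        except RuntimeError as err:      # occasionally fails for unknown reasons
            last_error = err
    raise last_error
```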

Couldn't figure out the root cause of why `createProcessGroupMPI` can be flaky when it just creates a subgroup communicator, which mainly involves invoking `MPI_Comm_create`. Here `createProcessGroupMPI` does not involve any p2p or collective communication at all. Cannot dig further into `MPI_Comm_create`, which is in the MPI codebase.

Also checked the commit history; no commit touching `ProcessGroupMPI.cpp` was found within a few days before Mar 10th.

First failure (on Mar 10th):
https://app.circleci.com/pipelines/github/pytorch/pytorch/283704/workflows/d84ac4a0-42e3-4925-b1cf-32d3c3d1022a/jobs/11456129

Note that the test failure cannot be reproduced locally.

Verified the fix on CI:
https://app.circleci.com/pipelines/github/pytorch/pytorch/300586/workflows/a5c16db4-3ae2-44c7-a9c8-b0885dad2a64/jobs/12356852
test_destroy_full_group was rerun 100 times and passed.

#Closes: https://github.com/pytorch/pytorch/issues/53899
ghstack-source-id: 126414937

Test Plan:
```
export BACKEND=mpi
export WORLD_SIZE=2
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py -vs
```

```
#!/bin/bash
for i in {1..100}
do
pytest -k test_destroy_full_group test/distributed/test_distributed_fork.py
done
```

The CI tests triggered by a new branch:
https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=ci-all%2Fwayi_mpi

Reviewed By: mrshenli

Differential Revision: D27245421

fbshipit-source-id: 86e7fe208e34eda8a33885e385d56ec6b60eca27
2021-04-13 15:28:51 -07:00
Rohan Varma
c218ac3bc0 [NCCL] Join work clean up thread before aborting communicators (#55444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55444

Changes ~ProcessGroupNCCL so that we join the work cleanup thread before aborting NCCL communicators. If we abort NCCL communicators first on destruction, outstanding work objects in workMetaList can have exceptions set on them. Right now this doesn't trigger errors in NCCL async error handling due to the terminated check, but it seems a bit cleaner to just join this thread first.

The main motivation is to reduce log spam: we added logging when an exception is set on WorkNCCL, but this unexpectedly resulted in a lot of false-positive errors being logged even after process group shutdown. An example is below:

I0406 18:30:27.361981 1567104 ProcessGroupNCCL.cpp:527] [Rank 1] NCCL watchdog thread terminated normally
I0406 18:30:27.364675 1567105 ProcessGroupNCCL.cpp:265] [Rank 1] found async exception when checking for NCCL errors: NCCL error: unhandled system error, NCCL version 2.7.3
With this change, we no longer see these false positive logs.
ghstack-source-id: 126145284

Test Plan: CI

Reviewed By: osalpekar

Differential Revision: D27613035

fbshipit-source-id: abf924630128b50e7f66ae41ac83403e7a0aac96
2021-04-13 15:25:22 -07:00
Yanli Zhao
5ffc4e3b0f refactor prepare_for_backward (#54977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54977

Move part of the code in prepare_for_backward into helper functions, so that those functions can be reused later for static graph training and delayed allreduce.
ghstack-source-id: 126366714

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D27439195

fbshipit-source-id: 8899eda621260232d774cb145f9c6d683c47e188
2021-04-13 14:25:29 -07:00
Rohan Varma
657b66e87d [NCCL] Log when barrier guesses device to use (#54991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54991

The actual proposed fix is in https://github.com/pytorch/pytorch/pull/53934. In the meantime, it is useful to include this LOG when barrier does not know which devices to use, and to suggest the workaround of passing device_ids into barrier().
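
For reference, the suggested workaround looks roughly like this on the Python side with the NCCL backend (assuming one GPU per process):

```
import torch
import torch.distributed as dist

local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
# Passing device_ids tells the NCCL barrier which device this rank owns,
# so it does not have to guess (and nothing is logged about guessing).
dist.barrier(device_ids=[local_rank])
```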
ghstack-source-id: 126351889

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27444917

fbshipit-source-id: 0f269c5a7732e5be6e51adfca7ef70d04ffd71d3
2021-04-13 11:53:55 -07:00
Can Balioglu
339d3bf394 [2/n] [torch/elastic] Introduce C10dRendezvousBackend. (#55636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636

This diff introduces:

- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys (see the sketch below).
- A placeholder `c10d-experimental` registry to instantiate C10d-backed rendezvous backends via `get_rendezvous_handler()`.
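
As a small illustration of the `compare_set()` behavior this enables (assuming the Python binding mirrors the C++ call; the host, port, and key names here are made up):

```
from datetime import timedelta
import torch.distributed as dist

# Stand up a single-node server store: (host, port, world_size, is_master).
store = dist.TCPStore("localhost", 29500, 1, True, timeout=timedelta(seconds=30))

# With the fix, compare_set can also create a key that does not exist yet when
# the expected value is empty; otherwise it only swaps on an exact match.
store.compare_set("rdzv_state", "", "round-0")
print(store.get("rdzv_state"))  # b'round-0'
```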
ghstack-source-id: 126312162

Test Plan: Run the existing and newly-introduced unit/integration tests.

Reviewed By: tierex

Differential Revision: D27654492

fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
2021-04-12 22:20:27 -07:00
Yi Wang
3e9cbe5ef7 [SPMD] Remove the code branches only used in SPMD mode from distributed.py (#55353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353

Remove all the code branches that are only executed when `device_ids` contains more than one device.

Some helper functions are also removed:
1.  `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`

The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121

Test Plan: waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D27552201

fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
2021-04-09 17:27:56 -07:00
Rohan Varma
0e03a2978a [DDP] Call ensure_prior_reduction_finished within lock (#55074)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55074

This function accesses member variables that can be modified by
different threads (i.e. autograd engine threads), so call it within lock scope.
ghstack-source-id: 125707513

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D27474526

fbshipit-source-id: 8d43faedd6e6eeeb69e21ce3262337ab83d7ba07
2021-04-05 22:16:13 -07:00
Yi Wang
6a2f046504 [SPMD] Restrict DDP communication hooks to SPSD mode (#55253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253

Previously, DDP communication hooks took a tensor list as input. Now they take only a single tensor, in preparation for retiring SPMD and providing only a single model replica to DDP communication hooks.

The next step is limiting the Reducer to a single model replica.
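
With that change, a Python comm hook operates on one flattened tensor per bucket rather than a list; roughly as below (a sketch only, and the GradBucket accessor name may differ at this point in the history):

```
import torch.distributed as dist

def allreduce_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    tensor = bucket.buffer()          # one flattened tensor, no longer a list
    tensor.div_(dist.get_world_size(group))
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])

# ddp_model.register_comm_hook(state=None, hook=allreduce_hook)
```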
ghstack-source-id: 125677637

Test Plan: waitforbuildbot

Reviewed By: zhaojuanmao

Differential Revision: D27533898

fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae
2021-04-05 16:46:47 -07:00
Rohan Varma
19a0eb4cdb [c10d] Monitored barrier: option to collect all failed ranks (#55010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55010

Follow-up change that adds a flag giving monitored barrier the option to collect all failed ranks and then throw, instead of throwing on the first one. This is useful because monitored barrier can now report all hanging ranks instead of just one.

This is done by passing in a flag `wait_all_ranks=True`.
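
From the Python side this looks roughly like the following (assuming the flag is surfaced on `monitored_barrier` as described):

```
from datetime import timedelta
import torch.distributed as dist

# Raise only after collecting every rank that failed to reach the barrier,
# instead of throwing on the first missing acknowledgement.
dist.monitored_barrier(timeout=timedelta(seconds=30), wait_all_ranks=True)
```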
ghstack-source-id: 125699839

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27447787

fbshipit-source-id: ec23aee212060d9eb515ff8adc96c6a17822d1bb
2021-04-04 21:39:54 -07:00
Rohan Varma
0ec1af4b7e [c10d] Enforce order of waited ranks in monitored barrier. (#55009)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55009

Changes monitoredBarrier so that we await acknowledgement from ranks
in a consistent order (from least to greatest). This will reduce confusion
around the order in which ranks are awaited. We are still planning to add support
for awaiting all ranks in follow-up changes.
ghstack-source-id: 125699838

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27405417

fbshipit-source-id: b9a3e72742cbffdd9bf890ab2c94103b768a7b71
2021-04-04 21:38:25 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Yi Wang
322854d2f0 [SPMD] Error out SPMD in C++ Reducer (#55212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55212

Error out SPMD in C++ Reducer.

Added a new test, `test_reducer_no_multi_replicas`, which checks that the Reducer constructor does not allow multiple replicas.

Removed 2 tests relevant to reducer in SPMD mode:
`test_ddp_comm_hook_multiple_replica_check`
`test_forward_backward_multi_replica`

ghstack-source-id: 125602472

Test Plan: waitforbuildbot

Reviewed By: pritamdamania87

Differential Revision: D27497747

fbshipit-source-id: 17ef1bc4d889cbe8076bcb3d504aed4c1aea1562
2021-04-02 22:59:25 -07:00