Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55009
Changes monitoredBarrier so that we await acknowledgement from ranks
in a consistent order (from least to greatest). This reduces confusion
about the order in which ranks are awaited. We still plan to add support
for awaiting all ranks in follow-up changes.
ghstack-source-id: 125699838
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27405417
fbshipit-source-id: b9a3e72742cbffdd9bf890ab2c94103b768a7b71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55212
Error out SPMD in C++ Reducer.
Added a new test `test_reducer_no_multi_replicas`, which checks that multiple replicas are not allowed in the Reducer constructor.
Removed 2 tests relevant to the Reducer in SPMD mode:
`test_ddp_comm_hook_multiple_replica_check`
`test_forward_backward_multi_replica`
ghstack-source-id: 125602472
Test Plan: waitforbuildbot
Reviewed By: pritamdamania87
Differential Revision: D27497747
fbshipit-source-id: 17ef1bc4d889cbe8076bcb3d504aed4c1aea1562
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54919
Log the use of uneven inputs API for better tracking and use case
detection.
ghstack-source-id: 125446499
Test Plan: CI, added ut
Reviewed By: zhaojuanmao, SciPioneer
Differential Revision: D27410764
fbshipit-source-id: abc8055a2e15a3ee087d9959f8881b05a0ea933e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54558
In blocking wait's polling synchronization loop, we frequently call checkAndSetException() as part of isCompleted() to check the status of NCCL operations. It would be useful to log here in case we encounter any exceptions (which are later thrown by `checkAndThrowException`).
Also slightly refactors code previously added to make use of a helper function to get the error message given an `std::exception_ptr`.
ghstack-source-id: 125124314
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D27136202
fbshipit-source-id: 256eb63c5c2a84be909722d3fd7377ad9303fa11
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54557
When looping through the nccl communicator cache checking for errors, enhance the watchdog to log exceptions that are set on the communicator.
This will allow for better debuggability since the NCCL error will be logged when the watchdog receives errors for the communicators and aborts them appropriately.
Tested by forcing an NCCL error with NCCL_BLOCKING_WAIT=1 and verifying that the exception is indeed logged.
ghstack-source-id: 125124310
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27106699
fbshipit-source-id: 1d2bd9f057a3796ce15dd8a4ce34cf6899eee45c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53773
Closes https://github.com/pytorch/pytorch/issues/52876
Implements a barrier by having every rank send/recv to rank 0; rank 0 waits for these requests and, on timeout, throws an exception indicating which rank did not join within the given timeout.
This barrier is only intended for CPU use cases, is built into ProcessGroupGloo, and will be used for debugging synchronization/hang issues.
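As an illustration only (not the actual C++ ProcessGroupGloo code), a minimal Python sketch of the send/recv-to-rank-0 pattern described above, using the public point-to-point API; the helper name is hypothetical, and the timeout bookkeeping and per-rank error reporting live in the real implementation:
```
import torch
import torch.distributed as dist

def debug_barrier():
    # Assumes a gloo process group has already been initialized.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    token = torch.zeros(1)
    if rank == 0:
        # Rank 0 awaits an acknowledgement from every other rank; a rank that
        # never sends is the one stuck before the barrier.
        for src in range(1, world_size):
            dist.recv(token, src=src)
        # Release everyone once all ranks have checked in.
        for dst in range(1, world_size):
            dist.send(token, dst=dst)
    else:
        dist.send(token, dst=0)
        dist.recv(token, src=0)
```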
Test Plan: Added UT
Reviewed By: zhaojuanmao
Differential Revision: D26921357
fbshipit-source-id: 7c16e861b4b8ea2bdd67a36b3de7b1029af7d173
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54764
Marks a few variables in Reducer as const, including replicas_ and
process_group_, as they should not be changed by Reducer during training. This
can help catch issues at compile time and prevent developers from
accidentally changing these variables.
ghstack-source-id: 125040110
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27357132
fbshipit-source-id: 23a0edf754a8e4f9e6440e99860e5549724cb7ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54763
Replaces the deprecated torch::autograd::Variable with at::Tensor.
torch::autograd::Variable is now defined as an alias of at::Tensor, so this should
be a no-op, but it follows the convention of using Tensor instead of Variable.
ghstack-source-id: 125040109
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27356450
fbshipit-source-id: 1a001358d7726a597141ec47803c8213db4814c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54117
https://github.com/pytorch/pytorch/pull/45950 enhanced our NCCL error logging so that we add some basic debug information about what went wrong when erroring out with an NCCL error.
However, that PR only used the added function for `C10D_NCCL_CHECK`, which checks the return values of NCCL calls. In ProcessGroupNCCL we also have `checkForNCCLErrors`, which checks for errors on NCCL communicators, and in case of errors it would be good to have this logging there too.
Also renames the function from `errorMessage` to `getNcclErrorDetailStr`.
ghstack-source-id: 124662592
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27100497
fbshipit-source-id: fec3663ffa3e92bae8391ef4f77054abb4bb9715
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52692
Porting `at::mul` to structured.
One other issue I hit with the port was the fact that there are a bunch of other places around the code base that used to call out to variants of `at::native::mul`, which no longer exists. *Technically*, `at::cpu::mul` does the equivalent thing now, so I patched most call-sites to use that. There were two other places where I did something slightly different (calling `at::cuda::mul` and `at::mul`, respectively), which I called out in the comments.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27029822
Pulled By: bdhirsh
fbshipit-source-id: 6cc80de0dfccec304bf8e16a1823e733bed27bf4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54090
This PR adds an options field to both ProcessGroupGloo and ProcessGroupNCCL so that we
have a constant `options` field even after the ProcessGroup has been
initialized, which gives us the ability to inspect the options used to
construct a specific ProcessGroup. It also uses options inside the various
methods instead of separate fields.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D27093670
Pulled By: wanchaol
fbshipit-source-id: b02d9394290e9be88b21bddb94d4de7993b4a2e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53662
Add a base ProcessGroup::Options so that we can use inheritance and provide
a universal options API in Python.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D26968856
Pulled By: wanchaol
fbshipit-source-id: 858f4b61b27aecb1943959bba68f8c14114f67d8
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53159.
See comments for a description of the race condition. Thanks to ptrblck, xwang233, and especially zasdfgbnm for lots of help isolating the problem and discussing the fix.
PRing for discussion. We can try to concoct a dedicated test for the problem if you want. The ingredients are:
- DDP(..., find_unused_parameters=True)
- Use all the DDP-ed model's params in forward such that the "lazy local used work wait()" path will be taken in backward
- Queue up a lot of asynchronous dummy work just before backward(), so stream work gets pushed far into the future relative to CPU work
Benchmark:
BERT model, with find_unused_parameters=true, latency (sec) per iteration P50: trunk 1.265 sec, this PR 1.263 sec; with a blocking copy added before calling local_used_.fill(i): 1.236 sec
BERT model, with find_unused_parameters=false, latency (sec) per iteration P50: trunk 1.00 sec, this PR 1.026 sec
ResNet50 model: accuracy also matches trunk with find_unused_parameters=true and false
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53160
Reviewed By: albanD
Differential Revision: D26916766
Pulled By: zhaojuanmao
fbshipit-source-id: 3e0ed91b7b5c42e2f2c82e12d4d2940fdc89e023
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53928
HashStoreTest was taking forever to run. It turns out this was because a default timeout is set when creating Store(), and setTimeout for PrefixStore is not actually able to change the timeout of the underlying store.
After removing the default timeout and updating setTimeout, this will save ~10 minutes for all of the gcc_test CI runs.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D27025275
Pulled By: H-Huang
fbshipit-source-id: 650c8c1eb8b166da1d412ed88e765747a2ca2069
Summary:
The TCPStore delete key implementation inadvertently set "moreData" when sending the key, even though it was in fact the last message.
Thank you, PetrochukM, for the reproducing example which was instrumental in developing the fix (and is the blueprint for the test case).
Fixes https://github.com/pytorch/pytorch/issues/53872
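For reference, a small sketch of the Python-level behavior being fixed (host and port are placeholders):
```
import datetime
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, 1, True,
                      timeout=datetime.timedelta(seconds=30))
store.set("key", "value")
# With the fix, the delete request is sent as the final message and the
# server responds instead of waiting for more data.
assert store.delete_key("key")       # True: the key existed and was removed
assert not store.delete_key("key")   # False: the key is already gone
```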
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53886
Reviewed By: jbschlosser
Differential Revision: D27011846
Pulled By: H-Huang
fbshipit-source-id: 5c460d1e4d095a8bc267bf63613b556856ced3e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53860
Fixes [#53840](https://github.com/pytorch/pytorch/issues/53840)
Right now [TCPStore wait([LIST_OF_KEYS_TO_AWAIT])](https://pytorch.org/docs/master/distributed.html#torch.distributed.Store.wait) will hang if any of the keys in [LIST_OF_KEYS_TO_AWAIT] has been previously set. This change ensures that wait() only waits for the keys that have not yet been set.
Before change:
```
# Case 1: HANG
store.set("1", "1")
store.wait(["1", "2"])
store.set("2", "2")
# Case 2: SUCCEED
store.wait(["1", "2"])
store.set("1", "1")
store.set("2", "2")
```
After change:
Both cases work
TODO: working on adding a test for wait()
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D26999929
Pulled By: H-Huang
fbshipit-source-id: 8931749923c98b520366538f785af82ef37cca8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52949
Enables distributed profiling, which we already have for the Gloo and NCCL backends, for the MPI backend as well.
ghstack-source-id: 123610105
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D26591590
fbshipit-source-id: a20ec9d104faa26bc62c727dd01319c3ea230f5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53148
Clang-format the reducer and logger files.
ghstack-source-id: 123453983
Test Plan: unit test
Reviewed By: SciPioneer
Differential Revision: D26764509
fbshipit-source-id: 711efcfd77420f912861cfd20c69e3af5086f4b9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53162
It is possible that there are multiple data types in mixed precision training, so log the data types as a list of data type names.
ghstack-source-id: 123452626
Test Plan: unit test
Reviewed By: SciPioneer
Differential Revision: D26769256
fbshipit-source-id: 8f7d73821e89864fedbbce723f301fe8fbad5685
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53145
Adds a new API to allow users to set the sample rate for runtime stats, and also adds per-iteration latency breakdowns to the DDPLoggingData struct. For example,
if users set the sample rate to 1, they can analyze per-iteration latency changes over time (not averaged).
ghstack-source-id: 123443369
Test Plan: unit test
Reviewed By: SciPioneer
Differential Revision: D26763957
fbshipit-source-id: baff6a09c2a590e6eb91362ca6f47ae8fa6ddb0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52966
Logs the registered comm hook if there is one, otherwise logs
"builtin_allreduce".
ghstack-source-id: 123174803
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D26709388
fbshipit-source-id: 484fdbbd6643ec261b3797bd8d9824b2b6a1a490
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887
This diff changes the way to do model consistency check (i.e. `_verify_replicas_across_processes`) in DDP.
There were a few things that could be improved with the way we verify model across processes in DDP initialization:
1. We should do this check before syncing module states in DDP init; otherwise, with the Gloo backend the sync will throw first, whereas we would like to throw the error corresponding to different models on different ranks. To do this, we move the methods to be standalone C++ functions (not part of the Reducer) and move this check to before synchronizing parameters.
2. Refactor DDP init in the following ways:
- Run the model consistency check before creating the Reducer
- Add helper functions to build the params passed into the Reducer
- Add a helper function to call `_verify_model_across_ranks`
- Move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP
In follow-up changes we will add the ability to detect which rank had an inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for determining which rank(s) had errors).
ghstack-source-id: 123171877
Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks
Reviewed By: zhaojuanmao
Differential Revision: D26565290
fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53098
Remove some low-level methods that are no longer needed, since the `get_per_parameter_tensors` method has been added to the `GradBucket` class.
Avoid unnecessary exposure to the internals before publishing GradBucket APIs.
ghstack-source-id: 122979064
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: osalpekar
Differential Revision: D26784249
fbshipit-source-id: d1b27bb026989c25a5b65be4767cb752afd6f19b
Summary:
Currently, `torch.nn.parallel.DistributedDataParallel(model...)` doesn't deduplicate params shared across `model`'s child Modules before calling Reducer with the param list. This can cause Reducer to register more than one hook on the shared param(s), at which point who knows what happens.
We ran into this in MLPerf BERT, which has at least one param shared across submodules (an embedding weight iirc, not 100% sure). Running with `gradient_as_bucket_view = False` produced different numerics from running with `gradient_as_bucket_view = True` (which I guess is one potential consequence of multiple DDP hooks on a given param; not sure why, I'd have to dig further).
This PR changes DDP to deduplicate shared params (a small diff), and adds some tests (right now just `test_ddp_weight_sharing`, but I'll add more). `test_ddp_weight_sharing` fails with bad numerics on current master (proving the shared param issue is real) and passes with the deduplication diff.
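For context, a hedged sketch of the weight-sharing pattern described above (a toy tied-embedding model; the class name is hypothetical, and a process group plus a local GPU are assumed to be set up already):
```
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TiedModel(nn.Module):
    def __init__(self, vocab=100, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.decoder = nn.Linear(dim, vocab, bias=False)
        # One parameter shared across two child modules.
        self.decoder.weight = self.embed.weight

    def forward(self, x):
        return self.decoder(self.embed(x))

# With deduplication, DDP registers a single gradient hook for the shared
# weight, so gradient_as_bucket_view=True and False agree numerically.
model = DDP(TiedModel().cuda(), device_ids=[0], gradient_as_bucket_view=True)
```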
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51929
Reviewed By: zou3519
Differential Revision: D26625807
Pulled By: zhaojuanmao
fbshipit-source-id: f5f5959fef90dfe2c55812d79fa88b877f22ecc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53098
Remove some low-level methods that are no longer needed, since the `get_per_parameter_tensors` method has been added to the `GradBucket` class.
Avoid unnecessary exposure to the internals before publishing GradBucket APIs.
ghstack-source-id: 122723683
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D26720919
fbshipit-source-id: 46fb6423008792e72d7a1dd68930a31e0724c92c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53010
To determine the boundary between different iterations in a DDP communication hook, the user code currently needs to check `bucket.get_index() == 0`, which involves internal bucketization implementation details and undermines the usability of DDP communication hooks.
Create an API to hide the details and improve the usability before publishing GradBucket APIs.
ghstack-source-id: 122723081
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D26720813
fbshipit-source-id: f4a3147382c1f970534d7f0dee0cd599156c8b8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53102
In `GradBucket` constructor, `offsets`, `lengths`, and `sizes_vec` are optional arguments and could possibly be empty. It will be safe to remove the default values.
ghstack-source-id: 122833603
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D26748199
fbshipit-source-id: 2e3bcd1b732851919a64bbbd20fe85e77a616fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53009
It can be a common operation to apply layer-wise operations over the per-parameter tensors in a DDP communication hook.
Creates a util method in the GradBucket class before publishing the GradBucket APIs.
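A hedged sketch of the kind of layer-wise hook this utility enables; the hook name is hypothetical, the accessor name follows this commit (the GradBucket API of this era), and the stock allreduce hook from `torch.distributed.algorithms.ddp_comm_hooks.default_hooks` handles the actual communication:
```
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def layerwise_clip_hook(process_group, bucket):
    # Apply a layer-wise operation to each per-parameter view of the bucket.
    for grad in bucket.get_per_parameter_tensors():
        grad.clamp_(-1.0, 1.0)
    # Defer to the built-in allreduce hook for the communication itself.
    return default_hooks.allreduce_hook(process_group, bucket)

# ddp_model.register_comm_hook(state=None, hook=layerwise_clip_hook)
```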
ghstack-source-id: 122833594
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
f254364097
Reviewed By: rohan-varma
Differential Revision: D26717893
fbshipit-source-id: 916db319de8b85dd22bc4e35db5671bf4e34740f
Summary:
This PR fixes a resource leakage bug in the constructor of `TCPStore` where an exception thrown in `TCPStoreDaemon` or `tcputil::connect()` can leave the server socket dangling. The ideal long-term solution would be to have a RAII wrapper for TCP sockets returned by `tcputil`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52860
Reviewed By: osalpekar
Differential Revision: D26671775
Pulled By: cbalioglu
fbshipit-source-id: ccebbd7533ac601a4b80e6e759f2fb4fe01c70fa
Summary:
This PR introduces the `timeout` accessor to `Store` and `host`, `port` accessors to `TCPStore` to help testing and troubleshooting higher level APIs.
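A small usage sketch of the new accessors (property spellings follow the description above; host and port are placeholders):
```
import datetime
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, 1, True,
                      timeout=datetime.timedelta(seconds=30))
print(store.host, store.port)  # "127.0.0.1" 29500
print(store.timeout)           # 0:00:30
```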
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52784
Reviewed By: anjali411
Differential Revision: D26648202
Pulled By: cbalioglu
fbshipit-source-id: 9cf23bf998ed330d648dfec2a93e1bbb50817292
Summary:
- Fixes the ordering of the value parameters of TCPStore's `compare_set()` in the pybind11 interop layer. The C++ API expects (old, new) while we are passing (new, old) in Python (see the usage sketch after this list).
- Fixes the implementation of TCPStore's `compareSetHandler()` for cases where the key already exists in the store.
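A hedged sketch of the (expected_value, desired_value) ordering after the fix (host and port are placeholders; store values come back as bytes):
```
import datetime
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, 1, True,
                      timeout=datetime.timedelta(seconds=30))
store.set("key", "old")
# The swap succeeds only because the stored value matches the expected "old".
result = store.compare_set("key", "old", "new")
print(result)            # value now associated with "key" (b"new" on success)
print(store.get("key"))  # b"new"
```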
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52696
Test Plan: `python test/distributed/test_c10d.py`
Reviewed By: malfet, H-Huang
Differential Revision: D26616976
Pulled By: cbalioglu
fbshipit-source-id: e6a70542e837be04697b5850947924edd896dbf6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52391
There are 2 ways DDP can throw the exception refactored here -
1) Unused params in the forward pass. We provide `find_unused_parameters=True` for this.
2) All params used in fwd pass, but not all outputs used in loss computation. There are a few workarounds for this but we do not provide native support.
Previously, these 2 issues were combined into 1 error message but that has historically resulted in confusion, with users reporting getting this error even when they enable `find_unused_parameters=True` (which they expect to fix this error). As a result there is additional churn to debug these issues because the true cause (1) vs (2) is not known.
This commit helps to fix the issue by separating out the 2 error messages depending on if we ran with unused parameter detection or not. Hopefully this should make the error message much more clear and actionable.
error msg with `find_unused_params=True`:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. Since `find_unused_parameters=True` is enabled, this likely means that not all `forward` outputs participate in computing loss. You can fix this by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
error msg without `find_unused_params` specified:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
```
ghstack-source-id: 122097900
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26496688
fbshipit-source-id: 4a9eeeda10293da13d94a692d10cb954e4506d7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51386
Adds stats such as rebuilt bucket stats, unused parameter stats, and performance stats to the DDP logging data.
1. GPU time stats are not collected for single-process multiple-device mode in this diff, as that requires events to be created and recorded on multiple devices.
2. Use the at::cuda::CUDAEvent API for safer calls.
3. Events may not be created in the autograd hook if the hook is not triggered in the user's code, e.g., the user runs in no-sync mode in some iterations. So we check whether events were created before synchronizing, and also skip invalid results.
4. Users may not set the device upfront, so we explicitly set the proper device before creating events in our prepare_forward() and prepare_backward() calls.
ghstack-source-id: 121933566
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D26158645
fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52385
This warning should specify that we did not find unused params in the
_forward_ pass, which is when we log this warning. This is to avoid confusion
when we get an error because not all outputs were used to compute loss, which
also raises an error about unused parameters (to be fixed in the next diff)
ghstack-source-id: 122001929
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26494136
fbshipit-source-id: d9b41732ea7e5e31b899d590d311080e3dc56682
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52031
Closes https://github.com/pytorch/pytorch/issues/52020
Ensures that we can profile collectives in DDP by propagating the profiler threadLocalState appropriately. As described in the above issue, before this wouldn't work as the profiler would only be enabled on the main thread.
ghstack-source-id: 121818080
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26356192
fbshipit-source-id: 0158b5833a3f857a0b4b2943ae3037e9d998dfd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51822
Adds support for shape recording when profiling distributed collectives, for the NCCL/Gloo backends. Added
both C++ and Python tests to ensure that shapes are recorded properly. Note that we don't add `ProcessGroupNCCLTest`s since they need to be modified to support a single process per device and world size > 1.
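A hedged sketch of how the recorded shapes surface from the Python profiler (assumes an initialized Gloo process group so a CPU tensor works; event naming may vary by backend):
```
import torch
import torch.distributed as dist
from torch.autograd.profiler import profile

t = torch.ones(32, 64)
with profile(record_shapes=True) as prof:
    dist.all_reduce(t)
# The collective's profiler event now carries the input shape [[32, 64]].
print(prof.key_averages(group_by_input_shape=True).table())
```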
ghstack-source-id: 121507509
Test Plan: CI
Reviewed By: mrzzd
Differential Revision: D26291739
fbshipit-source-id: 5f7bd54d8c36d17a4a29e172b25266ca3dbd8fbd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51759
Some unit tests actually register a comm hook on other backends like Gloo, for example `test_ddp_comm_hook_future_passing_cpu`.
Therefore, only do the check on `register_builtin_comm_hook`.
Currently the DDP communication hook can only be supported on NCCL, so add a check in the registration methods.
ghstack-source-id: 121115814
Test Plan: unit tests.
Reviewed By: pritamdamania87
Differential Revision: D26268581
fbshipit-source-id: c739fa4dca6d320202dc6689d790c2761c834c30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51370
TORCH_CHECK should be used when confirming the correctness of function
arguments like the tag passed to Gloo functions.
ghstack-source-id: 120908449
Test Plan: Sandcastle/CI
Reviewed By: mingzhe09088
Differential Revision: D26152359
fbshipit-source-id: ddffaa6f11393aaedaf0870759dc526d8d4530ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51066
The backend name of a process group created using the distributed_c10d Python API is tracked, but there is no good way to track the name of a process group created using the ProcessGroup C++ API. In some cases, knowing the backend name of a ProcessGroup is useful, e.g., to log the backend name, or to write code that depends on the known backend.
ghstack-source-id: 120628432
Test Plan: unit tests
Reviewed By: pritamdamania87
Differential Revision: D26059769
fbshipit-source-id: 6584c6695c5c3570137dc98c16e06cbe4b7f5503
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50622
1. Define a DDPLoggingData struct that is the placeholder for all the DDP-related logging fields.
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files.
3. Expose the get_ddp_logging_data() method in Python so that users can get the logging data and dump it in their applications (see the sketch after this list).
4. A unit test verifies that the logging data can be set and retrieved as expected.
5. Follow-ups will add more logging fields such as perf stats, internal states, env variables, etc.
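A hedged sketch of item (3); the method name follows the description above, and the contents of the returned struct depend on the fields added in this stack:
```
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group("gloo", ...) already ran.
model = DDP(nn.Linear(8, 8))
ddp_logging_data = model.get_ddp_logging_data()
print(ddp_logging_data)  # DDPLoggingData: construction-time and runtime fields
```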
ghstack-source-id: 120275870
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D25930527
fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
Summary:
In https://github.com/pytorch/pytorch/issues/42514, NCCL `alltoall_single` was already added. This PR adds NCCL `alltoall`.
The difference between `alltoall_single` and `alltoall` is: `alltoall_single` works on a single tensor and sends/receives slices of that tensor, while `alltoall` works on a list of tensors and sends/receives the tensors in that list.
cc: ptrblck ngimel
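A hedged sketch contrasting the two Python front-ends (assumes an initialized NCCL process group and one GPU per rank):
```
import torch
import torch.distributed as dist

world_size, rank = dist.get_world_size(), dist.get_rank()
device = torch.device("cuda", rank % torch.cuda.device_count())

# all_to_all_single: a single tensor per rank, exchanged in equal slices.
inp = torch.arange(4 * world_size, dtype=torch.float, device=device)
out = torch.empty_like(inp)
dist.all_to_all_single(out, inp)

# all_to_all: explicit lists of tensors, one entry per peer rank.
in_list = list(inp.chunk(world_size))
out_list = [torch.empty(4, dtype=torch.float, device=device) for _ in range(world_size)]
dist.all_to_all(out_list, in_list)
```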
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44374
Reviewed By: zhangguanheng66, mrshenli
Differential Revision: D24455427
Pulled By: srinivas212
fbshipit-source-id: 42fdebdd14f8340098e2c34ef645bd40603552b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50455
Certain systems only print logging messages for ERROR/WARN and the
error message that the watchdog is timing out a particular operation is pretty
important.
As a result, changing its level to ERROR instead of INFO.
ghstack-source-id: 119761029
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D25894795
fbshipit-source-id: 259b16c13f6cdf9cb1956602d15784b92aa53f17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50133
`find_unused_parameters=True` is only needed when the model has unused parameters that are not known at model definition time or differ due to control flow.
Unfortunately, many DDP users pass this flag in as `True` even when they do not need it, sometimes as a precaution to mitigate possible errors that may be raised (such as the error we raise when not all outputs are used). While this is a larger issue to be fixed in DDP, it would also be useful to warn once if we did not detect unused parameters.
The downside of this is that in the case of flow control models where the first iteration doesn't have unused params but the rest do, this would be a false warning. However, I think the warning's value exceeds this downside.
ghstack-source-id: 119707101
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D25411118
fbshipit-source-id: 9f4a18ad8f45e364eae79b575cb1a9eaea45a86c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50131
Noticed that in the internal diff for
https://github.com/pytorch/pytorch/pull/49069 there was a clang-tidy warning to
use emplace instead of push_back. This can save us a copy, since emplace
constructs the element in place and avoids an unnecessary temporary.
ghstack-source-id: 119560979
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D25800134
fbshipit-source-id: 243e57318f5d6e43de524d4e5409893febe6164c
Summary:
For a multi-GPU node, the rank-to-GPU mapping can differ.
Provide an optional parameter to specify the GPU device number for the
allreduce operation in the barrier function.
Add test cases to validate barrier device_ids.
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Fixes https://github.com/pytorch/pytorch/issues/48110
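A hedged usage sketch of the new optional parameter (NCCL backend; assumes ranks are mapped to local GPUs):
```
import torch
import torch.distributed as dist

local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
# Pin the barrier's internal allreduce to this rank's GPU instead of relying
# on the default rank-to-GPU guess.
dist.barrier(device_ids=[local_rank])
```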
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49069
Reviewed By: mrshenli
Differential Revision: D25658528
Pulled By: rohan-varma
fbshipit-source-id: 418198b6224c8c1fd95993b80c072a8ff8f02eec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49556
Implemented the missing Store functionality (specifically numKeys) in the FileStore.
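A small sketch of the newly supported query from Python (the path is a placeholder):
```
import torch.distributed as dist

store = dist.FileStore("/tmp/example_filestore", 1)
store.set("alpha", "1")
store.set("beta", "2")
# Number of keys currently stored (an implementation may also count an
# internal initialization key).
print(store.num_keys())
```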
Test Plan: Added both C++ and Python tests to verify functionality.
Reviewed By: jiayisuse
Differential Revision: D25619001
fbshipit-source-id: 9146d0da9e0903622be3035880f619bbb2cc3891
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49343
at::cuda::CUDAEvent is "lazy" and only creates an event when it's first recorded. Until then, at::cuda::CUDAEvent is empty. If we use at::cuda::CUDAEvent::query() this is taken into account (an empty event is always ready), but WorkNCCL extracts the raw cudaEvent_t value from at::cuda::CUDAEvent and calls cudaEventQuery manually and doesn't check this. This could cause a failure.
It's unclear if this is ever supposed to happen, but we're seeing that failure, and we want to sort it out in order to see if there's something "deeper" going on.
ghstack-source-id: 118532806
Test Plan: Unit tests
Reviewed By: SciPioneer
Differential Revision: D25537844
fbshipit-source-id: 506319f4742e1c0a02aa75ecc01112ea3be42d8f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49014
We extracted a generic and reusable CUDAFuture class from FutureNCCL, but we had left FutureNCCL around, as a subclass of CUDAFuture, in order to deal with some peculiarity of ProcessGroupNCCL, namely that the future would be completed right away when constructed and that its CUDA events would be _shared_ with the ones of the WorkNCCL. This required some "hacks" in CUDAFuture itself (protected members, fields wrapped in shared_ptrs, ...).
My understanding is that creating CUDA events is a rather cheap operation. That would mean that we could afford to record _twice_ the events after each NCCL call, once for the WorkNCCL and once for the future. By doing so, we can use the CUDAFuture class directly and revert all its hacks.
ghstack-source-id: 118391217
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25355272
fbshipit-source-id: 3a2a0891724928221ff0f08600675d2f5990e674
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48789
CUDAFuture aims to "capture" the current state of CUDA-related stuff when the future is marked complete (e.g., by looking at current streams and recording events on them) and then "replicate" a similar state when users synchronize with the result of the future (by synchronizing the current streams with these events).
However, one "contextual" aspect of CUDA that we weren't capturing/replicating was the current device. This diff tries to fix that. I must mention that we can only do this for callbacks, while we cannot do it for the wait() method. I don't know if such a discrepancy between the two actually makes the overall behavior _worse_. I'd love to hear people's opinions on this.
ghstack-source-id: 118081338
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210335
fbshipit-source-id: 1d1a3f80b1cc42e5114bc88554ed50617f1aaa90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48946
Move recordFunctionEndCallback to after the blocking portion of launching the NCCL kernel, and remove addCallback since it runs the lambda inline anyway and triggers unnecessary CUDA stream logic. If we want CUDA operations such as NCCL kernels accurately profiled, we should use the profiler with use_cuda=True. However, we are currently debugging a deadlock for the use_cuda=True case; the fix is being tracked in #48987.
To ensure that the tests are no longer flaky, submitted this PR to ci-all: #48947 and ran the test a bunch of times ssh'd into the CI machine.
ghstack-source-id: 118330130
Test Plan: CI
Reviewed By: mrzzd
Differential Revision: D25368322
fbshipit-source-id: 7d17036248a3dcd855e58addc383bba64d6bc391
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49145
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105
(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes the files inside the directories /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, and /fbsource/fbcode/caffe2/test/.
(2) Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` when necessary.
Test Plan:
Test build:
```
buck build mode/dev-nosan //caffe2/modules/detectron:
buck test mode/dev-nosan //caffe2/modules/detectron:
buck build mode/dev-nosan //caffe2/torch/fb/:
buck test mode/dev-nosan //caffe2/torch/fb/:
```
To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.
Reviewed By: r-barnes
Differential Revision: D25452852
fbshipit-source-id: d6657edab612c9e0fa99b29c68460be8b1a20064
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48788
CUDAFuture needs to inspect the value it contains in order to first determine what devices its tensors reside on (so that it can record events on those devices), and then to record these tensors with the caching allocator when they are used in other streams. Extracting data ptrs can become somewhat expensive (especially if we resort to using the pickler to do that), hence it's probably a good idea to cache the result the first time we compute it.
ghstack-source-id: 118180023
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25303486
fbshipit-source-id: 5c541640f6d19249dfb5489ba5e8fad2502836fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48506
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL is now a general-purpose type-agnostic multi-device class, so in this commit I extract it from ProcessGroupNCCL to make it available for wider use (notably by the RPC module). We'll call this new class CUDAFuture. We'll keep FutureNCCL as a subclass of CUDAFuture to deal with some NCCL peculiarity, namely the fact that the future becomes complete immediately upon creation. We can clean this up for good once we're done merging Future and Work.
I'm not exactly sure of where to put CUDAFuture. It needs to be available to both c10d and RPC (which lives under torch/csrc). If I figured CMake out correctly (and that's a big if) I think c10d can only depend on ATen (I'll maybe add a comment with how I tracked that down). Hence we cannot put CUDAFuture in torch/csrc. On the other hand, RPC currently depends on c10d, because RPC agents use ProcessGroups internally, so it would be "ok" to put CUDAFuture in c10d. However, we want to get rid of ProcessGroups in RPC, and at that point RPC should in principle not depend on c10d. In that case, the only shared dep between the two that I see is ATen itself.
While I'm a bit wary of putting it right in ATen, I think it might actually make sense. CUDAFuture is intended to be a general-purpose component that can be reused in all settings and is not particularly tied to c10d or RPC. Moreover, ATen already contains ivalue::Future, and it contains a lot of CUDA helpers, so CUDAFuture definitely belongs to the "closure" of what's already there.
ghstack-source-id: 118180030
Test Plan: Unit tests?
Reviewed By: wanchaol
Differential Revision: D25180532
fbshipit-source-id: 697f655240dbdd3be22a568d5102ab27691f86d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48505
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...
The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In the previous commit, I split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In this commit, I'm removing these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180032
Test Plan: Unit tests
Reviewed By: wanchaol
Differential Revision: D25180535
fbshipit-source-id: 19181fe133152044eb677062a9e31e5e4ad3c03c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48504
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL isn't just adding CUDA support to ivalue::Future, it's also reimplementing a lot of the latter's logic (by overriding plenty of its methods). That's brittle, as whenever a new method is added to ivalue::Future there's a risk of forgetting to add it to FutureNCCL, and in such a case calling this method on FutureNCCL would defer to the base class and give inconsistent results (e.g., future not being completed when it actually is). This _is already happening_, for example with the waitAndThrow or hasError, which are not implemented by FutureNCCL. In addition, this creates duplication between the two classes, which could lead to inconsistencies of behavior, bugs, missing features, ...
The best solution would be to keep the core future logic in ivalue::Future, and have _only_ the CUDA additions in FutureNCCL. That's what we're going to do, in two steps. In this commit, I'll split the CUDA features into separate hooks, which are called by FutureNCCL's other methods. In the next commit, I'll remove these latter methods, and invoke the hooks directly from ivalue::Future.
ghstack-source-id: 118180025
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25180534
fbshipit-source-id: 7b3cd374aee78f6c07104daec793c4d248404c61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48502
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL restricted the values to be tensors, or (singleton) lists of tensors, or Python object that could be converted to either of those types. We need a CUDA future that can handle more generic types though.
The main challenge is extracting all DataPtrs from an arbitrary object. I think I found some ways of doing so, but I'd like some JIT experts to look into this and tell me if there are better ways. I'll add inline comments for where their input would be appreciated.
ghstack-source-id: 118180026
Test Plan: Unit tests (I should probably add new ones)
Reviewed By: wanchaol
Differential Revision: D25177562
fbshipit-source-id: 1ef18e67bf44543c70abb4ca152f1610dea4e533
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48501
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL stores a set of devices (on which the tensors in the data reside) and a CUDA event for each of those devices. In fact, each event instance also already contains the device it belongs to, which means we can avoid storing that information separately (with the risk that it'll be mismatched and/or inaccurate).
ghstack-source-id: 118180024
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177554
fbshipit-source-id: 64667c176efc2a7dafe99457a1fbba5d142cb06c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48500
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
After the previous changes, this is now much simpler than it sounds. For the most part it just consists of repeating some operations multiple times, once per device (e.g., recording and blocking on events). Funnily, we already had a vector of events, even though we only ever stored one element in it (this probably comes from the fact that this is shared with WorkNCCL, which can hold more than one event). Here, we now also store a vector of device indices.
Perhaps the only non-trivial part of this is that now, for "follow-up" Futures (for callbacks), we can't know in advance which device the result will be on, so we must determine it dynamically when we receive the result, by inspecting it. That's also easier than it sounds because we already have a dataptr extractor.
ghstack-source-id: 118180022
Test Plan: Unit tests (I should probably add new ones)
Reviewed By: mrshenli
Differential Revision: D25177556
fbshipit-source-id: 41ef39ec0dc458e341aa1564f2b9f2b573d7fa9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48563
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
The CUDA caching allocator requires us to register all streams in which a DataPtr is used. We already do so when we invoke a callback, for which we obtain streams from the ATen pool. However, we didn't do so when the user waits for the Future and then uses the results in their current streams. This was probably fine in most cases, because the outputs of the NCCL ops (which are the tensors we're dealing with here) were user-provided, and thus already registered with some user streams, but in principle the user could use different streams when waiting than the ones they used to create the tensors. (If they use the same streams, registering becomes a no-op.) But, more importantly, this change will help us turn FutureNCCL into a more general-purpose class, as for example in RPC the tensors of the result are allocated by PyTorch itself and thus we need to record their usage on the user's streams with the caching allocator.
ghstack-source-id: 118180033
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210338
fbshipit-source-id: e0a4ba157653b74dd84cf5665c992ccce2dea188
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48503
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
My impression is that one property of the upstream Future class is that once .wait() returns, or once a callback is invoked, then .completed() should return True. This was not the case for FutureNCCL because .wait() would return immediately, and callbacks would be invoked inline, but .completed() could return False if the CUDA async operations hadn't completed yet.
That was odd and confusing. Since there are other ways for users to check the status of CUDA operations (if they really need, and typically I don't think it's so common), perhaps it's best to avoid checking the status of CUDA events in .completed().
ghstack-source-id: 118180028
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25180531
fbshipit-source-id: e1207f6b91f010f278923cc5fec1190d0fcdab30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48499
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
We can merge and "hide" a whole bunch of CUDA-related logic if we store and record the CUDA events that correspond to the completion of a FutureNCCL when we call markCompleted (rather than splitting it between the constructor, the `then` method, and a wrapper around the callback).
A more concrete reason for this change is that soon I'll add support for multi-device, and in that case we can't necessarily know in advance which devices a value will be on until we get that value (and we don't want to record an event on all devices as then we might "over-synchronize").
To me, this also makes more conceptual sense: the moment when we store a value on the future, which is the "signal" that the future is now ready, should also be the time at which we record the events needed to synchronize with that value. Though this may just be personal preference.
ghstack-source-id: 118180034
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177557
fbshipit-source-id: 53d4bcdfb89fa0d11bb7b1b94db5d652edeb3b7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48498
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL has a dedicated CUDA stream that it sets as current when running callbacks. This stream is initialized by the ProcessGroupNCCL by extracting it from the global ATen pool.
In order to decouple FutureNCCL from that specific ProcessGroup and make it more generic, in this commit we make FutureNCCL extract a fresh stream from the ATen pool each time it needs one.
This introduces a functional change, because it removes the implicit synchronization and ordering between the callbacks of a same Future. In fact, such an ordering is hard to guarantee in the general case as, for example, a user could attach a new callback just after the future becomes completed, and thus that callback would be run inline, immediately, out-of-order wrt the other callbacks. (There are ways to "fix" this but they are complicated). NCCL got around this because its futures are already marked complete when they're returned, but in fact it could also run into issues if multiple threads were adding callbacks simultaneously.
Note that it remains still possible to enforce ordering between callbacks, but one must now do so explicitly. Namely, instead of this:
```
fut.then(cb1)
fut.then(cb2)
```
one must now do:
```
fut.then(cb1).then(cb2)
```
ghstack-source-id: 118180029
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177559
fbshipit-source-id: 4d4e73ea7bda0ea65066548109b9ea6d5b465599
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48497
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
When we record the events to mark a "follow-up" future complete (for a callback), we used to record them onto the dedicated stream, but that stream is the current stream at that time, so instead we could just record them onto the current stream. This introduces no functional differences. The reason I'm adding such an additional layer of indirection is so that the dedicated stream is only referenced inside the `addCallback` method, which will later allow us to more easily change how that stream works.
ghstack-source-id: 118180035
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177553
fbshipit-source-id: c6373eddd34bd399df09fd4861915bf98fd50681
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48496
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
There are two ways to add a callback to a Future: `then` and `addCallback` (with the former deferring to the latter). FutureNCCL only "patched" `then`, which caused `addCallback` to be unsupported. By patching `addCallback`, on the other hand, we cover both.
The high-level goal of this change though is to remove all CUDA-specific stuff from `then`, and move it to either `markCompleted` or to a wrapper around the callback. This will take a few more steps to achieve.
ghstack-source-id: 118180031
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177558
fbshipit-source-id: ee0ad24eb2e56494c353db700319858ef9dcf32b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48562
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
In this commit I'm adding a few asserts to the constructors of FutureNCCL to make sure that what's passed in is what we expect (fun fact: until two commits ago that wasn't the case, as we were passed some empty events).
I'm also making the second constructor private, as it's only supposed to be used by the then() method.
ghstack-source-id: 118180036
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210333
fbshipit-source-id: d2eacf0f7de5cc763e3cdd1ae5fd521fd2eec317
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48495
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
PythonFutureWrapper needs to provide a GIL-aware way to extract tensors from an IValue of type PyObject. Since this was only used by FutureNCCL it was guarded by #ifdef USE_C10D_NCCL. However, we will need to use it with CUDA-aware futures other than the NCCL one. This might have been achieved simply by replacing USE_C10D_NCCL with USE_CUDA, but I wanted to clean this up better.
We're dealing with two independent dimensions: C++-vs-Python and CPU-vs-CUDA. To make the code more modular, the two dimensions should be dealt with by orthogonal solutions: the user setting a custom callback to handle Python, and the subclass being CUDA-aware. Mixing these two axes makes it more complicated.
Another reason for changing how this works is that later on, when we'll introduce multi-device support, we'll need to extract dataptrs for other reasons too (rather than just recording streams with the caching allocator), namely to inspect the value to determine which devices it resides on.
ghstack-source-id: 118180038
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177560
fbshipit-source-id: 3a424610c1ea191e8371ffee0a26d62639895884
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48561
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
WorkNCCL allows to extract a FutureNCCL through getFuture(). There is one instance of this method being called by ProcessGroupNCCL itself, in order to attach a callback to it. This was happening _before_ the work was actually launched, however FutureNCCL does _always_ invoke its callbacks immediately inline. The events that the FutureNCCL was using hadn't been recorded yet, thus blocking on them was a no-op. Moreover, the function that was being called was installed by the generic ProcessGroup superclass, which is not CUDA-aware, and thus probably didn't make any use of the CUDA events or streams.
383abf1f0c/torch/lib/c10d/ProcessGroup.cpp (L66)
In short: I believe that creating a FutureNCCL and attaching a callback was equivalent to just invoking that function directly, without any CUDA-specific thing. I'm thus converting the code to do just that, in order to simplify it.
Note that, given the comment, I don't think this was the original intention of that code. It seems that the function was intended to be run once the work finished. However, I am not familiar with this code, and I don't want to introduce any functional changes.
ghstack-source-id: 118180037
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210337
fbshipit-source-id: 54033c814ac77641cbbe79b4d01686dfc2b45495
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49105
(1) Add a safety check `C10_CUDA_KERNEL_LAUNCH_CHECK()` after each kernel launch. This diff only changes the files inside the directories /fbsource/fbcode/caffe2/modules/, /fbsource/fbcode/caffe2/fb/, and /fbsource/fbcode/caffe2/test/.
(2) Get rid of old check `AT_CUDA_CHECK(cudaGetLastError())` when necessary.
Test Plan:
Test build:
```
buck build //caffe2/modules/detectron:
buck build //caffe2/torch/fb/:
```
To check for launches without checks:
```
python3 caffe2/torch/testing/check_kernel_launches.py
```
Make sure none of the updated files are in the returned list.
Reviewed By: r-barnes
Differential Revision: D25325039
fbshipit-source-id: 2043d6e63c7d029c35576d3101c18247ffe92f01
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48757
Add an index field to GradBucket, so error_dict is keyed by this index instead of the hashcode of the input tensor. The replacement will be done in a separate diff, as the definition of this new method somehow couldn't be recognized in the OSS version.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 117939208
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
Reviewed By: rohan-varma
Differential Revision: D25288496
fbshipit-source-id: 6f71977809690a0367e408bd59601ee62c9c03ea
Summary:
This diff enables JIT serialization of `ProcessGroup`, including both base `ProcessGroup` class and derived classes like `ProcessGroupNCCL`.
If a `ProcessGroup` is created via high-level APIs like `dist_c10d.frontend().new_process_group_helper()`, they are automatically serializable. If a `ProcessGroup` is created via its derived class TorchBind APIs like `dist_c10d.ProcessGroupNCCL()`, then it has to be given a name and registered with `dist_c10d.frontend().register_process_group_name` to be uniquely identifiable and serializable.
* Fixed a minor bug in the new dist_c10d frontend which failed to check whether a process group is used or not
* Fixed an issue where `test_jit_c10d.py` wasn't really run due to a configuration bug. Now the tests run as slow tests (needs a ci-all/* branch)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48544
Reviewed By: wanchaol
Differential Revision: D25298309
Pulled By: gmagogsfm
fbshipit-source-id: ed27ce37373c88277dc0c78704c48d4c19d46d46
Summary:
Enable TCPStore for DDP on the Windows platform, in order to improve cross-machine DDP performance.
Related RFC is https://github.com/pytorch/pytorch/issues/47659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47749
Reviewed By: bdhirsh
Differential Revision: D25220401
Pulled By: mrshenli
fbshipit-source-id: da4b46b42296e666fa7d8ec8040093de7443a529
Summary:
This PR aims to reduce the import overhead and symbol noise from the `windows.h` headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48009
Reviewed By: gchanan
Differential Revision: D25045840
Pulled By: ezyang
fbshipit-source-id: 01fda70f433ba2dd0cd2d7cd676ab6ffe9d98b90
Summary:
Add TorchBind-binding for ProcessGroup class.
Currently there are a few limitations of TorchBind that prevent us from fully matching the existing PyBind binding of ProcessGroup:
- TorchBind doesn't support method overloading. The current PyBind binding uses overloading extensively to provide a flexible API, but TorchBind (and the TorchScript ClassType behind it) doesn't yet support it. Therefore, we can provide at most one version of an API under each name.
- TorchBind doesn't support C++ enums yet. This prevents us from making real use of XXXOptions, which are widely used in many APIs.
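A minimal TorchBind sketch of the first limitation; the class and method below are hypothetical, not the real ProcessGroup binding:
```
#include <torch/custom_class.h>

// Hypothetical class; only illustrates that each method name can be bound once.
struct MyGroup : torch::CustomClassHolder {
  explicit MyGroup(int64_t size) : size_(size) {}
  int64_t size() { return size_; }
  int64_t size_;
};

static auto registration =
    torch::class_<MyGroup>("c10d_example", "MyGroup")
        .def(torch::init<int64_t>())
        // A second .def("size", ...) with different arguments is not possible,
        // so only one variant of an API can be exposed under a given name.
        .def("size", &MyGroup::size);
```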
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47907
Reviewed By: wanchaol
Differential Revision: D24945814
Pulled By: gmagogsfm
fbshipit-source-id: e103d448849ea838c10414068c3e4795db91ab1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47549
In preparation for moving state onto the heap.
ghstack-source-id: 117027862
Test Plan: CI
Reviewed By: ilia-cher
Differential Revision: D24812214
fbshipit-source-id: 1455c2782b66f6a59c4d45ba58e1c4c92402a323
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47246
We crash the process in NCCL Async Error Handling if the collective
has been running for longer than a set timeout. This PR adds more
information about the rank and the duration the collective ran.
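A rough sketch of the log added before crashing (names are illustrative, not the exact ProcessGroupNCCL code):
```
#include <c10/util/Logging.h>
#include <chrono>
#include <cstdlib>

// Illustrative helper: log which rank timed out and for how long the
// collective ran before tearing the process down.
[[noreturn]] void abortTimedOutCollective(
    int rank,
    std::chrono::milliseconds ranFor,
    std::chrono::milliseconds timeout) {
  LOG(ERROR) << "[Rank " << rank << "] NCCL collective ran for "
             << ranFor.count() << " ms, exceeding the timeout of "
             << timeout.count() << " ms; aborting the process.";
  std::abort();
}
```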
ghstack-source-id: 116676182
Test Plan: Run desync tests and flow.
Reviewed By: pritamdamania87
Differential Revision: D24695126
fbshipit-source-id: 61ae46477065a1a451dc46fb29c3ac0073ca531b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47926
Given that we're soon enabling async error handling in PET, we should make the behavior explicit when users have set NCCL_BLOCKING_WAIT in their own code while also using PET. This PR essentially gives blocking wait precedence (for now). This way the blast radius of the PET change is smaller, while we continue working with blocking wait users and discussing whether moving to async error handling may be a good fit.
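A sketch of the precedence logic (member names are illustrative; the actual ProcessGroupNCCL wiring may differ):
```
#include <c10/util/Logging.h>

// If both flags were parsed from the environment as set, blocking wait wins
// for now and async error handling is turned off.
void resolveNcclHandlingPrecedence(bool& blockingWait, bool& asyncErrorHandling) {
  if (blockingWait && asyncErrorHandling) {
    LOG(INFO) << "Both NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING are set; "
                 "giving NCCL_BLOCKING_WAIT precedence and disabling async error handling.";
    asyncErrorHandling = false;
  }
}
```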
ghstack-source-id: 116553583
Test Plan: Simple FBL run/CI
Reviewed By: jiayisuse
Differential Revision: D24928149
fbshipit-source-id: d42c038ad44607feb3d46dd65925237c564ff7a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47203
1. Create a new field in BucketReplica to store the size info for each variable.
2. Export the sizes list, along with lengths and offsets, to GradBucket.
These fields are needed for PowerSGD.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 115875194
Test Plan: Checked the field values from log.
Reviewed By: rohan-varma
Differential Revision: D24644137
fbshipit-source-id: bcec0daf0d02cbf25389bfd9be90df1e6fd8fc56
Summary:
* This is a pre-step to build c10d into libtorch
* Includes a minor cleanup in c10d/CMakeLists.txt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47309
Reviewed By: wanchaol
Differential Revision: D24711768
Pulled By: gmagogsfm
fbshipit-source-id: 6f9e0a6a73c30f5ac7dafde9082efcc4b725dde1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47270
This is almost the same as #46959, except that in caffe2/torch/nn/parallel/distributed.py, BuiltinCommHookType should be imported conditionally, only when dist.is_available(). Otherwise, this Python enum type defined in caffe2/torch/csrc/distributed/c10d/init.cpp cannot be imported. See https://github.com/pytorch/pytorch/issues/47153
I tried to follow the approach used for another enum type, ReduceOp, defined in the same file, but it did not work, because that C++ enum class is defined in the torch/lib/c10d library, while BuiltinCommHookType is defined in the torch/csrc/distributed library. These two libraries are compiled in two different ways.
To avoid adding typing to the distributed package, which could be a project of its own, I simply removed the BuiltinCommHookType arg type in this file.
To review the diff on top of #46959, compare V1 vs Latest:
https://www.internalfb.com/diff/D24700959?src_version_fbid=270445741055617
Main Changes in V1 (#46959):
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115783237
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
//arvr/projects/eye_tracking/Masquerade:python_test
USE_DISTRIBUTED=0 USE_GLOO=0 BUILD_TEST=0 USE_CUDA=1 USE_MKLDNN=0 DEBUG=0 python setup.py install
Reviewed By: mrshenli
Differential Revision: D24700959
fbshipit-source-id: 69f303a48ae275aa856e6e9b50e12ad8602e1c7a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47244
This is an event-logging based update that should allow us to collect high-quality data about how many times the NCCL Async Error Handling mechanism is triggered. This logs an event called `ProcessGroupNCCL.WorkNCCL.handleNCCLGuard`, which is recorded as an entry in the `scuba_caffe2_pytorch_usage_stats` Scuba table. This Scuba entry will also contain metadata like workflow status, entitlement, hostnames, and workflow names, which will give us insight into what workloads/domains and machines are benefiting from async error handling. It also contains the Flow Run ID, which can be used as a join key with the `fblearner_workflow_run_status` scuba table for additional information like final error message, etc. We can easily quantify the number of times the async handling code was triggered by querying the `scuba_caffe2_pytorch_usage_stats` table.
As a demonstration, I ran the following workflow with this diff patched: f229675892
Since the workflow above causes a desync, the `handleNCCLGuard` event is logged in scuba soon. See here for the filtered table: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio
As you can see, there are 4 entries. The workflow above uses 3 GPUs, 2 of which run into the desync scenario and are crashed using async error handling. We make this fail twice before succeeding the 3rd time, hence 4 entries.
ghstack-source-id: 115708632
Test Plan: Did a quick demo as described above. Scuba entries with the logs can be found here: https://www.fburl.com/scuba/scuba_caffe2_pytorch_usage_stats/tmp1uvio
Reviewed By: jiayisuse
Differential Revision: D24688739
fbshipit-source-id: 7532dfeebc53e291fbe10d28a6e50df6324455b1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46959
1. Implemented the Pybind part.
2. In the reducer, once the builtin_comm_hook_type is set, a c++ comm hook instance will be created in Reducer::autograd_hook.
3. Added unit tests for the built-in comm hooks.
Original PR issue: C++ DDP Communication Hook https://github.com/pytorch/pytorch/issues/46348
ghstack-source-id: 115629230
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
Reviewed By: pritamdamania87
Differential Revision: D24471910
fbshipit-source-id: f96b752298549ea2067e2568189f1b394abcd99a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46274
We crash the process in NCCL Async Error Handling if the collective
has been running for longer than a set timeout. This PR logs more
information about the rank and the duration the collective ran before throwing an exception.
ghstack-source-id: 115614622
Test Plan:
Run desync tests and flow. Here are the Flow runs showing the right messages: f225031389
f225032004
Reviewed By: jiayisuse
Differential Revision: D24200144
fbshipit-source-id: 02d48f13352aed40a4476768c123d5cebbedc8e0
Summary:
For similar reasons as documented in the `[Sync Streams]` note. For a current example, `ProcessGroupNCCL::allgather` must also call `recordStream` and does so already.
The output tensor is created on the default stream (by the application). NCCL/RCCL internally uses another stream (i.e., ncclStream). If we do not record the output tensor on the ncclStream, there is a chance that the output tensor might be deallocated while NCCL/RCCL is using it.
The application is not aware of the ncclStream since it's internal to ProcessGroupNCCL. So, the application cannot record the output tensor on the ncclStream.
Patch originally developed by sarunyap.
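A minimal sketch of the pattern, with illustrative surrounding names (not the exact ProcessGroupNCCL code):
```
#include <ATen/ATen.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

void enqueueOnNcclStream(const at::Tensor& output, c10::cuda::CUDAStream ncclStream) {
  // Tell the caching allocator that `output` is used on ncclStream, so its
  // memory is not reused for another allocation until the work queued on
  // ncclStream has finished with it.
  c10::cuda::CUDACachingAllocator::recordStream(
      output.storage().data_ptr(), ncclStream);
  // ... launch the NCCL/RCCL collective on ncclStream here ...
}
```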
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46603
Reviewed By: srinivas212
Differential Revision: D24458530
fbshipit-source-id: b02e74d1c3a176ea1b9bbdd7dc671b221fcadaef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45318
When calling `then()` from WorkNCCL, record the input data pointers in futureNCCLCallbackStream_ before the execution of the input callback.
Note that the recording cannot be directly added to the lambda used by addCallback in ProcessGroupNCCL.hpp. This is because the type of the future value in that context is a pyobject rather than a TensorList, and a type cast would require pybind and introduce a Python dependency, which should not be allowed in the c10d library.
I have considered creating a util function in a separate file to support this type cast, and then placing it under the torch/csrc directory where a Python dependency is allowed. However, torch/csrc has a dependency on c10d, so this would create a circular dependency.
Finally, a `record_stream_cb_` member is added to FutureNCCL, and the default value is nullptr. A default `record_stream_cb_` implementation is added to `PythonFutureWrapper`, where a Python dependency is allowed.
In addition, a few lines are reformatted by lint.
caffe2/torch/csrc/distributed/c10d/init.cpp is only reformatted.
#Closes: https://github.com/pytorch/pytorch/issues/44203
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- ProcessGroupNCCLTest
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_accumulate_gradients_no_sync_allreduce_with_then_hook
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_with_then_hook_nccl
Reviewed By: pritamdamania87
Differential Revision: D23910257
fbshipit-source-id: 66920746c41f3a27a3689f22e2a2d9709d0faa15
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46051
Creates a subroutine for aborting timed out collectives. This should help modularize the ncclCommWatchdog a bit, since it is growing too large.
ghstack-source-id: 114398496
Test Plan:
Successful Flow Run:
f225037915
f217609101
Reviewed By: jiayisuse
Differential Revision: D23607535
fbshipit-source-id: 0b1c9483bcd3a41847fc8c0bf6b22cdba01fb1e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46045
PG NCCL functionality differs based on certain binary environment
variables such as NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING. Previously
we had separate helper functions to parse these env vars and set class variables
accordingly. This PR introduces a single general-purpose function for this.
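A sketch of the kind of general-purpose helper meant here (the actual name and error handling may differ):
```
#include <cstdlib>
#include <stdexcept>
#include <string>

// Returns true/false for a "binary" env var that must be unset, "0", or "1".
bool parseEnvVarFlag(const char* varName) {
  const char* val = std::getenv(varName);
  if (val == nullptr) {
    return false;  // unset means the feature is disabled
  }
  const std::string s(val);
  if (s == "1") {
    return true;
  }
  if (s == "0") {
    return false;
  }
  throw std::runtime_error(
      std::string("Invalid value for environment variable ") + varName +
      ", expected 0 or 1");
}

// e.g. blockingWait_ = parseEnvVarFlag("NCCL_BLOCKING_WAIT");
```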
ghstack-source-id: 114209823
Test Plan:
Ran the following flow with NCCL_BLOCKING_WAIT set, and ensured the
ProcessGroup constructor set blockingWait_ to true: f223454701
Reviewed By: jiayisuse
Differential Revision: D24173982
fbshipit-source-id: b84db2dda29fcf5d163ce8860e8499d5070f8818
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46049
Adding support for the deleteKey API in the c10d HashStore.
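A sketch of the sort of check this adds, together with the getNumKeys API from the adjacent HashStore change in this stack (assuming the gtest-based store tests and a `deleteKey(key)` method on the store):
```
#include <cstdint>
#include <vector>
#include <gtest/gtest.h>
#include <c10d/HashStore.hpp>

TEST(HashStoreTest, DeleteKey) {
  c10d::HashStore store;
  store.set("foo", std::vector<uint8_t>{1, 2, 3});
  store.set("bar", std::vector<uint8_t>{4});
  EXPECT_EQ(store.getNumKeys(), 2);
  // Deleting an existing key should leave only the remaining one behind.
  store.deleteKey("foo");
  EXPECT_EQ(store.getNumKeys(), 1);
}
```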
ghstack-source-id: 113874207
Test Plan:
Added C++ tests to check whether the deleteKey function works, and
whether it raises an exception when attempting to delete non-existing keys.
Reviewed By: jiayisuse
Differential Revision: D24067657
fbshipit-source-id: 4c58dab407c6ffe209585ca91aa430850261b29e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46048
This PR adds support for the getNumKeys API for the HashStore
ghstack-source-id: 113874241
Test Plan: Added C++ tests for the HashStore::getNumKeys
Reviewed By: jiayisuse
Differential Revision: D24067658
fbshipit-source-id: 2db70a90f0ab8ddf0ff03cedda59b45ec987af07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45950
A pain point for debugging failed training jobs due to NCCL errors has
been understanding the source of the error, since NCCL does not itself report
too many details (usually just "unhandled {system, cuda, internal} error").
In this PR, we add some basic debug information about what went wrong. The information is collected by grepping the NCCL codebase for when these errors are thrown. For example, `ncclSystemError` is what is thrown when system calls such as malloc or munmap fail.
Tested by forcing `result = ncclSystemError` in the macro. The new error
message looks like:
```
RuntimeError: NCCL error in:
caffe2/torch/lib/c10d/ProcessGroupNCCL.cpp:759, unhandled system error, NCCL
version 2.7.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```
The last line is what we have added to the message.
In the future, we will also evaluate setting NCCL_DEBUG=WARN, by which NCCL
provides more details about errors as well.
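A sketch of the mapping idea; the helper name and exact strings are illustrative:
```
#include <nccl.h>
#include <string>

// Map an ncclResult_t to a human-readable hint that is appended to the
// existing "NCCL error in: ..." message.
std::string ncclErrorDetail(ncclResult_t result) {
  switch (result) {
    case ncclSystemError:
      return "ncclSystemError: System call (socket, malloc, munmap, etc) failed.";
    case ncclUnhandledCudaError:
      return "ncclUnhandledCudaError: Call to CUDA function failed.";
    case ncclInternalError:
      return "ncclInternalError: Internal check failed.";
    case ncclInvalidArgument:
      return "ncclInvalidArgument: Invalid value for an argument.";
    case ncclInvalidUsage:
      return "ncclInvalidUsage: This usually reflects invalid usage of the NCCL library.";
    default:
      return "Unknown NCCL error.";
  }
}
```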
ghstack-source-id: 114219288
Test Plan: CI
Reviewed By: mingzhe09088
Differential Revision: D24155894
fbshipit-source-id: 10810ddf94d6f8cd4989ddb3436ddc702533e1e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46265
tl;dr - we must remove tensor-related logging from the
WorkNCCL::operator<< function, otherwise printing the work objects tracked in
the workMetaList_ will cause segfaults.
The Work objects we track in the workMetaList for the NCCL Async Error
Handling mechanism don't have any `outputs_`. As described in the workEnqueue
function, destructing the output tensors calls into autograd_meta, which
happens in the user thread, but our system destructs work objects in the
workCleanupThread, so this could lead to a deadlock scenario. We avoid this
problem by not tracking the tensors in the work objects in the workMetaList
(it's called work meta list because these work objects only track the metadata
and not the actual tensors), so when the WorkNCCL::operator<< function tried to
log tensor shapes for work objects from the watchdog thread, the async error
handling mechanism hung (in the desync test) or segfaulted (in the desync
flow). This PR removes the tensor-related logging from the operator<< function.
ghstack-source-id: 114192929
Test Plan: Verified that this fixes the desync test and desync flow.
Reviewed By: jiayisuse
Differential Revision: D24268204
fbshipit-source-id: 20ccb8800aa3d71a48bfa3cbb65e07ead42cd0dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44988
The new NCCL async error handling feature throws an exception from the
workCleanupThread if one of the NCCL operations encounters an error or times
out. This PR adds an error log to make it clearer to the user why the
training process crashed.
ghstack-source-id: 114002493
Test Plan:
Verified that we see this error message when running with the desync
test.
Reviewed By: pritamdamania87
Differential Revision: D23794801
fbshipit-source-id: 16a44ce51f01531062167fb762a8553221363698
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46067
In our store tests, we expect there to be an exception when we call
get on a recently-deleted key. Unfortunately, the store waits for the timeout
period for the key to be set before throwing, which causes the tests to
idle-wait for 5+ minutes. This PR decreases the timeouts before this set call so
these tests run faster.
ghstack-source-id: 113917315
Test Plan: Ran both the Python and C++ tests.
Reviewed By: pritamdamania87
Differential Revision: D24208617
fbshipit-source-id: c536e59ee305e0c01c44198a3b1a2247b8672af2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46010
When training jobs running with NCCL fail, it is sometimes hard to
debug the reason for the failure, and our logging doesn't always provide enough
information to narrow down the issue.
To improve the debugging experience, I've enhanced our logging to add a lot
more information about what the ProcessGroup is doing under the hood.
#Closes: https://github.com/pytorch/pytorch/issues/45310
Sample output:
```
> I1002 15:18:48.539551 1822062 ProcessGroupNCCL.cpp:528] [Rank 2] NCCL watchdog thread started!
> I1002 15:18:48.539533 1821946 ProcessGroupNCCL.cpp:492] [Rank 2] ProcessGroupNCCL initialized with following options:
> NCCL_ASYNC_ERROR_HANDLING: 0
> NCCL_BLOCKING_WAIT: 1
> TIMEOUT(ms): 1000
> USE_HIGH_PRIORITY_STREAM: 0
> I1002 15:18:51.080338 1822035 ProcessGroupNCCL.cpp:530] [Rank 1] NCCL watchdog thread terminated normally
> I1002 15:18:52.161218 1821930 ProcessGroupNCCL.cpp:385] [Rank 0] Wrote aborted communicator id to store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:18:52.161238 1821930 ProcessGroupNCCL.cpp:388] [Rank 0] Caught collective operation timeout for work: WorkNCCL(OpType=ALLREDUCE, TensorShape=[10], Timeout(ms)=1000)
> I1002 15:18:52.162120 1821957 ProcessGroupNCCL.cpp:530] [Rank 0] NCCL watchdog thread terminated normally
> I1002 15:18:58.539937 1822062 ProcessGroupNCCL.cpp:649] [Rank 2] Found key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, from rank: 0, aborting appropriate communicators
> I1002 15:19:34.740937 1822062 ProcessGroupNCCL.cpp:662] [Rank 2] Aborted communicators for key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:19:34.741678 1822062 ProcessGroupNCCL.cpp:530] [Rank 2] NCCL watchdog thread terminated normally
```
ghstack-source-id: 113961408
Test Plan: waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D24183463
fbshipit-source-id: cb09c1fb3739972294e7edde4aae331477621c67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45926
torch/csrc/cuda/nccl.cpp is compiled as part of the torch_cuda library, and thus calling this function from ProcessGroupNCCL.cpp avoids linking a 2nd instance of libnccl.a into torch_python.
Fixes a similar issue to https://github.com/pytorch/pytorch/issues/42517
ghstack-source-id: 113910530
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24147802
fbshipit-source-id: d8901fdb31bdc22ddca2364f8050844639a1beb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45873
This diff adds support for sending/receiving to/from self. It also fixes a bug that occurred when p2p operations are not used by all processes.
ghstack-source-id: 113910526
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24124413
fbshipit-source-id: edccb830757ac64f569e7908fec8cb2b43cd098d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45900
Use `torch::cuda::nccl::all2all` from `ProcessGroupNCCL.cpp`
Fixes https://github.com/pytorch/pytorch/issues/42517
Here is a NCCL dependency graph:
```
libnccl.a --> libtorch_cuda.so ---> libtorch_python.so
   |                                     ^
   |                                     |
   --------> libc10d.a ------------------
```
When a static library is linked into a dynamic library or an executable, the linker removes all unused/duplicate symbols from that library, unless the `-whole-archive` option is used. Before https://github.com/pytorch/pytorch/pull/42514, all nccl calls made from `ProcessGroupNCCL.cpp` were also made from `torch/csrc/cuda/nccl.cpp`, which is compiled as part of `libtorch_cuda.so`.
But adding `ncclSend`|`ncclRecv` to ProcessGroupNCCL.cpp forced the linker to embed those into `libtorch_python.so`, which also resulted in linking other dependent symbols into the library.
This PR adds the `nccl[Send|Recv]` calls to `torch_cuda.so` by implementing `all2all` in `torch_cuda` and thus avoids double linking the static library.
A more involved, but less error-prone, solution would be to use the wrappers exported in the `torch::cuda::nccl` namespace, instead of making direct NCCL API calls.
Test Plan: Imported from OSS
Reviewed By: mingzhe09088
Differential Revision: D24138011
Pulled By: malfet
fbshipit-source-id: 33305197fc7d8707b7fd3a66b543f7733b9241a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45780
When training jobs running with NCCL fail, it is sometimes hard to
debug the reason for the failure, and our logging doesn't always provide enough
information to narrow down the issue.
To improve the debugging experience, I've enhanced our logging to add a lot
more information about what the ProcessGroup is doing under the hood.
#Closes: https://github.com/pytorch/pytorch/issues/45310
Sample output:
```
> I1002 15:18:48.539551 1822062 ProcessGroupNCCL.cpp:528] [Rank 2] NCCL watchdog thread started!
> I1002 15:18:48.539533 1821946 ProcessGroupNCCL.cpp:492] [Rank 2] ProcessGroupNCCL initialized with following options:
> NCCL_ASYNC_ERROR_HANDLING: 0
> NCCL_BLOCKING_WAIT: 1
> TIMEOUT(ms): 1000
> USE_HIGH_PRIORITY_STREAM: 0
> I1002 15:18:51.080338 1822035 ProcessGroupNCCL.cpp:530] [Rank 1] NCCL watchdog thread terminated normally
> I1002 15:18:52.161218 1821930 ProcessGroupNCCL.cpp:385] [Rank 0] Wrote aborted communicator id to store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:18:52.161238 1821930 ProcessGroupNCCL.cpp:388] [Rank 0] Caught collective operation timeout for work: WorkNCCL(OpType=ALLREDUCE, TensorShape=[10], Timeout(ms)=1000)
> I1002 15:18:52.162120 1821957 ProcessGroupNCCL.cpp:530] [Rank 0] NCCL watchdog thread terminated normally
> I1002 15:18:58.539937 1822062 ProcessGroupNCCL.cpp:649] [Rank 2] Found key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, from rank: 0, aborting appropriate communicators
> I1002 15:19:34.740937 1822062 ProcessGroupNCCL.cpp:662] [Rank 2] Aborted communicators for key in store: NCCLABORTEDCOMM:a0e17500002836080c8384c50000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> I1002 15:19:34.741678 1822062 ProcessGroupNCCL.cpp:530] [Rank 2] NCCL watchdog thread terminated normally
```
ghstack-source-id: 113731163
Test Plan: waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D24093032
fbshipit-source-id: 240b03562f8ccccc3d872538f5e331df598ceca7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44922
For NCCL send/recv operations, we will create NCCL communicator on demand following the same design as how it's currently done for collective operations.
ghstack-source-id: 113592757
Test Plan: to add
Reviewed By: pritamdamania87
Differential Revision: D23773726
fbshipit-source-id: 0d47c29d670ddc07f7181e8485af0e02e2c9cfaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
ghstack-source-id: 113592785
Test Plan: unittest
Reviewed By: jiayisuse
Differential Revision: D23709848
fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45642
Prior to https://github.com/pytorch/pytorch/pull/45181, initializing an
NCCL process group would work even if no GPUs were present. However, now that
init_process_group calls `barrier()`, this fails.
In general, the problem was that we could initialize ProcessGroupNCCL without
GPUs, and then if we called a method like `barrier()` the process would crash,
since we do % numGPUs, resulting in division by zero.
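A minimal sketch of the guard, with illustrative names:
```
#include <c10/core/Device.h>
#include <c10/cuda/CUDAFunctions.h>
#include <c10/util/Exception.h>

// Pick a device for the barrier tensor; without the check this would be a
// division by zero (and a crash) on hosts with no visible GPUs.
c10::Device deviceForBarrier(int rank) {
  const auto numGPUs = c10::cuda::device_count();
  TORCH_CHECK(
      numGPUs > 0,
      "ProcessGroupNCCL requires at least one CUDA-capable device, but none were found.");
  return c10::Device(
      c10::DeviceType::CUDA, static_cast<c10::DeviceIndex>(rank % numGPUs));
}
```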
ghstack-source-id: 113490343
Test Plan: waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D24038839
fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45401
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112997162
Test Plan:
Modified the existing get/set test to use delete, and verified that the
correct keys were deleted and that the numKeys API returned the right values.
Reviewed By: mrshenli
Differential Revision: D23955730
fbshipit-source-id: 5c9f82be34ff4521c59f56f8d9c1abf775c67f9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43963
Added a DeleteKey API for the TCP Store
ghstack-source-id: 112939762
Test Plan:
Modified the existing get/set test to use delete, and verified that the
correct keys were deleted and that the numKeys API returned the right values.
Reviewed By: jiayisuse
Differential Revision: D23009117
fbshipit-source-id: 1a0d95b43d79e665a69b2befbaa059b2b50a1f66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43962
TCPStore needs a getNumKeys API for our logging needs.
ghstack-source-id: 112939761
Test Plan: Adding tests to C++ Store Tests
Reviewed By: pritamdamania87
Differential Revision: D22985085
fbshipit-source-id: 8a0d286fbd6fd314dcc997bae3aad0e62b51af83
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43796
This diff adds an option for the process group NCCL backend to pick high priority cuda streams.
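A sketch of the option (illustrative; in the real code this is plumbed through the process group's options):
```
#include <c10/cuda/CUDAStream.h>

// When the process group is configured for high-priority streams, draw the
// NCCL stream from the high-priority pool instead of the default pool.
c10::cuda::CUDAStream pickNcclStream(bool isHighPriority, c10::DeviceIndex device) {
  return c10::cuda::getStreamFromPool(isHighPriority, device);
}
```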
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D23404286
fbshipit-source-id: b79ae097b7cd945a26e8ba1dd13ad3147ac790eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44411
This basically aborts errored NCCL communicators if either blocking
wait or async error handling is enabled. Without this change, we could abort NCCL
communicators even when neither is enabled, which may result in subsequent GPU
operations using corrupted data.
ghstack-source-id: 111839264
Test Plan: Successful Flow run: f217591683
Reviewed By: jiayisuse
Differential Revision: D23605382
fbshipit-source-id: 6c16f9626362be3b0ce2feaf0979b2dff97ce61b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44347
Cloned from Pull Request resolved: https://github.com/pytorch/pytorch/pull/44097, because the original author Sinan has completed the internship and now is unable to submit this diff.
As johnsonpaul mentioned in D23277575 (7d517cf96f), it looks like all processes were allocating memory on GPU-ID=0.
I was able to reproduce it by running `test_ddp_comm_hook_allreduce_with_then_hook_nccl` unit test of `test_c10d.py` and running `nvidia-smi` while test was running. The issue was reproduced as:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3132563 C python 777MiB |
| 0 3132564 C python 775MiB |
| 4 3132564 C python 473MiB |
+-----------------------------------------------------------------------------+
```
I realized that as we initialize ProcessGroupNCCL both processes were initially allocating memory on GPU 0.
We later also realized that I forgot the `isHighPriority` input of `getStreamFromPool`, so `futureNCCLCallbackStreams_.push_back(std::make_shared<at::cuda::CUDAStream>(at::cuda::getStreamFromPool(device_index)));` was just creating a vector of GPU 0 streams. After I changed `at::cuda::getStreamFromPool(device_index)` to `at::cuda::getStreamFromPool(false, device_index)`, `nvidia-smi` looked like:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 673925 C python 771MiB |
| 0 673926 C python 771MiB |
| 1 673925 C python 771MiB |
| 1 673926 C python 771MiB |
| 2 673925 C python 771MiB |
| 2 673926 C python 771MiB |
| 3 673925 C python 771MiB |
| 3 673926 C python 771MiB |
| 4 673925 C python 771MiB |
| 4 673926 C python 771MiB |
| 5 673925 C python 771MiB |
| 5 673926 C python 771MiB |
| 6 673925 C python 771MiB |
| 6 673926 C python 771MiB |
| 7 673925 C python 707MiB |
| 7 673926 C python 623MiB |
+-----------------------------------------------------------------------------+
```
This confirms that we were just getting GPU 0 streams for the callback. I think this does not explain the `fp16_compress` stability issue, because we were able to reproduce that even without any then callback and just calling copy from fp32 to fp16 before allreduce. However, this can explain other issues where `allreduce` was not on par with `no_hook`. I'll run some additional simulations with this diff.
I tried to replace `getStreamFromPool` with `getDefaultCUDAStream(deviceIndex)` and it wasn't causing additional memory usage. In this diff, I temporarily solved the issue by just initializing null pointers for each device in the constructor and setting the callback stream for the corresponding devices inside `ProcessGroupNCCL::getNCCLComm`. After the fix it looks like the memory issue was resolved:
```
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2513142 C python 745MiB |
| 4 2513144 C python 747MiB |
+-----------------------------------------------------------------------------+
```
I could use a dictionary instead of a vector for `futureNCCLCallbackStreams_`, but since the number of devices is fixed, I think it isn't necessary. Please let me know what you think in the comments.
ghstack-source-id: 111485483
Test Plan:
`test_c10d.py` and some perf tests. Also check `nvidia-smi` while running tests to validate memory looks okay.
This diff also fixes the regression in HPC tests as we register a hook:
{F322730175}
See https://fb.quip.com/IGuaAbD8bnvy (474fdd7e2d) for details.
Reviewed By: pritamdamania87
Differential Revision: D23495436
fbshipit-source-id: ad08e1d94343252224595d7c8a279fe75e244822
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43384
Much like the FileStoreTests, the HashStoreTests were also run in a single blob and threw exceptions upon failure. This modularizes the test by separating each function into separate gtest test cases.
ghstack-source-id: 111690834
Test Plan: Confirmed that the tests pass on devvm.
Reviewed By: jiayisuse
Differential Revision: D23257579
fbshipit-source-id: 7e821f0e9ee74c8b815f06facddfdb7dc2724294
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43383
FileStore Test currently has a large blob of tests that throw
exceptions upon failure. This PR modularizes each test so they can run
independently, and migrates the framework to gtest.
ghstack-source-id: 111690831
Test Plan: Confirmed tests pass on devvm
Reviewed By: jiayisuse
Differential Revision: D22879473
fbshipit-source-id: 6fa5468e594a53c9a6b972757068dfc41645703e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43382
StoreTestCommon defines standard helper functions that are used by all of our Store tests. These helpers currently throw exceptions upon failure; this PR changes them to use gtest assertions instead.
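A sketch of what one such helper looks like after the change (the helper name is illustrative):
```
#include <string>
#include <gtest/gtest.h>
#include <c10d/Store.hpp>

// Previously this kind of helper threw on a mismatch; with gtest it records a
// test failure instead.
void checkKey(c10d::Store& store, const std::string& key, const std::string& expected) {
  const auto data = store.get(key);
  const std::string actual(data.begin(), data.end());
  EXPECT_EQ(actual, expected) << "unexpected value for key " << key;
}
```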
ghstack-source-id: 111690833
Test Plan: Tested the 2 PR's above this on devvm
Reviewed By: jiayisuse
Differential Revision: D22828156
fbshipit-source-id: 9e116cf2904e05ac0342a441e483501e00aad3dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44163
In this PR, we introduce a new environment variable
(NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling
feature. We intend to eventually turn this feature on by default for all users,
but this is a temporary solution so the change in behavior from hanging to
crashing is not the default for users all of a sudden.
ghstack-source-id: 111637788
Test Plan:
CI/Sandcastle. We will turn on this env var by default in
torchelastic and HPC trainer soon.
Reviewed By: jiayisuse
Differential Revision: D23517895
fbshipit-source-id: e7cd244b2ddf2dc0800ff7df33c73a6f00b63dcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41054
**This Commit:**
ProcessGroupNCCL destructor now blocks until all WorkNCCL objects have either been aborted or completed and removed from the work vector.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614314
Test Plan:
1. **DDP Sanity Check**: First we have a sanity check based on the PyTorch DDP benchmark. This verifies that the baseline DDP training with NCCL for standard CU workloads works well (esp. with standard models like Resnet50 and BERT). Here is a sample Flow: f213293473
1. **HPC Performance Benchmarks**: This stack has undergone thorough testing and profiling on the Training Cluster with varying number of nodes. This introduces 1-1.5% QPS regression only (~200-400 QPS regression for 8-64 GPUs).
1. **HPC Accuracy Benchmarks**: We've confirmed NE parity with the existing NCCL/DDP stack without this change.
1. **Kernel-Specific Benchmarks**: We have profiled other approaches for this system (such as cudaStreamAddCallback) and performed microbenchmarks to confirm the current solution is optimal.
1. **Sandcastle/CI**: Apart from the recently fixed ProcessGroupNCCL tests, we will also introduce a new test for desynchronization scenarios.
Reviewed By: jiayisuse
Differential Revision: D22054298
fbshipit-source-id: 2b95a4430a4c9e9348611fd9cbcb476096183c06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41053
**This Commit:**
Some minor refactoring - added helper to check if `WorkNCCL` objects have timed out. Adding a new finish function to ProcessGroupNCCL::WorkNCCL that avoids notifying CV and uses `lock_guard`. Also renaming the timeoutCVMutex mutex to be more descriptive.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614315
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21943520
fbshipit-source-id: b27ee329f0da6465857204ee9d87953ed6072cbb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41052
**This Commit:**
Watchdog Thread checks for error-ed or timed out `WorkNCCL` objects and aborts all associated NCCL Communicators. For now, we also process these aborted communicators as with the existing Watchdog logic (by adding them to abortedCommIds and writing aborted communicator ids to the store.)
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614313
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21943151
fbshipit-source-id: 337bfcb8af7542c451f1e4b3dcdfc5870bdec453
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41051
**This Commit:**
In the workCleanupThread, we process completion and exception handling for workNCCL objects corresponding to collective calls that have either completed GPU Execution, or have already thrown an exception. This way, we throw an exception from the workCleanupThread for failed GPU operations. This approach replaces the previous (and lower performance) approach of enqueuing a callback on the CUDA stream to process failures.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
ghstack-source-id: 111614319
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21938498
fbshipit-source-id: df598365031ff210afba57e0c7be865e3323ca07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41050
**This Commit:**
We introduce a workVector to track live workNCCL objects corresponding to collective operations. Further, we introduce a workCleanupLoop, which busy-polls the vector of workNCCL objects and removes them upon completion.
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang due to waiting on an unresponsive worker. This stack detects such hanging behavior and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
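A simplified sketch of the busy-polling cleanup loop described in this commit (names are illustrative; the real implementation has its own members, locking, and exception handling):
```
#include <atomic>
#include <chrono>
#include <iterator>
#include <list>
#include <mutex>
#include <thread>

struct Work {
  std::atomic<bool> completed{false};
  bool isCompleted() const { return completed.load(); }  // stand-in for querying CUDA events
};

// Busy-poll the tracked work objects and drop the ones that have finished.
void workCleanupLoop(std::list<Work>& workList, std::mutex& mu, std::atomic<bool>& stop) {
  while (!stop.load()) {
    {
      std::lock_guard<std::mutex> lock(mu);
      for (auto it = workList.begin(); it != workList.end();) {
        it = it->isCompleted() ? workList.erase(it) : std::next(it);
      }
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```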
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21916637
fbshipit-source-id: f8cadaab0071aaad1c4e31f9b089aa23cba0cfbe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42488
Currently, ProcessGroupGloo tests do not emit logs if the test was
skipped due to CUDA not being available or not enough CUDA devices. This PR clarifies
the reason for skipping through these logs.
ghstack-source-id: 111638111
Test Plan: tested on devvm and devgpu
Reviewed By: jiayisuse
Differential Revision: D22879396
fbshipit-source-id: d483ca46b5e22ed986521262c11a1c6dbfbe7efd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43386
Resolves #43178
ghstack-source-id: 111109716
Test Plan: Added checks to existing unit test and ran it on gpu devserver.
Reviewed By: rohan-varma
Differential Revision: D23216393
fbshipit-source-id: fed5e37fbabbd2ac4a9055b20057fffe3c416c0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684
This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.
As part of this change, there is a significant change to the Future API where we
now only accept an exception_ptr as part of setError.
For the example in #42560, the exception trace would now look like:
```
> Traceback (most recent call last):
> File "test_autograd.py", line 6914, in test_preserve_backtrace
> Foo.apply(t).sum().backward()
> File "torch/tensor.py", line 214, in backward
> torch.autograd.backward(self, gradient, retain_graph, create_graph)
> File "torch/autograd/__init__.py", line 127, in backward
> allow_unreachable=True) # allow_unreachable flag
> File "torch/autograd/function.py", line 87, in apply
> return self._forward_cls.backward(self, *args)
> File "test_autograd.py", line 6910, in backward
> raise ValueError("something")
> ValueError: something
```
ghstack-source-id: 111109637
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D23365408
fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43447
Two main better-engineering motivations to run all FutureNCCL callbacks on a dedicated stream:
1. Each time a then callback was called, we would get a stream from the pool and run the callback on that stream. If we observe the stream traces using that approach, we would see a lot of streams and debugging would become more complicated. If we have a dedicated stream to run all then callback operations, the trace results will be much cleaner and easier to follow.
2. getStreamFromPool may eventually return the default stream or a stream that is used for other operations. This can cause slowdowns.
Unless the then callback takes longer than the preceding allreduce, this approach will be as performant as the previous one.
ghstack-source-id: 110909401
Test Plan:
Perf trace runs to validate the desired behavior:
See the dedicated stream 152 is running the then callback operations:
{F299759342}
I run pytorch.benchmark.main.workflow using resnet50 and 32 GPUs registering allreduce with then hook.
See f213777896 [traces](https://www.internalfb.com/intern/perfdoctor/results?run_id=26197585)
After updates, same observation: see f214890101
Reviewed By: malfet
Differential Revision: D23277575
fbshipit-source-id: 67a89900ed7b70f3daa92505f75049c547d6b4d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869
We realized that when we invoke a simple callback that divides the tensors by `world_size` after `allreduce`, the performance was almost 50% lower in terms of QPS compared to the case where a simple `allreduce` hook is used with no `then` callback.
The main problem was that, as we called `work.wait()` before invoking the `then` callback, we were synchronizing `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609) and stalling the backward computation.
In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream, not the default stream.
ghstack-source-id: 110208431
Test Plan: Run performance benchmark tests to validate performance issue is resolved. Also, `python test/distributed/test_c10d.py` to avoid any odd issues.
Reviewed By: pritamdamania87
Differential Revision: D23055807
fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42706
Different backends accept different length types (e.g., MPI_Alltoallv, ncclSend/Recv(), gloo::alltoallv()), so this change makes computeLengthsAndOffsets() a template.
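A hedged sketch of the templated helper (the name matches the summary, but the exact signature in the real code may differ):
```
#include <cstddef>
#include <cstdint>
#include <vector>

// T is the length/offset type expected by the backend, e.g. int for
// MPI_Alltoallv, size_t for ncclSend/ncclRecv bookkeeping, or int64_t for
// gloo::alltoallv.
template <typename T>
void computeLengthsAndOffsets(
    const std::vector<int64_t>& splitSizes,
    int64_t elementsPerSplitUnit,
    std::vector<T>* lengths,
    std::vector<T>* offsets) {
  lengths->resize(splitSizes.size());
  offsets->resize(splitSizes.size());
  T offset = 0;
  for (std::size_t i = 0; i < splitSizes.size(); ++i) {
    const T length = static_cast<T>(splitSizes[i] * elementsPerSplitUnit);
    (*lengths)[i] = length;
    (*offsets)[i] = offset;
    offset += length;
  }
}
```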
Test Plan:
Sandcastle
CI
HPC: ./trainer_cmd.sh -p 16 -n 8 -d nccl
Reviewed By: osalpekar
Differential Revision: D22961459
fbshipit-source-id: 45ec271f8271b96f2dba76cd9dce3e678bcfb625
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42211
Helper functions for launching CUDA Sleep and Tensor Value Initialization for the collective test functions.
This is more of a code cleanup fix compared to the previous diffs.
ghstack-source-id: 109097243
Test Plan: working on devGPU and devvm
Reviewed By: jiayisuse
Differential Revision: D22782671
fbshipit-source-id: 7d88f568a4e08feae778669affe69c8d638973db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42209
This PR adds a TearDown function to the testing superclass to ensure that the NCCL_BLOCKING_WAIT environment variable is reset after each test case.
ghstack-source-id: 109097247
Test Plan: Working on devGPU and devvm.
Reviewed By: jiayisuse
Differential Revision: D22782672
fbshipit-source-id: 8f919a96d7112f9f167e90ce3df59886c88f3514
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42208
ProcessGroupNCCLTest is currently written without any testing framework, and all tests are simply called from the main function and throw exceptions upon failure. As a result, it is hard to debug and pinpoint which tests have succeeded/failed.
This PR moves ProcessGroupNCCLTest to gtest with appropriate setup and skipping functionality in the test superclass.
ghstack-source-id: 109097246
Test Plan: Working Correctly on devGPU and devvm.
Reviewed By: jiayisuse
Differential Revision: D22782673
fbshipit-source-id: 85bd407f4534f3d339ddcdd65ef3d2022aeb7064
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42514
Add Alltoall and Alltoallv to PT NCCL process group using NCCL Send/Recv.
Reviewed By: mrshenli
Differential Revision: D22917967
fbshipit-source-id: 402f2870915bc237845864a4a27c97df4351d975
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42313
Changes the tests in `ProcessGroupGlooAsyncTest.cpp` to use the Gtest testing framework.
Reviewed By: malfet
Differential Revision: D22821577
fbshipit-source-id: 326b24a334ae84a16434d0d5ef27d16ba4b90d5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42143
Replaces the original makeshift error messages in ProcessGroupNCCLTest
with more appropriate ones.
ghstack-source-id: 108711579
Test Plan: Ran the tests on DevGPU
Reviewed By: mrshenli
Differential Revision: D22778505
fbshipit-source-id: 27109874f0b474a74b09f588cf6e7528d2069702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42192
This PR fixes the complicated skipping logic for ProcessGroupNCCLErrors Tests - it correctly logs the reason for skipping tests when GPUs are not available or the NCCL version is too old.
This is part of a broader effort to improve the testing of the ProcessGroup and Collectives tests.
ghstack-source-id: 108620568
Test Plan: Tested on devGPU and devvm. Tests are run correctly on GPU and skipped on CPU as expected.
Reviewed By: mrshenli
Differential Revision: D22782856
fbshipit-source-id: 6071dfdd9743f45e59295e5cee09e89c8eb299c9
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/41318 pushed to ci-all branch.
Original description:
Closes https://github.com/pytorch/pytorch/issues/24137.
This PR adds support for the torch.bool tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since bool is not supported as a native ncclDataType_t, we add the following logic:
Map at::kBool to ncclUint8
During reduction (allreduce for example), if the operation is SUM, we instead override it to a MAX, to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and
Note that this PR doesn't add support for BAND/BOR/BXOR. That is because these reduction ops currently are not supported by NCCL backend, see https://github.com/pytorch/pytorch/issues/41362
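A sketch of the two mappings described above, with illustrative helper names (not the exact ProcessGroupNCCL code):
```
#include <stdexcept>
#include <nccl.h>
#include <ATen/ATen.h>

ncclDataType_t toNcclDataType(at::ScalarType type) {
  if (type == at::kBool) {
    return ncclUint8;  // bool has no native ncclDataType_t
  }
  // ... the existing mapping handles the remaining scalar types ...
  throw std::runtime_error("unsupported scalar type for NCCL in this sketch");
}

ncclRedOp_t toNcclRedOp(at::ScalarType type, ncclRedOp_t op) {
  // For bool, SUM and MAX are both effectively a bitwise OR, but MAX cannot
  // overflow the underlying uint8 representation.
  if (type == at::kBool && op == ncclSum) {
    return ncclMax;
  }
  return op;
}
```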
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41959
Reviewed By: mrshenli
Differential Revision: D22719665
Pulled By: rohan-varma
fbshipit-source-id: 8bc4194a8d1268589640242277124f277d2ec9f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41945
This test previously did a thread sleep before launching the allgather operation, and then waited on the work object. Since the sleep was done before the work object was created, it did not affect the allgather call, and thus, did not test work-level timeouts as intended.
I am removing this test for now. In the future we can add this test back, but would need to somehow inject a `cudaSleep` call before the allgather (so the collective operation itself is delayed). This may require overriding the `ProcessGroupNCCL::collective`, so it's a bit more heavy-weight.
In the meantime, we can remove this test - work-level timeouts are still thoroughly tested with Gloo.
ghstack-source-id: 108370178
Test Plan: Ran ProcessGroupNCCL tests on devGPU
Reviewed By: jiayisuse
Differential Revision: D22702291
fbshipit-source-id: a36ac3d83abfab6351c0476046a2f3b04a80c44d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596
We've modified the previous design of `convert_dist_work_to_future` API in the GH Issue [#39272](https://github.com/pytorch/pytorch/issues/39272).
1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version, the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.
`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is supported for CUDA > 10 and may not be deprecated until there's a reasonable alternative available according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
ghstack-source-id: 108409748
Test Plan:
Run old python test/distributed/test_c10d.py.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case of no hook registered. Without the then callback, the future_value in the reducer is no longer a PyObject, and this unit test verifies future_value is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.
As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests
OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).
Reviewed By: izdeby
Differential Revision: D22583690
fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41318
Closes https://github.com/pytorch/pytorch/issues/24137.
This PR adds support for the `torch.bool` tensor type to ProcessGroupNCCL. For most types we use the existing mapping, but since `bool` is not supported as a native `ncclDataType_t`, we add the following logic:
1) Map `at::kBool` to `ncclUint8`
2) During reduction (allreduce for example), if the operation is SUM, we instead override it to a MAX, to avoid overflow issues. The rest of the operations work with no changes. In the boolean case, changing sum to max makes no correctness difference since they both function as a bitwise OR.
The reduction logic (for example for reduce/allreduce) is as follows:
sum, max = bitwise or
product, min = bitwise and
Tests are added to ensure that the reductions work as expected.
ghstack-source-id: 108315417
Test Plan: Added unittests
Reviewed By: mrshenli
Differential Revision: D22496604
fbshipit-source-id: a1a15381ec41dc59923591885d40d966886ff556