Commit Graph

143 Commits

Author SHA1 Message Date
Andrew Gu
59dd84cab6 [Join][BE] Fix typo; remove obsolete method (#72886)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72886

**Test Plan**
Searching for `_schedule_shadow_all_reduce_for_fwd_pass` shows that it is defined but never used.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34255651

Pulled By: awgu

fbshipit-source-id: 205a0325c2cdc05e127a183cb86fa2fc2e0db99d
(cherry picked from commit 4492f03a3f)
2022-02-16 15:03:09 +00:00
Rohan Varma
aeacf910b5 [Checkpoint] Rename file (#72748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72748

Removes underscore from file/class as directory is already private
ghstack-source-id: 149109295

Test Plan: Ci

Reviewed By: samdow

Differential Revision: D34179308

fbshipit-source-id: 8e956f3c83f21159c5e0fcdce09624ecb8a73ac0
(cherry picked from commit adfd8bc357)
2022-02-16 00:08:23 +00:00
wayi1
8b08478115 Fix the doc of PostLocalSGDState (#72792)
Summary:
The first arg of the `PostLocalSGDState` constructor, `process_group`, cannot be left empty. To simplify the usage here, the example does not even create a subgroup explicitly.

See the example in unit test: 4feef6c970/torch/testing/_internal/distributed/distributed_test.py (L4260)
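
For reference, a minimal sketch of the simplified usage described above; the hyperparameters and the `ddp_model` name are placeholders, not taken from the linked test:

```python
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import post_localSGD_hook as post_localSGD

# Assumes dist.init_process_group(...) has already been called and that
# `ddp_model` (placeholder) is an existing DistributedDataParallel instance.
state = post_localSGD.PostLocalSGDState(
    process_group=dist.group.WORLD,  # the process group must be provided
    subgroup=None,                   # no subgroup is created explicitly
    start_localSGD_iter=100,         # switch to local SGD after 100 iterations
)
ddp_model.register_comm_hook(state, post_localSGD.post_localSGD_hook)
```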

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72792

Reviewed By: samdow

Differential Revision: D34213221

Pulled By: rohan-varma

fbshipit-source-id: 078343f3ee138e175bf835897f190032eb970662
(cherry picked from commit bf90af704f)
2022-02-15 23:47:12 +00:00
Yuxin Wu
1ed4653e89 Stop writing logs to root logger (#72649)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/72648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649

Reviewed By: soulitzer

Differential Revision: D34172113

Pulled By: mrshenli

fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf
(cherry picked from commit c14297cee6)
2022-02-11 21:30:53 +00:00
Brian Muse
8bf3179f6e #71946 Remove Python 3.6 references (#72211)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71946

This commit removes some bits of code that were hard coded for Python 3.6 support from the `.circleci` and `torch` folders. It should only be merged if https://github.com/pytorch/pytorch/issues/66462 is complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72211

Reviewed By: dagitses, seemethere

Differential Revision: D33982604

Pulled By: musebc

fbshipit-source-id: 8f453bf9909df615addd59538adb369c65484044
(cherry picked from commit 944a9970fe)
2022-02-08 03:46:20 +00:00
Omar
25f9fe22a9 [PowerSGD] Add orthogonalization with QR factorization (#72043)
Summary:
### 🚀 The feature, motivation and pitch
Following the discussion in https://github.com/pytorch/pytorch/issues/65813, I added the QR factorization to powerSGD_hook.py
Gram-Schmidt orthogonalization can't be fully replaced because _torch.linalg.qr_ doesn't work with half precision. Moreover, in my tests, Gram-Schmidt runs faster when the rank is less than 3.
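
For illustration, a minimal sketch of the dtype-based dispatch described above (not the actual hook code; the helper name and epsilon are assumptions):

```python
import torch

def orthogonalize(matrix: torch.Tensor, eps: float = 1e-8) -> None:
    """Orthogonalize the columns of `matrix` in place."""
    if matrix.dtype in (torch.float32, torch.float64):
        # torch.linalg.qr handles single/double precision and tends to win at higher ranks.
        q, _ = torch.linalg.qr(matrix)
        matrix.copy_(q)
    else:
        # Fall back to Gram-Schmidt for half precision, where torch.linalg.qr is unavailable.
        num_cols = matrix.shape[1]
        for i in range(num_cols):
            col = matrix[:, i : i + 1]
            col.div_(torch.norm(col) + eps)
            if i + 1 < num_cols:
                rest = matrix[:, i + 1 :]
                rest.sub_(col @ (col.t() @ rest))

orthogonalize(torch.randn(1024, 4))  # example call on a tall low-rank factor
```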

This is one sample experiment timing powerSGD_hook on ResNext101 with the two different methods:
![Screenshot from 2022-01-31 18-14-00](https://user-images.githubusercontent.com/42100908/151840929-270c67dd-9fe7-4f11-8e70-8bf2d0ba678d.png)

### Alternatives
Use _torch.orgqr(*torch.geqrf(matrix))_. In my tests, its performance is similar to _torch.linalg.qr_.

### Additional context
_No response_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72043

Reviewed By: albanD

Differential Revision: D34042781

Pulled By: cbalioglu

fbshipit-source-id: e331179d3b7ac40d445b651fc473b16ae4ead462
(cherry picked from commit f64bf3839a)
2022-02-07 21:15:40 +00:00
Yanli Zhao
2336571cb7 make fsdp folder to be public (#72084)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72084

Make the fsdp folder public
ghstack-source-id: 148173447

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D33903417

fbshipit-source-id: 7852a2adc4af09af48a5ffa52ebf210489f834d5
(cherry picked from commit bd06513cfe)
2022-02-02 15:50:14 +00:00
Rohan Varma
8fa5cde3a9 Fix hooks (#71970)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71970

- Provide default arg for power SGD convenience wrapper that matches the main API default

Test Plan: CI

Reviewed By: H-Huang

Differential Revision: D33837457

fbshipit-source-id: 8f4efab4992b3fff09456a18db2c83e087c25bdf
(cherry picked from commit 83f52fb3c7)
2022-01-28 23:07:33 +00:00
Rohan Varma
bdcdf94bdd [Opt Overlap] Clean up code in _OptimizerHookState (#71620)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620

Remove `from_functional_optim` and make it the default constructor, since
that is now the only way `_OptimizerHookState` is built. Also, the
`create_functional_optim` helper function no longer needs to be exposed.
ghstack-source-id: 147577174

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33700593

fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14
(cherry picked from commit a50b2caf0e)
2022-01-26 19:33:49 +00:00
Rohan Varma
1c8fcc44cb [Opt Overlap] Support optimizing partial set of parameters (#71608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608

Per title
ghstack-source-id: 147577178

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33696382

fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
(cherry picked from commit 436b547fb0)
2022-01-26 19:33:49 +00:00
Rohan Varma
8273912a8c [Opt Overlap] Implement _OverlappedOptimizer (#71605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71605

ghstack-source-id: 147577173

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33692686

fbshipit-source-id: b0fdb45245d923e1de8fef4431d3e235ac57dcbf
(cherry picked from commit 8b83dbf690)
2022-01-26 07:32:04 +00:00
Rohan Varma
f5a71ec2d6 [Opt Overlap] Implement as_functional_optim and create_functional_optim (#71604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71604

Implement 2 helper functions:
- as_functional_optim which takes in a torch.optim class type and arguments and
  creates the corresponding functional optimizer.
- create_functional_optim which takes in the functional optimizer class type
  and constructs it. Note that as_functional_optim calls into
  create_functional_optim.

  The first will be used in future PRs as described in
  https://github.com/pytorch/pytorch/issues/67570 to create a functional
  optimizer from a traditional optimizer. The latter is used in
  _OptimizerHookState to create a functional optimizer.

  Both new helper functions are covered by unittests.
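
A rough sketch of the relationship between the two helpers (illustrative only; the registry and the `_allow_empty_param_list` flag below are assumptions about private internals):

```python
import torch
from torch.distributed.optim import _FunctionalAdam, _FunctionalSGD

# Hypothetical mapping from torch.optim classes to their functional counterparts.
_functional_optim_map = {
    torch.optim.SGD: _FunctionalSGD,
    torch.optim.Adam: _FunctionalAdam,
}

def create_functional_optim(functional_optim_cls, *args, **kwargs):
    # Construct the functional optimizer directly from its class type.
    return functional_optim_cls([], *args, _allow_empty_param_list=True, **kwargs)

def as_functional_optim(optim_cls, *args, **kwargs):
    # Translate a torch.optim class type into the corresponding functional
    # optimizer; note that this calls into create_functional_optim, as described above.
    functional_cls = _functional_optim_map[optim_cls]
    return create_functional_optim(functional_cls, *args, **kwargs)

functional_sgd = as_functional_optim(torch.optim.SGD, lr=0.01)
```
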
ghstack-source-id: 147577170

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33688995

fbshipit-source-id: 8b2daafd1b914efa90877cc4313aa9a428546fc1
(cherry picked from commit 42fdae2991)
2022-01-25 18:32:13 +00:00
Rohan Varma
281663955f [Opt Overlap] Create Optimizer Hook State directly from functional optim (#71602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71602

The design in https://github.com/pytorch/pytorch/issues/67570 requires
`_OptimizerHookState` to be created directly from a functional optimizer. Add
support and tests for this. Also refactor a few tests.
ghstack-source-id: 147577175

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33687477

fbshipit-source-id: f3c789aa77773f918e01a8d0cf08739b2edf07b3
(cherry picked from commit 4851e1c6d4)
2022-01-25 18:32:13 +00:00
Rohan Varma
9b3a56eecf [Optimizer Overlap] Move hooks to own file (#71601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71601

Moves current prototype optimizer overlap to its own file for a better
namespace. No code changes besides a few comment fixes. Note that this code is
still prototype and not expected to be used by an end user.
ghstack-source-id: 147458826

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D33662678

fbshipit-source-id: 3cc931323230a4b66c02b9e6f744aaf5c48d4d34
(cherry picked from commit 5070595c7f)
2022-01-23 00:04:32 +00:00
Rohan Varma
d8abe813bc [LocalSGD] Move feature to Beta, clean up some docs (#71621)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71621

Moves this feature to beta as discussed, and cleans up some docs.
Synced offline with wayi1 who mentioned that the current names are preferred
as he works to prototype hierarchical allreduce as discussed in this RFC: https://github.com/pytorch/pytorch/issues/71325.
ghstack-source-id: 147382940

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33700444

fbshipit-source-id: 8eb543f5b02a119d0790a5c0919e6def6383a067
(cherry picked from commit 656e9809b2)
2022-01-21 21:10:42 +00:00
Omar Younis
569aeec1bc fix typo in debugging_hooks.py (#70956)
Summary:
I just fixed a small typo in the debugging_hooks documentation

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70956

Reviewed By: jbschlosser

Differential Revision: D33508898

Pulled By: dagitses

fbshipit-source-id: fc5935e5a2e2ddc45657a22d3b33a11aba378d9b
2022-01-10 12:59:42 -08:00
Yi Wang
ed50a35cf8 [Model Averaging] Update the documentation of PeriodicModelAverager (#70974)
Summary:
Here 20 is a bad example, since the warmup step is set to 100; using 200 iterations makes much more sense.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70974

Reviewed By: dagitses

Differential Revision: D33474576

Pulled By: rohan-varma

fbshipit-source-id: 4c7043108897848bde9503d77999971ad5567aa6
2022-01-07 13:20:42 -08:00
Rohan Varma
a197f3fe52 [FSDP/Checkpoint] Activation offload support in checkpoint_wrapper (#70165)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70165

Implements activation offload support in checkpoint_wrapper API via
save_on_cpu hooks. We avoid modifying the torch.utils.checkpoint implementation
and instead compose offload + checkpoint using the save_on_cpu hook for the
former.
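
A sketch of the composition (placeholder function name; not the actual wrapper code):

```python
import torch
from torch.autograd.graph import save_on_cpu
from torch.utils.checkpoint import checkpoint

def offloaded_checkpoint(module, *inputs):
    # Tensors saved for backward inside this context (e.g. the inputs stashed by
    # checkpointing) are moved to CPU and copied back to their device in backward,
    # without modifying torch.utils.checkpoint itself.
    with save_on_cpu(pin_memory=True):
        return checkpoint(module, *inputs)

layer = torch.nn.Linear(1024, 1024)
x = torch.randn(2, 1024, requires_grad=True)
offloaded_checkpoint(layer, x).sum().backward()
```
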
ghstack-source-id: 146078900

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D33228820

fbshipit-source-id: 98b4da0828462c41c381689ee07360ad014e808a
2021-12-21 10:08:18 -08:00
Rohan Varma
79a40b22aa [Checkpoint] Make checkpoint_wrapper an nn.Module (#70164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70164

Implement Alban's suggestion to make checkpoint_wrapper an nn.Module
instead of patching the forward pass, which is too hacky.
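
A minimal sketch of the nn.Module-based approach (class and attribute names are illustrative, not the exact implementation):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointWrapper(nn.Module):
    """Run the wrapped module's forward under activation checkpointing,
    instead of monkey-patching the wrapped module's forward."""

    def __init__(self, module: nn.Module):
        super().__init__()
        self._wrapped_module = module

    def forward(self, *args):
        return checkpoint(self._wrapped_module, *args)

# Wrap once at construction time; callers never invoke checkpoint() themselves.
block = CheckpointWrapper(nn.Sequential(nn.Linear(16, 16), nn.ReLU()))
out = block(torch.randn(4, 16, requires_grad=True))
```
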
ghstack-source-id: 146011215

Test Plan: IC

Reviewed By: mrshenli

Differential Revision: D33214696

fbshipit-source-id: dc4b3e928d66fbde828ab60d90b314a8048ff7a2
2021-12-20 13:22:28 -08:00
Rohan Varma
c4281cc92d Prototype checkpoint_wrapper (#69955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69955

Implements a checkpoint_wrapper function, which wraps an nn.Module with checkpointing so users won't have to call checkpoint() every time they want to checkpoint the module.

Currently only support for reentrant-based checkpointing is added and only tested with FSDP to unblock a use case.

Future work is to add support for the new checkpointing API, add more tests, and upstream to torch.utils.checkpoint.
ghstack-source-id: 145811242

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D33107276

fbshipit-source-id: c4a1c68d71d65713a929994940a8750f73fbdbdb
2021-12-16 09:59:19 -08:00
Wanchao Liang
7c6a8a47db [BE] minor improvement to dist quantization (#67401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67401

Some minor changes to dist quantization, mainly changing the namespace and adding some notes for future code dedup.
ghstack-source-id: 143910067

Test Plan: wait for ci

Reviewed By: mrshenli

Differential Revision: D31979269

fbshipit-source-id: 85a2f395e6a3487dd0b9d1fde886eccab106e289
2021-11-21 23:31:59 -08:00
Michael Suo
f50bf16c04 Revert D31663043: [BE] minor improvement to dist quantization
Test Plan: revert-hammer

Differential Revision:
D31663043

Original commit changeset: 2f96b7346e9c

fbshipit-source-id: d38684dfe79ca335fbbe624496ad4c86c29d1570
2021-10-22 16:37:41 -07:00
Wanchao Liang
7379d4db20 [BE] minor improvement to dist quantization (#66649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66649

Some minor changes to dist quantization, mainly changing the namespace and adding some notes for future code dedup.
ghstack-source-id: 141336191

Test Plan: wait for ci

Reviewed By: cbalioglu

Differential Revision: D31663043

fbshipit-source-id: 2f96b7346e9c90df5ab2536767f8301eb86a9c79
2021-10-22 13:46:28 -07:00
Yi Wang
c1415a0a72 [Reland] [Model Averaging] Simplify PostLocalSGD Optimizer API (#65197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197

1. The constructor accepts a local optimizer instance instead of the inputs of local optimizer constructor and the class type.
2. The parameters are read from local optimizer's param_groups instead of a separate input.

Proposal: https://github.com/pytorch/pytorch/issues/59699
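
A sketch of the simplified construction (hyperparameters and the `ddp_model` name are placeholders):

```python
import torch
from torch.distributed.optim import PostLocalSGDOptimizer
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager

# Assumes `ddp_model` (placeholder) is an existing DistributedDataParallel instance.
local_optim = torch.optim.SGD(ddp_model.parameters(), lr=0.01)  # any local optimizer instance
averager = PeriodicModelAverager(period=4, warmup_steps=100)

# New API: pass the local optimizer instance directly; its param_groups are reused.
opt = PostLocalSGDOptimizer(optim=local_optim, averager=averager)
```
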
ghstack-source-id: 138307226

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D31007439

fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722
2021-09-17 10:31:58 -07:00
Yi Wang
00e6e0c593 [Model Averaging] Revert #63895 (#64903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64903

Fix the accuracy regression caused by https://github.com/pytorch/pytorch/pull/63895.

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: rohan-varma

Differential Revision: D30894688

fbshipit-source-id: fe00b8b23b860d9f806f87c1b6caba1d0b807485
2021-09-14 09:45:42 -07:00
Yi Wang
bf9d66586c [DDP Comm Hook] Create a noop hook for performance debugging (#64344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64344

As title.

Additionally, avoid using numpy array in test_ddp_hooks.py.
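
Conceptually, a noop hook boils down to the following sketch (the shipped implementation may differ in detail):

```python
import torch

def noop_hook(_state, bucket) -> torch.futures.Future[torch.Tensor]:
    # Skip communication entirely and return the bucket's buffer unchanged,
    # giving an upper bound on the speedup attainable by removing allreduce.
    fut = torch.futures.Future()
    fut.set_result(bucket.buffer())
    return fut
```
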
ghstack-source-id: 137170449

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks -- test_ddp_comm_hook_noop_hook

Reviewed By: rohan-varma

Differential Revision: D30693220

fbshipit-source-id: e17f0d1c6198863cf20a53566f586a6bff602522
2021-09-01 17:36:22 -07:00
Marjan Fariborz
6a76ee04de Adding alltoall_single collective to collective quantization API (#63154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63154

The collective quantization API now supports alltoall, alltoall_single, and allscatter. The test is also included.
ghstack-source-id: 136856877

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/algorithms/quantization:DistQuantizationTests_nccl -- test_all_to_all_single_bfp16

Reviewed By: wanchaol

Differential Revision: D30255251

fbshipit-source-id: 856f4fa12de104689a03a0c8dc9e3ecfd41cad29
2021-08-27 12:46:31 -07:00
Marjan Fariborz
3b284ab024 Adding BFP16 quantization/dequantization support to OSS (#63059)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63059

Supporting the BFP16 quantization method in OSS. Currently only CPU is supported.
ghstack-source-id: 136639528

Test Plan: Imported from OSS

Reviewed By: wanchaol

Differential Revision: D30194538

fbshipit-source-id: ac248567ad8028457c2a91b77ef2ce81709fce53
2021-08-25 23:41:34 -07:00
Yi Wang
7edeead796 Add a comment on the potential implicit type up-casting (#63905)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63905

as title
ghstack-source-id: 136590703

Test Plan: N/A

Reviewed By: mrshenli

Differential Revision: D30527929

fbshipit-source-id: 69402bbfa87cfd8fc166ce313cde9736ee072589
2021-08-25 12:47:45 -07:00
Aayush Prakash
8a22d4fa5c [Reland] Replacing the p.data access in utils with tensor.set_. Passes both test_post_localSGD_optimizer_parity and test_periodic_model_averager tests (#63895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63895

When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.

The replacement is `tensor.set_`.
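
A before/after sketch of the pattern being replaced (illustrative values):

```python
import torch

param = torch.nn.Parameter(torch.randn(4))
averaged = torch.randn(4)

# Old pattern: writes through the soon-to-be-deprecated .data field.
param.data = averaged

# New pattern: swap in the averaged value with Tensor.set_ under no_grad.
with torch.no_grad():
    param.set_(averaged)
```
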
ghstack-source-id: 136593433

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity

Reviewed By: SciPioneer

Differential Revision: D30526178

fbshipit-source-id: a1ac0ec3665d8623edd5bf94f01c1132daff5c00
2021-08-25 11:12:55 -07:00
Edward Yang
699c764d2e Revert D30513613: Removing tensor.data usage in utils with tensor set_ method
Test Plan: revert-hammer

Differential Revision:
D30513613 (d08a36f831)

Original commit changeset: 402efb9c30fa

fbshipit-source-id: 911c66a9852de77dc5274b5fb373258c0c97739a
2021-08-24 12:20:37 -07:00
Aayush Prakash
d08a36f831 Removing tensor.data usage in utils with tensor set_ method (#63867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63867

When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future.

The replacement is `tensor.set_`.

ghstack-source-id: 136531233

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager

Reviewed By: SciPioneer

Differential Revision: D30513613

fbshipit-source-id: 402efb9c30fafc3f285bebc631639f656ceae585
2021-08-24 11:20:44 -07:00
Marjan Fariborz
c545b099aa Separating quantization test from distributed_test (#63058)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63058

Dedicating separate tests for different quantization methods. Currently supporting FP16 method.
ghstack-source-id: 136499767

Test Plan: buck test mode/dev //caffe2/test/distributed/algorithms/quantization:quantization_gloo_fork -- name_of_the_test

Reviewed By: wanchaol

Differential Revision: D30142580

fbshipit-source-id: 3aacec1a231a662067d2b48c001f0c69fefcdd60
2021-08-24 01:44:55 -07:00
Yinbin Ma
0d437fe6d0 BF16 allreduce hook (#63260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260

Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.

Reviewed By: SciPioneer

Differential Revision: D30238317

fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
2021-08-18 20:53:49 -07:00
Yi Wang
979180cd01 [Model Averaging] Allow subgroup to be None in PostLocalSGDState (#63277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63277

`PostLocalSGDState` requires a subgroup. To initialize this subgroup, a global process group must be initialized. However, this imposes a restriction that a hook state can only be provided after distributed environment initialization, which is not compatible with lightning DDP plugin setup where hook state should be provided before distributed environment initialization.

Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135848575

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD

Reviewed By: cbalioglu

Differential Revision: D30325041

fbshipit-source-id: 7b870166d096d306c3f2f7c69816a705cec0bebd
2021-08-16 10:07:41 -07:00
Andrew Gu
2d75703c6a Remove req to call step() in training loop (#63164)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63164

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284616

Pulled By: andwgu

fbshipit-source-id: afdb677fb08851b139178a9f6d782196f26773e1
2021-08-13 08:22:44 -07:00
Andrew Gu
bd81c9178a Simplify data structures, add uniform approximation, fix mem leak (#63162)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63162

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30284617

Pulled By: andwgu

fbshipit-source-id: 9bd9e5f89abcc0d3dac56b85d55cc88e843baa9f
2021-08-13 08:20:59 -07:00
Andrew Gu
1b1f1e36b4 Add `allow_empty_param_list` to functional optimizers (#62522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62522

Addresses https://github.com/pytorch/pytorch/issues/62481

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D30072074

Pulled By: andwgu

fbshipit-source-id: 1a5da21f9636b8d74a6b00c0f029427f0edff0e3
2021-08-09 11:18:56 -07:00
Marjan Fariborz
c7db642a72 Adding collective quantization API (#62142)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62142

Created a wrapper that takes the collective op and a quantization type as arguments. It quantizes the input, performs the collective op, and then performs dequantization.
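
A conceptual sketch of the wrapper pattern (the quantize/dequantize helpers and the wrapper name below are placeholders, not the actual internal functions):

```python
import functools
import torch
import torch.distributed as dist

def quantized_collective(collective_fn, quantize, dequantize):
    """Return a wrapped collective: quantize input -> run collective -> dequantize output."""
    @functools.wraps(collective_fn)
    def wrapper(output, input, *args, **kwargs):
        q_input = quantize(input)
        q_output = torch.empty_like(q_input)
        collective_fn(q_output, q_input, *args, **kwargs)
        output.copy_(dequantize(q_output))
    return wrapper

# Example pairing with FP16 quantization and all_to_all_single:
fp16_all_to_all_single = quantized_collective(
    dist.all_to_all_single,
    quantize=lambda t: t.to(torch.float16),
    dequantize=lambda t: t.to(torch.float32),
)
```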

Test Plan:
Tested through distributed_gloo_fork.
e.g., buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_all_to_all_quantized

Reviewed By: wanchaol

Differential Revision: D29682812

fbshipit-source-id: 79c39105ff11270008caa9f566361452fe82a92e
2021-08-09 08:11:22 -07:00
Sean Lawlor
34c9f5a8da [DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662

Replaced the methods `set_tensor(.)` and `get_tensor()` in the Python-exposed API from the C++ logic with `buffer()` and `set_buffer(.)` for a cleaner interface.

Reviewed By: SciPioneer

Differential Revision: D30012869

fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
2021-08-04 09:27:31 -07:00
Andrew Gu
62a90c227f Make _Join, _Joinable, _JoinHook public (#62605)
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
2021-08-03 12:20:11 -07:00
Andrew Gu
43327cc197 Refactor commonalities between two approaches (#62624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62624

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30058543

Pulled By: andwgu

fbshipit-source-id: 73c794062b75e011868fae264f592549eed67482
2021-08-03 08:43:14 -07:00
Andrew Gu
e6a3967c2a Add invariant check (bucket indices: 0, 1, ..., k-1) (#62623)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62623

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D30058544

Pulled By: andwgu

fbshipit-source-id: a56910f294c6a40118751eebe255b62700f42be9
2021-08-03 08:13:52 -07:00
Yi Wang
db071ef005 [Reland][DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592

Reland #62510

`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last,
3) get_per_parameter_tensors -> gradients,
4) get_model_params_for_bucket -> parameters.
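
A before/after sketch of a hook using the renamed accessors (the hook itself is illustrative and simply delegates to the built-in allreduce hook):

```python
import torch
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook

def verbose_allreduce_hook(process_group, bucket) -> torch.futures.Future[torch.Tensor]:
    # Renamed accessors: index() was get_index(), is_last() was
    # is_the_last_bucket_to_allreduce(), gradients() was
    # get_per_parameter_tensors(), parameters() was get_model_params_for_bucket().
    if bucket.is_last():
        print(f"bucket {bucket.index()}: {len(bucket.gradients())} grads, "
              f"{len(bucket.parameters())} params")
    return allreduce_hook(process_group, bucket)  # delegate the actual communication
```
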
ghstack-source-id: 134848352

Test Plan: unit test

Reviewed By: andwgu

Differential Revision: D30049431

fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
2021-08-02 16:38:09 -07:00
Yi Wang
2ec4f69b48 [DDP Comm Hook] Do not expose hook_then_optimizer as a public method (#62532)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532

This method is not stable at this time, so avoid releasing it when the DDP communication hook feature is released as stable.
ghstack-source-id: 134787831

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl

Reviewed By: rohan-varma

Differential Revision: D30031222

fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
2021-08-02 12:25:01 -07:00
Eli Uriegas
6f95850127 Revert D30024161: [DDP Communication Hook] Rename 4 Methods of GradBucket Class
Test Plan: revert-hammer

Differential Revision:
D30024161 (29c8b1db57)

Original commit changeset: 07e6072a2f7b

fbshipit-source-id: d571c2caadaf7b71fe2aba3c0597bd8074d153de
2021-08-02 10:26:54 -07:00
Qing Hu
29c8b1db57 [DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510

`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last,
3) get_per_parameter_tensors -> gradients,
4) get_model_params_for_bucket -> parameters.

Test Plan:
Ran the comprehensive test locally, with the following results:
https://pxl.cl/1Ml8b
The two timeout failures are most likely environment related; they fail on my devserver.

Reviewed By: SciPioneer

Differential Revision: D30024161

fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
2021-08-02 09:33:32 -07:00
Andrew Gu
51f687fd4b Add overlap with DDP to ZeRO (two approaches) (#62157)
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.

Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
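
A rough usage sketch of the first approach (model construction and hyperparameters are placeholders; treat the exact hook-constructor details as assumptions):

```python
import torch
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group(...) has been called and `model` (placeholder)
# is already on the correct device.
ddp_model = DDP(model)
zero_optim = ZeroRedundancyOptimizer(
    ddp_model.parameters(),
    optimizer_class=torch.optim.SGD,
    overlap_with_ddp=True,
    lr=0.01,
)
hook = hook_with_zero_step(allreduce_hook, ddp_model, zero_optim)
ddp_model.register_comm_hook(None, hook)

# The training loop still calls zero_optim.step(); the first two iterations are
# vacuous while the DDP bucketing is finalized, as noted above.
```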

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157

Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`

These were tested on the AI AWS cluster.

An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.

Both approaches have been verified using an internal accuracy benchmark.

Reviewed By: mrshenli

Differential Revision: D29971046

Pulled By: andwgu

fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
2021-08-02 08:33:34 -07:00
Yi Wang
32b37ba246 [DDP Communication Hook] Update the typing info of comm hook output as well as some docstring (#62457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457

Specify `Future[torch.Tensor]` as the DDP communication hook return type, which should explicitly be a single tensor. The previous API took a list containing a single tensor.

Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
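
For illustration, an allreduce-style hook conforming to the new return type (a sketch, not the library's exact default hook):

```python
import torch
import torch.distributed as dist

def allreduce_avg_hook(process_group, bucket) -> torch.futures.Future[torch.Tensor]:
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    fut = dist.all_reduce(bucket.buffer(), group=group, async_op=True).get_future()
    # The hook must now resolve to a single tensor, not a one-element tensor list.
    return fut.then(lambda f: f.value()[0] / world_size)
```
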
ghstack-source-id: 134771419

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type

Reviewed By: rohan-varma

Differential Revision: D30007390

fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
2021-07-30 20:51:34 -07:00
Yi Wang
acba9b3104 [DDP Communication Hook] Simplify the implementation of parseHookResult of PythonCommHook (#62389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62389

Simplify the implementation of `parseHookResult` of `PythonCommHook`, by not directly accepting the output of allreduce, which is a tensor list.

Address the comment on https://github.com/pytorch/pytorch/pull/62074#discussion_r675303280

Additionally, formatter is also applied to `OptimizerHookState` and `hook_then_optimizer`.
ghstack-source-id: 134626246

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork

Reviewed By: rohan-varma

Differential Revision: D29982485

fbshipit-source-id: 5b27cc5ef09d2f87c1ade4c0feef7eacc1af3a9a
2021-07-29 17:27:35 -07:00