pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Edward Yang	90b08643c3	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-03 07:33:55 +00:00
PyTorch MergeBot	4e42aa8ffc	Revert "Always build USE_DISTRIBUTED. (#160449 )" This reverts commit `b7034e9c92`. Reverted https://github.com/pytorch/pytorch/pull/160449 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3246689684))	2025-09-02 20:28:42 +00:00
Edward Yang	b7034e9c92	Always build USE_DISTRIBUTED. (#160449 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160449 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/dcci	2025-09-01 23:00:21 +00:00
Aaron Gokaslan	3555ebb63d	[BE]: Update ruff to 0.11.8 (#153249 ) Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249 Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere	2025-05-12 18:30:52 +00:00
Xuehai Pan	995df34b19	[BE][PYFMT] migrate PYFMT for `torch.{distributed,distributions}` to `ruff format` (#144547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547 Approved by: https://github.com/kwen2501	2025-02-28 07:35:56 +00:00
Aaron Orenstein	00ffeca1b1	PEP585 update - torch/distributed (#145164 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164 Approved by: https://github.com/bobrenjc93	2025-01-21 04:23:29 +00:00
PyTorch MergeBot	6374332d33	Revert "PEP585 update - torch/distributed (#145164 )" This reverts commit `6cb186e279`. Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))	2025-01-20 16:46:46 +00:00
Aaron Orenstein	6cb186e279	PEP585 update - torch/distributed (#145164 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164 Approved by: https://github.com/bobrenjc93	2025-01-20 00:19:01 +00:00
Aaron Gokaslan	08db735629	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-03 02:50:10 +00:00
PyTorch MergeBot	daa77f3d9f	Revert "[BE]: Update mypy to 1.13.0 (#140808 )" This reverts commit `00134d68af`. Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))	2024-12-02 20:47:43 +00:00
Aaron Gokaslan	00134d68af	[BE]: Update mypy to 1.13.0 (#140808 ) Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-12-02 18:47:54 +00:00
lzhang2	1886e33f60	Use device-agnostic runtime API in distributed DDP/FSDP instead of `cuda` device specific. (#137678 ) # Motivation This PR targets to use device-agnostic runtime API in distributed DDP/FSDP instead of `cuda` device specific. cc [@jgong5](https://github.com/jgong5) [@gujinghui](https://github.com/gujinghui) [@EikanWang](https://github.com/EikanWang) [@fengyuan14](https://github.com/fengyuan14) [@guangyey](https://github.com/guangyey) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137678 Approved by: https://github.com/kwen2501, https://github.com/guangyey, https://github.com/jgong5	2024-11-13 05:32:19 +00:00
Xuehai Pan	e6d4451ae8	[BE][Easy] enable UFMT for `torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/` (#128866 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866 Approved by: https://github.com/fegin	2024-06-18 13:51:53 +00:00
Aaron Orenstein	3a0d088517	Flip default value for mypy disallow_untyped_defs [5/11] (#127842 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842 Approved by: https://github.com/oulgen	2024-06-08 18:49:18 +00:00
mingxzhao	1859895ffa	Docs: fix docstring errors in model_averaging (#117038 ) pydocstyle check averagers.py Pre /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level: D100: Missing docstring in public module /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`: D107: Missing docstring in __init__ /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`: D102: Missing docstring in public method /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`: D107: Missing docstring in __init__ /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`: D205: 1 blank line required between summary line and description (found 0) /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`: D400: First line should end with a period (not '`') 6 Post /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level: D100: Missing docstring in public module /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`: D107: Missing docstring in __init__ /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`: D102: Missing docstring in public method /workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`: D107: Missing docstring in __init__ 4 utils.py Pre /workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level: D100: Missing docstring in public module /workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:17 in public function `average_parameters`: D205: 1 blank line required between summary line and 
description (found 0) /workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`: D205: 1 blank line required between summary line and description (found 0) /workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') /workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:68 in public function `average_parameters_or_parameter_groups`: D200: One-line docstring should fit on one line with quotes (found 3) 5 Post /workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level: D100: Missing docstring in public module 1 hierarchical_model_averager.py Pre /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level: D100: Missing docstring in public module /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:16 in public class `HierarchicalModelAverager`: D205: 1 blank line required between summary line and description (found 0) /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:98 in public method `__init__`: D107: Missing docstring in __init__ /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`: D205: 1 blank line required between summary line and description (found 0) /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`: D400: First line should end with a period (not ',') /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') 
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`: D205: 1 blank line required between summary line and description (found 0) /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`: D400: First line should end with a period (not '`') 8 Post /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level: D100: Missing docstring in public module /workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:99 in public method `__init__`: D107: Missing docstring in __init__ 2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117038 Approved by: https://github.com/H-Huang	2024-01-18 04:12:51 +00:00
Edward Z. Yang	5a7aad9681	Convert logging f-strings to use % format, part four (#98705 ) This does multi-line concatenated string literals. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705 Approved by: https://github.com/voznesenskym	2023-04-11 13:17:59 +00:00
Kazuaki Ishizaki	35fd5c548e	Fix typos under torch/distributed directory (#95638 ) This PR fixes typos in comments and messages of `.py` files under torch/distributed directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638 Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980	2023-03-27 21:13:44 +00:00
Rohan Varma	2b5625a726	Update hierarchical_model_averager.py (#85648 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/85648 Approved by: https://github.com/wayi1, https://github.com/H-Huang	2022-10-03 06:15:20 +00:00
anjali411	85073b8ddc	Add __all__ to fx, distributed and cuda submodules (#85080 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/85080 Approved by: https://github.com/albanD	2022-09-21 18:04:58 +00:00
joncrall	4618371da5	Integrate xdoctest - Rebased (#82797 ) This is a new version of #15648 based on the latest master branch. Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR. In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.) Fixes https://github.com/pytorch/pytorch/issues/71105 @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797 Approved by: https://github.com/ezyang	2022-08-12 02:08:01 +00:00
anjali411	3bcc19b29a	Add __all__ to various submodules in torch.fx, distributions, distributed, package (#80367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80367 Approved by: https://github.com/albanD	2022-06-27 21:27:30 +00:00
Yi Wang	25fa6235f4	[Model Averaging] Make an error message more clear in hierarchical_model_averager.py As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/75832 Approved by: https://github.com/mrshenli	2022-04-26 15:20:51 +00:00
wayi1	e90580390d	[Model Averaging] Make the error message more informative in hierarchical_model_averager.py As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/76277 Approved by: https://github.com/rohan-varma	2022-04-24 15:10:19 +00:00
Alban Desmaison	da3c848dfa	Make distributed raise ImportError when not available Pull Request resolved: https://github.com/pytorch/pytorch/pull/75975 Approved by: https://github.com/mrshenli	2022-04-20 13:05:18 +00:00
Haijunlv	08f3b95857	fix PostLocalSGDOptimizer and ModelAverager average bug Fixes #74157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/74894 Approved by: https://github.com/rohan-varma, https://github.com/wayi1	2022-04-13 11:41:27 +00:00
wayi1	4fb7fa081e	[Model Averaging] Code simplification for _find_process_group function (#75007 ) Summary: Previously the highest-level process group in `period_process_group_dict` could be `None`, indicating the global group. Now `period_process_group_dict` cannot contain `None` as a process group, so the function `_find_process_group` can just return a process group instead of a tuple -- when not found, just return `None`, because now the returned process group cannot be `None`. Proposal: https://github.com/pytorch/pytorch/issues/71325 Pull Request resolved: https://github.com/pytorch/pytorch/pull/75007 Reviewed By: awgu Differential Revision: D35357816 Pulled By: rohan-varma fbshipit-source-id: 4522dba49797df7140227bfd822d668b7e118a66 (cherry picked from commit 77ca01b555d52685283c969176b08de4ff46c32d)	2022-04-04 20:31:22 +00:00
wayi1	5fbe8b1966	[Model Averaging] Make HierarchicalModelAverager a subclass of averagers.ModelAverager Make `HierarchicalModelAverager` a subclass of `averagers.ModelAverager` is a preparation step for incorporating hierarchical SGD into `PostLocalSGDOptimizer`. Proposal: https://github.com/pytorch/pytorch/issues/73382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/74564 Approved by: https://github.com/rohan-varma	2022-03-24 21:52:00 +00:00
wayi1	5993f48711	[Model Averaging] Add a reference to hierarchical SGD (#73823 ) Summary: Add a reference. Also fix the comment: unlike `averagers.py`, currently this is not a base class that can inherit many subclasses. Pull Request resolved: https://github.com/pytorch/pytorch/pull/73823 Reviewed By: ejguan Differential Revision: D34684366 Pulled By: rohan-varma fbshipit-source-id: e253ed39ba0783ad73bfd889e9a2e7d0c9214a3a (cherry picked from commit a9fec3585078881ccd5886ebb27e52b15f7181b1)	2022-03-08 05:56:17 +00:00
wayi1	0bb3b0652c	[Model Averaging] Support hierarchical model averaging (#73285 ) Summary: Implement hierarchical model averaging proposed in https://github.com/pytorch/pytorch/issues/71325. Unit tests are added. Since I don't have access to 4-GPU machines in open-source environment, expect that the branch with the prefix of `ci-all` can run the test that requires 4 GPUs. In the future, the internals of `PeriodicModelAveraging` can be simplified as an implementation of a specialized hierarchical model averaging, where `period_group_size_dict` only has a pair of period and world size. Pull Request resolved: https://github.com/pytorch/pytorch/pull/73285 Reviewed By: mrshenli Differential Revision: D34457792 Pulled By: rohan-varma fbshipit-source-id: 39a6c5bf8a2852b6394a56abbad17b8a909b9fba (cherry picked from commit 5f543d46103edb515db199dbb80db43c85665f29)	2022-03-04 18:29:36 +00:00
wayi1	8b08478115	Fix the doc of PostLocalSGDState (#72792 ) Summary: The first arg of `PostLocalSGDState` ctor, `process_group`, cannot be empty. Here to simplify the usage, does not even create a subgroup explicitly. See the example in unit test: `4feef6c970/torch/testing/_internal/distributed/distributed_test.py (L4260)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/72792 Reviewed By: samdow Differential Revision: D34213221 Pulled By: rohan-varma fbshipit-source-id: 078343f3ee138e175bf835897f190032eb970662 (cherry picked from commit `bf90af704f`)	2022-02-15 23:47:12 +00:00
Rohan Varma	d8abe813bc	[LocalSGD] Move feature to Beta, clean up some docs (#71621 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71621 Moves this feature to beta as discussed, and cleans up some docs. Synced offline with wayi1 who mentioned that the current names are preferred as he works to prototype hierarchical allreduce as discussed in this RFC: https://github.com/pytorch/pytorch/issues/71325. ghstack-source-id: 147382940 Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D33700444 fbshipit-source-id: 8eb543f5b02a119d0790a5c0919e6def6383a067 (cherry picked from commit `656e9809b2`)	2022-01-21 21:10:42 +00:00
Yi Wang	ed50a35cf8	[Model Averaging] Update the documentation of PeriodicModelAverager (#70974 ) Summary: Here 20 is a bad example, since the warmup step is set as 100. 200 iterations will make much more sense. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/70974 Reviewed By: dagitses Differential Revision: D33474576 Pulled By: rohan-varma fbshipit-source-id: 4c7043108897848bde9503d77999971ad5567aa6	2022-01-07 13:20:42 -08:00
Yi Wang	c1415a0a72	[Reland] [Model Averaging] Simplify PostLocalSGD Optimizer API (#65197 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197 1. The constructor accepts a local optimizer instance instead of the inputs of local optimizer constructor and the class type. 2. The parameters are read from local optimizer's param_groups instead of a separate input. Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 138307226 Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity Reviewed By: rohan-varma Differential Revision: D31007439 fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722	2021-09-17 10:31:58 -07:00
Yi Wang	00e6e0c593	[Model Averaging] Revert #63895 (#64903 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64903 Fix the accuracy regression caused by https://github.com/pytorch/pytorch/pull/63895. Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity Reviewed By: rohan-varma Differential Revision: D30894688 fbshipit-source-id: fe00b8b23b860d9f806f87c1b6caba1d0b807485	2021-09-14 09:45:42 -07:00
Yi Wang	7edeead796	Add a comment on the potential implicit type up-casting (#63905 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63905 as title ghstack-source-id: 136590703 Test Plan: N/A Reviewed By: mrshenli Differential Revision: D30527929 fbshipit-source-id: 69402bbfa87cfd8fc166ce313cde9736ee072589	2021-08-25 12:47:45 -07:00
Aayush Prakash	8a22d4fa5c	[Reland] Replacing the p.data access in utils with tensor.set_ . Passes both test_post_localSGD_optimizer_pari and test_periodic_model_averager tests (#63895 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63895 When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future. The replacement is `tensor.set_`. ghstack-source-id: 136593433 Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity Reviewed By: SciPioneer Differential Revision: D30526178 fbshipit-source-id: a1ac0ec3665d8623edd5bf94f01c1132daff5c00	2021-08-25 11:12:55 -07:00
Edward Yang	699c764d2e	Revert D30513613: Removing tensor.data usage in utils with tensor set_ method Test Plan: revert-hammer Differential Revision: D30513613 (`d08a36f831`) Original commit changeset: 402efb9c30fa fbshipit-source-id: 911c66a9852de77dc5274b5fb373258c0c97739a	2021-08-24 12:20:37 -07:00
Aayush Prakash	d08a36f831	Removing tensor.data usage in utils with tensor set_ method (#63867 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63867 When updating the model parameter, updating `parameter.data` is no longer recommended, because this `data` field will be deprecated in the future. The replacement is `tensor.set_`. ghstack-source-id: 136531233 Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager Reviewed By: SciPioneer Differential Revision: D30513613 fbshipit-source-id: 402efb9c30fafc3f285bebc631639f656ceae585	2021-08-24 11:20:44 -07:00
Yi Wang	9fee176be3	[Model Averaging] Fix docstring of PeriodicModelAverager (#62392 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62392 The constructor of `PeriodicModelAverager` does not need to accept parameters. ghstack-source-id: 134626245 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager Reviewed By: rohan-varma Differential Revision: D29986446 fbshipit-source-id: 6a8b709e4383a3c44b9e60955fbb067cd2868e76	2021-07-29 17:26:27 -07:00
Yi Wang	2eaf71d749	[Model Averaging] Update model averager API to avoid the redundant `params` arg needed by post-localSGD optimizer (#62132 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62132 as title Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 134560541 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_post_localSGD_optimizer_parity buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager Reviewed By: rohan-varma Differential Revision: D29887751 fbshipit-source-id: 60dadb04790d800fdcc7cb8a08d060e411718739	2021-07-28 18:43:09 -07:00
Yi Wang	2581dfc249	[Model Averaging] Create a base class for model averaging (#62111 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62111 This base class will be passed to the post-localSGD optimizer in the next PR. This way, the same post-localSGD optimizer can choose different model averaging algorithms. Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 134489187 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager Reviewed By: rohan-varma Differential Revision: D29884954 fbshipit-source-id: 1dc5e35c58895902991567f633afd621c7108938	2021-07-28 10:15:36 -07:00
Yi Wang	e856a45283	[Model Averaging] Refactor averagers to accept parameters instead of a module (#62105 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62105 This is for the preparation of wrapping the averager as an optimizer, which can only accept parameters rather than a module. Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 134213572 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters Reviewed By: rohan-varma Differential Revision: D29883693 fbshipit-source-id: 474ba924a0b05068b12f163fb74582bccf314964	2021-07-23 18:39:45 -07:00
Yi Wang	df00c636d2	[Model Averaging] Skip model averaging for the first K steps (#61207 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61207 Model averager now must be combined with post-localSGD DDP communication hook. It will skip model averaging for the first K steps, because post-localSGD communication hook will run global gradient averaging during this phase. Proposal: https://github.com/pytorch/pytorch/issues/59699 ghstack-source-id: 133371335 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager Reviewed By: pritamdamania87 Differential Revision: D29523738 fbshipit-source-id: 3fa9611046e1c0afa4bda78aa3ba200fa2a5fa4b	2021-07-10 17:12:16 -07:00
Yi Wang	5b6818f08a	[Model Averaging] Enforce a synchronization before allreduce parameters (#60891 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60891 This fix is particularly useful for local SGD when the averaging period is very small, which may cause the conflict between gradient allreduce within per-machine subgroup and the global parameter allreduce by the communication world. ghstack-source-id: 132564252 Test Plan: f281873295 (#Try1) failed due to the conflict between global process group and subgroup. ``` <Thread(configerator-monitor-singleton, started 139839806633728)> File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 890, in _bootstrap self._bootstrap_inner() File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 870, in run self._target(self._args, *self._kwargs) File "/tmp/jetter.gson7tr3/configerator/client.py", line 348, in _monitor_loop self._parent_thread.join(self._interval_ms / 1000) File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1015, in join self._wait_for_tstate_lock(timeout=max(timeout, 0)) File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock elif lock.acquire(block, timeout): ``` Fixed after adding an explicit sync: f282044866, f282241800 Reviewed By: rohan-varma Differential Revision: D29434597 fbshipit-source-id: a4f777fc26f379639f85fda32de425cd3b337b33	2021-06-29 01:39:40 -07:00
Yi Wang	f262217101	[Model Averaging] Move step out of model averaging API (#60632 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60632 Address the comment https://github.com/pytorch/pytorch/pull/60320#discussion_r654845062 ghstack-source-id: 132340278 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager Reviewed By: rohan-varma Differential Revision: D29355609 fbshipit-source-id: 50a6f13ed70b5a5b5b92ead2f3d7082c11277af5	2021-06-25 17:20:52 -07:00
Yi Wang	80f40b172f	[Model Averaging] Periodic model averager (#60320 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60320 This averager can be used for post-local SGD. ghstack-source-id: 131908011 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager Reviewed By: rohan-varma Differential Revision: D29249850 fbshipit-source-id: 09675d6bb1edfb8ffbeb94510d91962532d8ca3e	2021-06-23 20:23:04 -07:00
Yi Wang	aeea5bf4a1	[Model Averaging] Provide a util function for model averaging (#60303 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60303 The util function can be used for averaging parameters. More optimizations can be done in the future. ghstack-source-id: 132214212 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_average_parameters Reviewed By: rohan-varma Differential Revision: D29242806 fbshipit-source-id: 76fb5a92adb4bdc6151a9f411e366a0ed2a31f47	2021-06-23 15:41:15 -07:00

47 Commits
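
Several commits in this history ("Periodic model averager", "Skip model averaging for the first K steps", "Move step out of model averaging API") describe an averager that stays idle during a warm-up phase and then averages parameters once per period. The sketch below illustrates that idea in plain Python so it runs without PyTorch or a process group. The class name `PeriodicAveragerSketch`, its fields, and the exact scheduling rule are assumptions made for illustration and are not taken from the repository. Replicas are simulated as lists of floats rather than tensors synchronized by allreduce.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PeriodicAveragerSketch:
    """Toy stand-in for a periodic model averager (hypothetical, not the PyTorch class)."""

    period: int            # average once every `period` steps after warm-up
    warmup_steps: int = 0  # the "first K steps" during which averaging is skipped
    step: int = 0          # number of calls to average_parameters() so far

    def should_average(self) -> bool:
        # Assumed rule: skip warm-up, then average on every `period`-th step.
        return (
            self.step >= self.warmup_steps
            and (self.step - self.warmup_steps) % self.period == 0
        )

    def average_parameters(self, replicas: list[list[float]]) -> None:
        """Average parameters in place across simulated replicas, then advance the step."""
        if self.should_average():
            n = len(replicas)
            # Element-wise mean over replicas; a real implementation would use allreduce.
            means = [sum(values) / n for values in zip(*replicas)]
            for params in replicas:
                params[:] = means
        self.step += 1


# Usage: two simulated workers, one warm-up step, averaging every 2 steps.
replicas = [[1.0, 2.0], [3.0, 6.0]]
averager = PeriodicAveragerSketch(period=2, warmup_steps=1)
averager.average_parameters(replicas)  # step 0: warm-up, parameters untouched
averager.average_parameters(replicas)  # step 1: first averaging after warm-up
```

Keeping the step counter inside the averager and advancing it on every call means the caller only has to invoke `average_parameters` once per training iteration. Whether the real API does the same is not something this listing confirms.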