pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Pritam Damania	9dfeec9cdc	Add a mode to avoid clone() in DDPSink (#122927 ) DDPSink clones the outputs of DDP to avoid in-place modification of loss (see https://github.com/pytorch/pytorch/issues/61982). However, when outputs are really large (2-3GB) this adds a lot of overhead for peak memory. As a result, adding a mode to avoid this clone in cases where users are not modifying loss in-place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122927 Approved by: https://github.com/fegin, https://github.com/rohan-varma	2024-04-12 08:56:10 +00:00
Chien-Chin Huang	b279034e5a	[DDP][PT2D] Add the trace rules for DDP (#121741 ) Add the trace rules for DDP and refactor the tests to verify both DDP and replicate. Differential Revision: [D54815909](https://our.internmc.facebook.com/intern/diff/D54815909/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121741 Approved by: https://github.com/yf225 ghstack dependencies: #123206, #123207	2024-04-08 19:53:13 +00:00
Chien-Chin Huang	6a3b47ec8f	[PT2D][DDP] Remove the hack to pass None as the process group (#123207 ) Functional collectives can now handle None as the process group. Differential Revision: [D55658338](https://our.internmc.facebook.com/intern/diff/D55658338/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123207 Approved by: https://github.com/kwen2501 ghstack dependencies: #123206	2024-04-08 19:24:29 +00:00
Chien-Chin Huang	c7193f4099	[DDP][PT2D][2D] Enable DDP + TP and add test for compiled DDP + TP (#120479 ) This PR enables DDP + TP using a TP internal API. This should not be the final implementation. A more sound implementation is to inline the TP internal API in DDP. In other words, DDP needs to be aware of DTensor so that we can support 2D state_dict. This PR adds a compiled DDP + TP test to ensure the new compiled DDP fusion doesn't break TP all_reduce. TODOs - [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass. - [x] Add unit tests to ensure the fusion doesn't break DDP + TP. - [ ] Group different PG and data type of all_reduces. - [ ] Mixed precision support and tests - [ ] Implement the fusions with Inductor IR. - [ ] Add auto bucketing based on Inductor profiling. Differential Revision: [D54105050](https://our.internmc.facebook.com/intern/diff/D54105050/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120479 Approved by: https://github.com/wz337 ghstack dependencies: #113209	2024-03-13 21:41:22 +00:00
Chien-Chin Huang	8e6d572b4e	[DDP][PT2D] Allreduce fusion fx pass using concat and all_reduce_coalesced (#113209 ) Differential Revision: [D49858057](https://our.internmc.facebook.com/intern/diff/D49858057/) TL;DR This PR implements 2 different DDP all_reduce fusions in Inductor post_grad fx passes. The two fusions are 1) fusion with concat op and 2) fusion with all_reduce_coalesced. When DDP detects that Python reducer is being used, DDP will automatically turn on the fusion. This PR does not invent any algorithm and simply reflects the bucket size users set to DDP. Implementation Details Fusion with concat op The idea of this fusion is to use a concat op to concatenate all the gradients into one tensor and perform one `all_reduce`. After the `wait` op of the `all_reduce`, splitting and reshaping will also be performed to get the individual gradients. Because DDP needs to perform gradient scaling, the benefit of using this fusion is that we could perform the gradient scaling over the concatenated buffer. Fusion with `all_reduce_coalesced` The idea of this fusion is to use `all_reduce_coalesced` op to directly perform the `all_reduce` over multiple buffers. This avoids the copy overhead but may not achieve the best NCCL performance. In addition, because there are multiple buffers, we could not do one simple gradient scaling but have to rely on `foreach_div` to help the gradient scaling. Limitations Current fusions do not distinguish `all_reduce` generated by different DDP modules. This is okay if all DDP instances use the same PG and data type. The support of multiple DDP instances with different PG and data type will come in the later PRs. TODOs - [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass. - [ ] Add unit tests to ensure the fusion doesn't break DDP + TP. - [ ] Group different PG and data type of `all_reduce`s. - [ ] Mixed precision support and tests - [ ] Implement the fusions with Inductor IR. - [ ] Add auto bucketing based on Inductor profiling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113209 Approved by: https://github.com/yf225	2024-03-13 20:37:09 +00:00
Chien-Chin Huang	3179107629	[DDP][PT2D] Ignore gradient sync if the gradient is not defined (#120419 ) From the test, accum_grad_hook can still be fired even if the gradient is None. We need to ignore the gradient sync for this case. Differential Revision: [D54076485](https://our.internmc.facebook.com/intern/diff/D54076485/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120419 Approved by: https://github.com/yf225, https://github.com/XilunWu	2024-02-29 00:27:54 +00:00
Chien-Chin Huang	1d2382f141	[DDP] Use compiled_autograd to trace DDP backward allreduce (#110662 ) Summary The reducer of `DistributedDataParallel` is implemented in C++, so it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduce so that it can later be optimized (fused) in Inductor. Key Logic 1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters. 2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`. 3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter. Bucketing The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces. The bucketing is done in a separate PR. Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662 Approved by: https://github.com/wconstab	2024-02-08 03:03:15 +00:00
Edward Z. Yang	46712b019d	Enable local_partial_types (#118467 ) When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467 Approved by: https://github.com/Skylion007 ghstack dependencies: #118414, #118418, #118432	2024-01-28 13:38:22 +00:00
Ke Wen	58c4bc62bb	[c10d] Deprecate Work.result() (#117565 ) Work.result() returns a vector of tensors. This signature is problematic as some collectives may just return one tensor (e.g all-reduce), while some others may return multiple tensors (e.g. all-gather). It would be clearer/easier for users to directly access the result via the tensor/tensorlist passed to the collective APIs. Deprecating work.result() would also allow us to remove the `outputs_` field in the Work class, avoiding an "artificial" reference to the tensor, which could potentially hold up the tensor's memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117565 Approved by: https://github.com/wconstab	2024-01-18 01:22:37 +00:00
Aaron Gokaslan	bbe3261dd3	[BE]: Use `itertools.chain.from_iterable` where possible (#116376 ) This is more readable and more efficient when dealing with lots of sequences to chain together. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116376 Approved by: https://github.com/albanD	2023-12-27 19:20:07 +00:00
Albert Zeyer	3642f29a64	DistributedDataParallel._post_forward, fix return (#114678 ) Fix `return` in case of `_delay_all_reduce_all_params`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114678 Approved by: https://github.com/Skylion007, https://github.com/fegin	2023-12-06 23:44:52 +00:00
Chip Turner	9cc040fef6	Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880 ) Previously: ``` [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt) [W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt) ``` With this PR, those warnings disappear. They were introduced in #114077 This change was generated with this sed script, applied with `sed -i -f /tmp/x */.{py,hpp,cpp,cc}` and hand inspected. ``` s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880 Approved by: https://github.com/kwen2501	2023-12-01 20:08:23 +00:00
wz337	7b3e45be59	[DeviceMesh] Rename get_dim_groups to get_group (#114708 ) Rename get_dim_groups to get_group and update all callsites. Differential Revision: [D51629801](https://our.internmc.facebook.com/intern/diff/D51629801/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114708 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin	2023-11-30 23:40:14 +00:00
Pritam Damania	f505d76462	Bug fixes to DDP _update_process_group API. (#114194 ) https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular if an error occurred during backward, DDP would end up in an incorrect state. As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194 Approved by: https://github.com/rohan-varma	2023-11-27 23:52:40 +00:00
Pritam Damania	17e2313dd3	Add an API to DDP for dynamically updating the underlying process group. (#113580 ) # Motivation If we would like to reinitialize DDP with a different PG with `torch.compile`, we need to do the following: ``` del old_ddp del old_pg pg = init_pg(...) ddp = DDP(pg) model = torch.compile(DDP) ``` This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again. # Proposal As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580 Approved by: https://github.com/fduwjj	2023-11-15 09:05:02 +00:00
wz337	f2963642c2	[DDP] Add device_mesh to DDP ctor (#112761 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112761 Approved by: https://github.com/fegin	2023-11-08 03:08:08 +00:00
Aaron Gokaslan	8219bf051b	[BE]: Apply RUF015 to torch folder (#113025 ) Removes unnecessary allocations of iterators. There is a small chance this may have side effects as the entire iterator is no longer consumed, but this is a way more efficient method for retrieving the first element. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-11-07 00:48:15 +00:00
NVS Abhilash	db66f15785	docs: fix docstrings in distributed.py and others (fixes #112604 ) (#112657 ) Fixes #112604 Fixes docstring by following `pydocstyle` outputs. - torch/nn/parallel/distributed.py Before: 84 ``` torch/nn/parallel/distributed.py:1 at module level: D100: Missing docstring in public module torch/nn/parallel/distributed.py:92 in private function `_cast_buffers`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`: D401: First line should be in imperative mood (perhaps 'Create', not 'Creates') torch/nn/parallel/distributed.py:143 in private function `_find_tensors`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/parallel/distributed.py:273 in private method `__init__`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/parallel/distributed.py:273 in private method `__init__`: D401: First line should be in imperative mood (perhaps 'Set', not 'Sets') torch/nn/parallel/distributed.py:287 in private method `main_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:287 in private method `main_hook`: D400: First line should end with a period (not 'd') torch/nn/parallel/distributed.py:324 in private method `post_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:324 in private method `post_hook`: D400: First line should end with a period (not 'l') torch/nn/parallel/distributed.py:324 in private method `post_hook`: D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs') torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`: D205: 1 blank line required between summary 
line and description (found 0) torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`: D400: First line should end with a period (not 'n') torch/nn/parallel/distributed.py:633 in public method `__init__`: D107: Missing docstring in __init__ torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`: D401: First line should be in imperative mood (perhaps 'Fire', not 'Fires') torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`: D400: First line should end with a period (not 's') torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`: D400: First line should end with a period (not 'e') torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`: D400: First line should end with a period (not ':') torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`: D401: First line should be in imperative mood (perhaps 'Initialize', not 'Initialization') torch/nn/parallel/distributed.py:1146 in public method `__getstate__`: D105: Missing docstring in magic method torch/nn/parallel/distributed.py:1154 in public method `__setstate__`: D105: Missing docstring in magic method torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`: D205: 1 blank line required between summary line and description (found 0) 
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`: D400: First line should end with a period (not 'o') torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`: D401: First line should be in imperative mood (perhaps 'Assign', not 'Assigns') torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`: D400: First line should end with a period (not 's') torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') torch/nn/parallel/distributed.py:1312 in public method `no_sync`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1312 in public method `no_sync`: D400: First line should end with a period (not 'P') torch/nn/parallel/distributed.py:1312 in public method `no_sync`: D401: First line should be in imperative mood; try rephrasing (found 'A') torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`: D403: First word of the first line should be properly capitalized ('Torchdynamo', not 'TorchDynamo') torch/nn/parallel/distributed.py:1517 in public method `forward`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1527 in public method `scatter`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1530 in public method `to_kwargs`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1539 in public method `gather`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1542 in public method `train`: D102: Missing docstring in public method 
torch/nn/parallel/distributed.py:1617 in public method `join`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1617 in public method `join`: D400: First line should end with a period (not 'f') torch/nn/parallel/distributed.py:1617 in public method `join`: D401: First line should be in imperative mood; try rephrasing (found 'A') torch/nn/parallel/distributed.py:1723 in public method `join_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1723 in public method `join_hook`: D400: First line should end with a period (not 'y') torch/nn/parallel/distributed.py:1723 in public method `join_hook`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') torch/nn/parallel/distributed.py:1752 in public method `join_device`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1756 in public method `join_process_group`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`: D400: First line should end with a period (not 'e') torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`: D401: First line should be in imperative mood (perhaps 'Allow', not 'Allows') torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`: D400: First line should end with a period (not 'a') torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`: D401: First line should be in imperative mood (perhaps 'Register', not 'Registers') torch/nn/parallel/distributed.py:1887 in private method 
`_register_builtin_comm_hook`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`: D400: First line should end with a period (not 'P') torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`: D401: First line should be in imperative mood (perhaps 'Register', not 'Registers') torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`: D400: First line should end with a period (not 'a') torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`: D401: First line should be in imperative mood (perhaps 'Register', not 'Registers') torch/nn/parallel/distributed.py:2005 in public method `will_sync_module_buffers`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`: D400: First line should end with a period (not 'e') torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`: D400: First line should end with a period (not 'r') torch/nn/parallel/distributed.py:2141 in private 
method `_set_params_and_buffers_to_ignore_for_model`: D401: First line should be in imperative mood (perhaps 'Set', not 'Sets') torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`: D400: First line should end with a period (not 's') torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`: D401: First line should be in imperative mood; try rephrasing (found 'This') torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`: D400: First line should end with a period (not 'g') torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`: D401: First line should be in imperative mood; try rephrasing (found 'This') torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`: D205: 1 blank line required between summary line and description (found 0) torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`: D400: First line should end with a period (not 'l') torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`: D401: First line should be in imperative mood; try rephrasing (found 'It') torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`: D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes') torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`: D205: 1 blank line required between summary line and description (found 0) 
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`: D400: First line should end with a period (not 'd') torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`: D401: First line should be in imperative mood (perhaps 'Check', not 'Checks') 84 ``` After: 12 ``` torch/nn/parallel/distributed.py:1 at module level: D100: Missing docstring in public module torch/nn/parallel/distributed.py:618 in public method `__init__`: D107: Missing docstring in __init__ torch/nn/parallel/distributed.py:1133 in public method `__getstate__`: D105: Missing docstring in magic method torch/nn/parallel/distributed.py:1141 in public method `__setstate__`: D105: Missing docstring in magic method torch/nn/parallel/distributed.py:1503 in public method `forward`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1513 in public method `scatter`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1516 in public method `to_kwargs`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1525 in public method `gather`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1528 in public method `train`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1734 in public method `join_device`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1738 in public method `join_process_group`: D102: Missing docstring in public method torch/nn/parallel/distributed.py:1986 in public method `will_sync_module_buffers`: D102: Missing docstring in public method 12 ``` - torch/nn/utils/_named_member_accessor.py Before: 23 ``` torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`: D103: Missing docstring in public function torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`: D103: Missing docstring in public function torch/nn/utils/_named_member_accessor.py:85 in public function 
`swap_submodule`: D103: Missing docstring in public function torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`: D400: First line should end with a period (not 's') torch/nn/utils/_named_member_accessor.py:115 in public method `__init__`: D107: Missing docstring in __init__ torch/nn/utils/_named_member_accessor.py:122 in public method `get_submodule`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:155 in public method `swap_submodule`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:164 in public method `get_tensor`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:185 in public method `set_tensor`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:194 in public method `del_tensor`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:211 in public method `swap_tensor`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:224 in public method `get_tensors`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:233 in public method `set_tensors`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:249 in public method `set_tensors_dict`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:261 in public method `del_tensors`: D205: 1 blank line required between summary line and description (found 0) 
torch/nn/utils/_named_member_accessor.py:276 in public method `swap_tensors`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:296 in public method `swap_tensors_dict`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_named_member_accessor.py:325 in public method `check_keys`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/utils/_named_member_accessor.py:340 in public method `named_parameters`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/utils/_named_member_accessor.py:349 in public method `named_buffers`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/utils/_named_member_accessor.py:358 in public method `named_tensors`: D200: One-line docstring should fit on one line with quotes (found 3) torch/nn/utils/_named_member_accessor.py:368 in public method `named_modules`: D200: One-line docstring should fit on one line with quotes (found 3) 23 ``` After: 4 ``` torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`: D103: Missing docstring in public function torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`: D103: Missing docstring in public function torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`: D103: Missing docstring in public function torch/nn/utils/_named_member_accessor.py:116 in public method `__init__`: D107: Missing docstring in __init__ 4 ``` - torch/nn/utils/_per_sample_grad.py Before: 3 ``` torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`: D400: First line should end with a period (not ')') torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`: D402: 
First line should not be the function's "signature" 3 ``` After: 0 ``` 0 ``` - torch/nn/utils/init.py Before: 3 ``` torch/nn/utils/init.py:1 at module level: D100: Missing docstring in public module torch/nn/utils/init.py:6 in public function `skip_init`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/init.py:6 in public function `skip_init`: D400: First line should end with a period (not 'g') 3 ``` After: 1 ``` torch/nn/utils/init.py:1 at module level: D100: Missing docstring in public module 1 ``` - torch/nn/utils/memory_format.py Before: 4 ``` torch/nn/utils/memory_format.py:1 at module level: D100: Missing docstring in public module torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`: D202: No blank lines allowed after function docstring (found 1) torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`: D205: 1 blank line required between summary line and description (found 0) torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`: D400: First line should end with a period (not '`') 4 ``` After: 1 ``` torch/nn/utils/memory_format.py:1 at module level: D100: Missing docstring in public module 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112657 Approved by: https://github.com/fduwjj	2023-11-02 05:52:47 +00:00
Oleg Bulatov	192477b5ba	Enable flake8-bugbear B020 lint (#110823 ) Fixes part of https://github.com/pytorch/pytorch/issues/106571 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110823 Approved by: https://github.com/Skylion007	2023-10-24 22:43:47 +00:00
Rohan Varma	24e5d61af8	Log usage of optimizer in backward (#110206 ) This will allow us to inspect and aggregate jobs that use optimizer in backward Differential Revision: [D48674740](https://our.internmc.facebook.com/intern/diff/D48674740/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110206 Approved by: https://github.com/awgu	2023-09-29 11:00:07 +00:00
Andrei Gheorghe	6275f91654	Improved DDP checkpoint documentation (#106985 ) Amended the documentation for the specified case. Fixes #84589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/106985 Approved by: https://github.com/wanchaol, https://github.com/fduwjj	2023-09-25 22:54:24 +00:00
Aaron Gokaslan	660e8060ad	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285 which is faster, better, and fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-22 23:16:38 +00:00
PyTorch MergeBot	d59a6864fb	Revert "[BE]: Update ruff to 0.285 (#107519 )" This reverts commit `88ab3e4322`. Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))	2023-08-22 19:53:32 +00:00
Aaron Gokaslan	88ab3e4322	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285 which is faster, better, and fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-20 01:36:18 +00:00
Rohan Varma	c11412b4a8	[DDP] Support optim in backward after DDP init (#105995 ) This allows in backward optimizers to be configured after DDP init, in addition to before as was previously supported. Differential Revision: [D47783347](https://our.internmc.facebook.com/intern/diff/D47783347/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105995 Approved by: https://github.com/fegin	2023-07-29 01:36:25 +00:00
Justin Chu	79c5e33349	[BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ (#105436 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436 Approved by: https://github.com/malfet, https://github.com/albanD	2023-07-21 07:38:46 +00:00
Animesh Jain	0444f9f85b	[dynamo] Reland #104317 - Lazy disable_dynamo API out-of-dynamo (#104664 ) Internal failed because of torch.deploy issues with disable_dynamo in fx/* and _jit/* files. Removing disable_dynamo for both. Added a comment in the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104664 Approved by: https://github.com/wconstab	2023-07-06 00:48:02 +00:00
PyTorch MergeBot	54e320d4d1	Revert "[dynamo] Lazy disable_dynamo API out-of-dynamo (#104317 )" This reverts commit `5c12a810ac`. Reverted https://github.com/pytorch/pytorch/pull/104317 on behalf of https://github.com/huydhn due to This has been reverted internally by D47166892, so I need to also revert it on OSS to keep them in sync ([comment](https://github.com/pytorch/pytorch/pull/104317#issuecomment-1621099151))	2023-07-05 06:21:48 +00:00
Animesh Jain	5c12a810ac	[dynamo] Lazy disable_dynamo API out-of-dynamo (#104317 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104317 Approved by: https://github.com/jansel, https://github.com/wconstab, https://github.com/mlazos	2023-06-29 13:30:17 +00:00
Howard Huang	9165d46b89	DDP + C10D sparse all_reduce changes (#103916 ) (#104256 ) Summary: reland of https://github.com/pytorch/pytorch/pull/103916 ## Changes prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function. prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...` ## Example script ```python # python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py import torch import torch.distributed as dist def main(): dist.init_process_group(backend="nccl") rank = dist.get_rank() a = torch.tensor([[0, 2.], [3, 0]]).to(rank) a = a.to_sparse() print(f"rank {rank} - a: {a}") dist.all_reduce(a) if __name__ == "__main__": main() ``` output: ``` rank 1 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse rank 0 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse ``` Test Plan: Testing commands (OSS): ``` # python pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops # c++ build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Testing commands (internal, ondemand GPU): ddp tests: ``` buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output # Get the .par file from the previous command and use it below TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata ``` c10d tests: ``` # build tests and run with log output (python) buck build mode/opt -c 
hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops # python NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)' # c++ NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Differential Revision: D47056695 Pulled By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/104256 Approved by: https://github.com/rohan-varma	2023-06-28 00:37:52 +00:00
PyTorch MergeBot	436d035dc7	Revert "DDP + C10D sparse all_reduce changes (#103916 )" This reverts commit `fed5fba6e4`. Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))	2023-06-26 22:37:58 +00:00
Howard Huang	fed5fba6e4	DDP + C10D sparse all_reduce changes (#103916 ) Summary: ## Changes prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function. prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...` ## Example script ```python # python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py import torch import torch.distributed as dist def main(): dist.init_process_group(backend="nccl") rank = dist.get_rank() a = torch.tensor([[0, 2.], [3, 0]]).to(rank) a = a.to_sparse() print(f"rank {rank} - a: {a}") dist.all_reduce(a) if __name__ == "__main__": main() ``` output: ``` rank 1 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse rank 0 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse ``` Test Plan: Testing commands (OSS): ``` # python pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops # c++ build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Testing commands (internal, ondemand GPU): ddp tests: ``` buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output # Get the .par file from the previous command and use it below TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata ``` c10d tests: ``` # build tests and run with log output (python) buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output NCCL_DEBUG=WARN 
/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops # python NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)' # c++ NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Differential Revision: D46724856 Pulled By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916 Approved by: https://github.com/rohan-varma	2023-06-26 20:42:17 +00:00
Rohan Varma	f044613f78	Back out "Revert "[DDP] multiple forward support for static graph (#103487 )" (#103873 )" (#103938 ) Differential Revision: [D46883396](https://our.internmc.facebook.com/intern/diff/D46883396/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103938 Approved by: https://github.com/awgu, https://github.com/fegin	2023-06-22 21:55:58 +00:00
Huy Do	b1ddd5a293	Revert "[DDP] multiple forward support for static graph (#103487 )" (#103873 ) Per the discussion in https://github.com/pytorch/pytorch/pull/103629#issuecomment-1598001313, I preemptively create this revert PR to revert all commits in the stack. This seems like a safer option than using the bot as the commit has already been in trunk since last week. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103873 Approved by: https://github.com/rohan-varma	2023-06-20 16:25:00 +00:00
Rohan Varma	80139fc2db	[DDP] multiple forward support for static graph (#103487 ) Adds support for multiple forward before bwd call for static_graph=True. There are 2 changes: 1) Change the accounting of when to populate static graph related maps from relying on forward iteration to backward calls 2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the delay allreduce. Instead use a flag. Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103487 Approved by: https://github.com/awgu	2023-06-14 16:14:52 +00:00
Rohan Varma	780b24b27c	[DDP] Refactor _DDPSink to take DDP weakref (#103304 ) This will make future PRs to support DDP static graph multi forward cleaner. Differential Revision: [D46584545](https://our.internmc.facebook.com/intern/diff/D46584545/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103304 Approved by: https://github.com/awgu	2023-06-14 16:14:52 +00:00
Rohan Varma	a3a32c1be0	[DDP] Rename num_iterations -> num_forward_calls (#103283 ) This more accurately represents what we're counting. An iteration is a forward + backward call, but here we're just counting forward calls. This makes things less confusing in future diffs where we support DDP static graph multiple forwards. Differential Revision: [D46580601](https://our.internmc.facebook.com/intern/diff/D46580601/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103283 Approved by: https://github.com/awgu	2023-06-14 16:14:50 +00:00
Rohan Varma	2076a2ffa7	[DDP] Rename state_dict var to ddp_state (#103282 ) This name is confusing in the context that it is just a dictionary used to pass state to DDP backward pass. Differential Revision: [D46580516](https://our.internmc.facebook.com/intern/diff/D46580516/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103282 Approved by: https://github.com/awgu	2023-06-14 16:14:49 +00:00
Rohan Varma	88ce6215f5	[FSDP/DDP] Unify _cast_forward_inputs (#102680 ) Closes https://github.com/pytorch/pytorch/issues/96380 Differential Revision: [D46342814](https://our.internmc.facebook.com/intern/diff/D46342814/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102680 Approved by: https://github.com/awgu	2023-06-04 18:31:21 +00:00
Pritam Damania	9a2df0a5af	[RFC] Add method to DDP to check for backward finalization. (#100773 ) Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck. As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction. Test Plan: Added unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773 Approved by: https://github.com/rohan-varma	2023-05-31 20:43:06 +00:00
Matthew Hoffman	c28f8e314d	Add type hints in torch/distributed/utils.py (#102262 ) Fixes #77190 Pretty similar to the typing in `torch/nn/parallel`, which was also improved recently: #102194 Pull Request resolved: https://github.com/pytorch/pytorch/pull/102262 Approved by: https://github.com/Skylion007, https://github.com/Neilblaze	2023-05-30 19:57:45 +00:00
Aaron Gokaslan	3e2ea32dab	[BE]: Enable ruff rule TRY302 and apply fixes (#101874 ) Removes useless try statements and unreachable code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874 Approved by: https://github.com/malfet	2023-05-19 17:30:52 +00:00
Xing Liu	0731420645	[PyTorch/Distributed]Only sync buffers when broadcast_buffers is True (#100729 ) Summary: Disable buffers sync in _sync_module_states(...) when broadcast_buffers is False. This change will reduce memory usage when a model has huge buffers and does not need broadcast buffers. Test Plan: . Differential Revision: D45610709 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100729 Approved by: https://github.com/mrshenli	2023-05-08 16:34:29 +00:00
Rohan Varma	87db02ea38	[DDP] Perform input casting in pre forward (#100131 ) This is so that replicate can also have the feature to cast its inputs, which it currently does not. Next diff will change replicate pre hook to support this. Differential Revision: [D45335179](https://our.internmc.facebook.com/intern/diff/D45335179/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/100131 Approved by: https://github.com/zhaojuanmao	2023-04-27 17:34:46 +00:00
Aaron Gokaslan	e2a3817dfd	[BE] Enable C419 rule for any all shortcircuiting (#99890 ) Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890 Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet	2023-04-25 15:02:13 +00:00
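As a rough illustration of the C419 rewrite described in the entry above: passing a generator expression to `any`/`all` instead of a list comprehension lets evaluation stop at the first decisive element. This is a minimal sketch in plain Python, not code from the PR; `is_even_logged` and `calls` are hypothetical names used only for this example.

```python
calls = []  # hypothetical record of which values were inspected


def is_even_logged(n):
    """Record the call, then report whether n is even."""
    calls.append(n)
    return n % 2 == 0


values = [2, 3, 4, 5]

# List comprehension: every element is evaluated before any() sees the list.
calls.clear()
result_list = any([is_even_logged(v) for v in values])
calls_with_list = list(calls)

# Generator expression: any() stops consuming at the first truthy result.
calls.clear()
result_gen = any(is_even_logged(v) for v in values)
calls_with_gen = list(calls)
```

Both forms return the same boolean; only the amount of work differs.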
Rohan Varma	bba2090831	Enable fused optimizer for DP (#98270 ) Differential Revision: [D42714482](https://our.internmc.facebook.com/intern/diff/D42714482/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D42714482/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/98270 Approved by: https://github.com/awgu	2023-04-13 20:16:32 +00:00
Kazuaki Ishizaki	a531a464fd	Fix typos under torch/nn directory (#97594 ) This PR fixes typos in comments of `.py` files under `torch/nn` directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/97594 Approved by: https://github.com/dagitses, https://github.com/kit1980	2023-04-10 22:07:15 +00:00
Edward Z. Yang	9a8f71f23e	Convert logging f-strings to use % format (#98697 ) Codemod done with https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with assistance from ChatGPT. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697 Approved by: https://github.com/voznesenskym	2023-04-10 12:19:31 +00:00
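One commonly cited reason for the f-string to `%`-format conversion in the entry above is that the standard `logging` module defers `%` formatting until a record is actually emitted. The sketch below assumes that motivation (the commit message does not state it); `Expensive` and the logger name are hypothetical.

```python
import logging


class Expensive:
    """Hypothetical object that counts how often it is converted to a string."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "expensive"


logger = logging.getLogger("lazy_format_demo")
logger.setLevel(logging.INFO)  # DEBUG records are dropped by this logger
logger.addHandler(logging.NullHandler())

eager, lazy = Expensive(), Expensive()

# The f-string is built before debug() is even called.
logger.debug(f"value: {eager}")

# The %-style arguments are only formatted if the record is emitted,
# which does not happen here because DEBUG is disabled.
logger.debug("value: %s", lazy)
```

After running this, `eager.str_calls` is 1 and `lazy.str_calls` is 0, since only the f-string variant paid for the conversion.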
feifan	d95ee64b58	ddp forward support custom backend. (#98283 ) Currently DDP only considers the CUDA backend, and DDP forward will transfer tensors to CUDA. We want DDP to run on custom backends. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98283 Approved by: https://github.com/ezyang	2023-04-09 01:30:42 +00:00
Sergii Dymchenko	477f3f555f	Simplify by using yield from (#97831 ) The issues were found by SIM104 flake8-simplify in a local run. I'll take a look on adding the check to the CI separately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97831 Approved by: https://github.com/Skylion007	2023-03-29 19:15:24 +00:00
Charlie Yan	44e73db3c2	[2/n] Consolidate `replicate` and `DDP`: split `forward` function (#96658 ) Split `forward` function into `pre_forward` and `post_forward`, so that they can be reused in the composable API of `replicate`. Differential Revision: [D44377456](https://our.internmc.facebook.com/intern/diff/D44377456) Pull Request resolved: https://github.com/pytorch/pytorch/pull/96658 Approved by: https://github.com/rohan-varma	2023-03-29 13:57:16 +00:00
Pritam Damania	e20e5f5578	[RFC] Add an API to remove autograd hooks from DDP (#96490 ) Summary: When creating a new DDP instance for the same model when an old DDP instance existed, the autograd hooks from the old DDP instance might not be cleared. Also, relying on python gc to clear out old autograd hooks is fragile and may not work 100% of the time. As a result, in this PR I'm adding a way to explicitly remove these hooks from DDP Test Plan: Unit test added Pull Request resolved: https://github.com/pytorch/pytorch/pull/96490 Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma	2023-03-21 02:56:16 +00:00
Charlie Yan	13538c88b3	[1/n] Consolidate `replicate` and `DDP`: setup ufmt for `distributed.py` (#96597 ) As we already enabled ufmt for composable APIs in https://github.com/pytorch/pytorch/pull/90873, it seems a good idea to enable ufmt for other distributed APIs as well. This change setup ufmt for DDP. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96597 Approved by: https://github.com/rohan-varma	2023-03-17 06:25:11 +00:00
Rohan Varma	71adb32ddc	[DDP] API to get data parallel parameters (#95097 ) Add a private API to retrieve data parallel parameters. This is useful for example for apply_optimizer_in_backward in the case user wishes to ensure it is applied only on DDP managed parameters. Differential Revision: [D43383878](https://our.internmc.facebook.com/intern/diff/D43383878/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/95097 Approved by: https://github.com/zhaojuanmao, https://github.com/fegin	2023-03-16 00:30:37 +00:00
Xinfeng	906a1952c6	[DDP] Enable delayed all reduce in DDP (#96673 ) Summary: Enable the functionality of delaying all reduce in DDP to specify the parameters whose all reduce will be hooked to a specific param. This prevents AllReduce blocking All2All in some recommendation models. Test Plan: GitHub CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/96673 Approved by: https://github.com/zhaojuanmao	2023-03-14 04:25:25 +00:00
Rohan Varma	32f11f58c9	DDP native mixed precision (#92882 ) Implements native mixed precision support for DDP in a similar fashion to how it is enabled for FSDP. The implementation works as follows: 1. In DDP init, we save `_mp_param` and `_fp_param` variables to manage mixed precision parameter usage. In particular, _mp_param will represent the parameter in the reduced precision, while _fp_param will represent the param in regular precision. During forward/backward, we swap back and forth as needed. 2. The root module gets a root pre-forward hook that kicks off copies to the reduced precision for all submodules. An event is recorded for each submodule to allow for waiting, as we run these asynchronously. 3. Each module gets a pre-forward hook that waits on its corresponding event. note that modules might be reused during training, in this case the wait is only done for the first module. After this wait, the module's parameters are in reduced precision. 4. In the pre-forward hook, we register a backward hook on the lower precision parameters in order to run reduced precision allreduce + parameter upcast. We can't rely on the Reducer's constructor setting up these hooks because the gradient is accumulated on the low precision param, so we need to register them ourselves. 5. In the backward pass, when the hook runs, we first run allreduce + divide in the reduced precision. Next, we upcast parameters and gradients back to fp32 asynchronously. We also queue a callback at the end of backward to wait on these upcasts so that the upcast is complete before optim.step() runs. 6. Parameters that don't require grad are also cast since they may be used in computation, they are upcast back in the final autograd callback. 7. DDP Ignored parameters are not touched. Follow-ups: 1. Unify comm hooks and make it work with apply optimizer in backward 2. implement keep_low_precision_grads, 3. allow BN, LN, or custom units to run in reduced precision, 4. 
support for cast_forward_inputs 5. Unify certain APIs / helpers with FSDP where possible, such as for _cast_forward_inputs 6. Integrate this with replicate() API. 7. The order in which we kick off copies and wait for them is set by the iteration order of module.modules(), but this might not be how the modules are used in the actual training. In the worst case, the last module in module.modules() could be used first which would result in waiting for all copies unnecessarily. For static graphs, we should record the module execution order and copy / wait in this order. 8. Entirely unused modules probably don't need to be cast. Differential Revision: [D42515803](https://our.internmc.facebook.com/intern/diff/D42515803/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/92882 Approved by: https://github.com/zhaojuanmao	2023-03-13 14:10:31 +00:00
fduwjj	a88bfc60c7	[2/N][ST deprecate][BE] Remove Replicate Tensor convert from DDP and PTD (#95450 ) No use is found for this ST/Replicated Tensor based DDP. As part of ShardedTensor migration, let's remove this logic. Trying to undo everything in https://github.com/pytorch/pytorch/pull/75753. Pull Request resolved: https://github.com/pytorch/pytorch/pull/95450 Approved by: https://github.com/wanchaol	2023-02-26 03:03:37 +00:00
Xuehai Pan	b005ec62b9	[BE] Remove dependency on `six` and `future` (#94709 ) Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709 Approved by: https://github.com/malfet, https://github.com/Skylion007	2023-02-14 09:14:14 +00:00
Aaron Gokaslan	67d9790985	[BE] Apply almost all remaining flake8-comprehension checks (#94676 ) Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving it into just the set call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676 Approved by: https://github.com/ezyang	2023-02-12 01:01:25 +00:00
Xuehai Pan	5b1cedacde	[BE] [2/3] Rewrite `super()` calls in functorch and torch (#94588 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-02-10 21:16:33 +00:00
Aaron Gokaslan	1e2d82b8e4	[BE] Merge isinstance calls together (#94419 ) Simplify and speeds up isinstance calls by checking for multiple types at the same time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94419 Approved by: https://github.com/ezyang	2023-02-09 00:47:26 +00:00
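The isinstance merge in the entry above can be sketched as follows. This is an illustrative example rather than code from the PR; the function names and sample values are hypothetical.

```python
def is_number_separate(x):
    # Before: one isinstance call per candidate type.
    return isinstance(x, int) or isinstance(x, float)


def is_number_merged(x):
    # After: a single call with a tuple of candidate types.
    return isinstance(x, (int, float))


samples = [1, 2.5, "3", None]
separate_results = [is_number_separate(s) for s in samples]
merged_results = [is_number_merged(s) for s in samples]
```

The two functions agree on every input; the merged form simply makes one call instead of two.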
Aaron Gokaslan	748bac8757	[BE]: Apply pyupgrade yield from and unit test alias upgrades (#94309 ) Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit_tests and upgrades more yield for-loops into yield from generators, which are more performant and propagate more information / exceptions from the original generator. This is the modern recommended way of forwarding generators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309 Approved by: https://github.com/albanD	2023-02-07 20:08:58 +00:00
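To make the `yield from` upgrade in the entry above concrete, here is a small sketch in plain Python (not taken from the PR; all function names are hypothetical). It shows one piece of the extra information that `yield from` forwards: the inner generator's return value, which a plain for-loop discards.

```python
def inner():
    yield 1
    yield 2
    return "done"  # becomes the value of the `yield from` expression


def forward_with_loop():
    # Older style: re-yield each item; inner()'s return value is lost.
    for item in inner():
        yield item


def forward_with_yield_from():
    # Modern style: delegate to inner() and capture its return value.
    result = yield from inner()
    yield result


loop_items = list(forward_with_loop())
delegated_items = list(forward_with_yield_from())
```

`yield from` also forwards values sent with `send()` and exceptions raised with `throw()` to the inner generator, which the for-loop version does not.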
Rohan Varma	264c89658b	Move in backward opt setup to helper (#92059 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/92059 Approved by: https://github.com/awgu	2023-02-02 23:57:14 +00:00
Rohan Varma	975feb606e	[DDP][Easy] Remove unused var (#93128 ) removes this unused var, the overall buffer comm hook feature is also not being used, we should deprecate / remove it as it is still a private API. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/93128 Approved by: https://github.com/awgu	2023-01-27 18:08:29 +00:00
Shen Li	0035340488	Allow DDP to handle custom dataclass forward outputs (#92334 ) Differential Revision: [D42554973](https://our.internmc.facebook.com/intern/diff/D42554973) Pull Request resolved: https://github.com/pytorch/pytorch/pull/92334 Approved by: https://github.com/zhaojuanmao	2023-01-18 14:51:37 +00:00
kshitij12345	745fe35df5	[follow-up] Python Attr Serialization (#88913 ) Ref: https://github.com/pytorch/pytorch/pull/81616#issuecomment-1307595402 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88913 Approved by: https://github.com/albanD	2023-01-13 17:38:51 +00:00
joncrall	ad782ff7df	Enable xdoctest runner in CI for real this time (#83816 ) Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816 Approved by: https://github.com/ezyang, https://github.com/malfet	2022-12-29 05:32:42 +00:00
Rohan Varma	e8bf7c21e4	Integrate apply_optim_in_backward with DDP (#89194 ) Allow _apply_optim_in_backward to work with DDP. Example: ``` dist.init_process_group("nccl", rank=rank, world_size=2) torch.cuda.set_device(rank) e = enc().cuda(rank) _apply_optimizer_in_backward( optimizer_class=torch.optim.SGD, params=e.parameters(), optimizer_kwargs={"lr": 0.03}, ) e = DDP(e, device_ids=[rank]) inp = torch.randn(1, 10, device=rank) e(inp).sum().backward() ``` Constraints: 1. Custom communication hook not yet supported 2. _apply_optim_in_backward needs to be called _before_ wrapping model in DDP. 3. DDP will remove the gradient hooks _apply_optim_in_backward registers, so these gradient hooks will not be fired and cannot be used. 4. All DDP managed parameters have grads set to None by default once optimizer is applied. There is no support for setting only some parameter grads to None, this must be done manually by user (and DDP_OVERLAPPED_OPTIM_SET_GRADS_TO_NONE=0 needs to be set.) Differential Revision: [D41329694](https://our.internmc.facebook.com/intern/diff/D41329694/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41329694/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/89194 Approved by: https://github.com/zhaojuanmao	2022-12-21 07:35:19 +00:00
Sergii Dymchenko	9ef1d55e6b	Fix non-existing parameters in docstrings in torch/nn (#90596 ) This is a continuation of https://github.com/pytorch/pytorch/pull/90505 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90596 Approved by: https://github.com/lezcano	2022-12-10 14:37:31 +00:00
Ram Rachum	77f9b2e8bf	Fix exception causes in fx, nn and onnx packages (#90134 ) This is a continuation of #90118 @kit1980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/90134 Approved by: https://github.com/kit1980	2022-12-06 04:34:58 +00:00
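The commit message above does not spell out the fix, so the following is an assumption: "exception causes" most likely refers to explicit exception chaining with `raise ... from ...`. A minimal sketch in plain Python, with the hypothetical helper `parse_port`:

```python
def parse_port(text):
    """Hypothetical helper that converts text to an integer port."""
    try:
        return int(text)
    except ValueError as err:
        # `from err` records the original error as __cause__, so the
        # traceback reads "The above exception was the direct cause of ..."
        raise RuntimeError(f"invalid port: {text!r}") from err


try:
    parse_port("eighty")
except RuntimeError as exc:
    caught = exc
```

Without `from err`, Python would still attach the original error as `__context__`, but the traceback would describe it as happening "during handling" of another exception rather than as the direct cause.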
Andrew Gu	bfffc8d8ef	[DDP][Docs] Add warning that `no_sync()` should include forward (#89244 ) The issue where the user only includes `loss.backward()` inside `no_sync()` but not the forward pass has arisen several times now. I think adding an explicit warning in the docs is worthwhile. Rendered doc: <img width="769" alt="Screen Shot 2022-11-17 at 9 21 32 PM" src="https://user-images.githubusercontent.com/31054793/202602005-22c000b7-1093-4eaf-ba66-9c929a66906b.png"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/89244 Approved by: https://github.com/zhaojuanmao	2022-11-18 22:06:24 +00:00
Colin Taylor	24b9890f03	[torchrec] [composable] update ShardedEmbeddingBagCollection to be use registered EBCs with shardedTensors as registered modules (#758 ) (#88026 ) Summary: X-link: https://github.com/pytorch/torchrec/pull/758 This PR fixes a bug in FSDP/DDP, where ShardedTensors are not supported even if passed in as params to ignore. this is important for composability because TorchRec named_parameters() will return FQN of shardedTensors (as defined in goals) It defines device of ShardedTensor to be None when local_tensor() does not exist on rank update ShardedEmbeddingBagCollection to be composable according to https://docs.google.com/document/d/1TBJSd5zgEg6cRcXv3Okuj7bBkqQwGS2IPh4TLWNNzFI/edit Differential Revision: D40458625 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88026 Approved by: https://github.com/wanchaol, https://github.com/rohan-varma	2022-11-17 04:26:13 +00:00
Charlie Yan	8523c45717	Delete stub file to enable mypy check (#4649 ) (#88701 ) Summary: X-link: https://github.com/facebookresearch/detectron2/pull/4649 Context in https://fburl.com/4irjskbe This change deletes distributed.pyi, so that lintrunner will run mypy on distributed.py for typing check. Test Plan: CI Differential Revision: D41028360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88701 Approved by: https://github.com/zhaojuanmao	2022-11-09 20:29:34 +00:00
Will Constable	678d038001	Support DDP ignored parameters in DDPOptimizer (#88460 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88460 Approved by: https://github.com/aazzolini	2022-11-04 21:42:15 +00:00
Kazuaki Ishizaki	2ddefbdc3c	Fix typos used in documents under torch directory (#88300 ) This PR fixes typos, in comments of Python files, that are found from a search box at https://pytorch.org/docs/master/search.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300 Approved by: https://github.com/lezcano	2022-11-02 09:38:13 +00:00
Horace He	12dd877395	Fix all references to torchdynamo from the merge (#87731 ) cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel Pull Request resolved: https://github.com/pytorch/pytorch/pull/87731 Approved by: https://github.com/yanboliang, https://github.com/ezyang, https://github.com/anijain2305, https://github.com/jansel	2022-10-31 06:51:07 +00:00
PyTorch MergeBot	641d8e0e69	Revert "Enable mypy check for distributed.py, and fix type errors (#87543 )" This reverts commit `2cc624cd43`. Reverted https://github.com/pytorch/pytorch/pull/87543 on behalf of https://github.com/weiwangmeta due to breaking internal builds	2022-10-28 02:20:25 +00:00
Charlie Yan	2cc624cd43	Enable mypy check for distributed.py, and fix type errors (#87543 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87543 Approved by: https://github.com/fduwjj	2022-10-27 00:22:54 +00:00
Charlie Yan	0294787bd6	Format distributed.py (#87667 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87667 Approved by: https://github.com/zhaojuanmao	2022-10-26 06:02:30 +00:00
Charlie Yan	bebd162249	Fix doc of DDP (#86244 ) (#86256 ) [ghstack-poisoned] Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86256 Approved by: https://github.com/rohan-varma	2022-10-06 00:48:56 +00:00
Rohan Varma	be4e43c7d0	Remove DataParallel remnants from DDP doc (#86221 ) As @aazzolini pointed out, the docstring is incorrect and probably vestige from DP / single process multi device mode in DDP. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/86221 Approved by: https://github.com/aazzolini	2022-10-05 22:30:02 +00:00
Will Constable	32fc0b958e	Expose get_active_ddp_module api for torchdynamo DDP (#83333 ) Pairs up with torchdynamo PR https://github.com/pytorch/torchdynamo/pull/628 Exposes a new API that lets torchdynamo know when it is compiling the 'forward' of a module that is inside a DDPmodule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83333 Approved by: https://github.com/mrshenli	2022-09-17 02:10:25 +00:00
joncrall	4618371da5	Integrate xdoctest - Rebased (#82797 ) This is a new version of #15648 based on the latest master branch. Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR. In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.) Fixes https://github.com/pytorch/pytorch/issues/71105 @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797 Approved by: https://github.com/ezyang	2022-08-12 02:08:01 +00:00
Hubert Lu	cd18b78daa	[ROCm] Enable bf16-related tests in test_c10d_nccl.py and test_grad_layout_1devicemodule_1replicaperprocess (#82020 ) ### Description Enable bf16-related unit tests in test_c10d_nccl.py and test_grad_layout_1devicemodule_1replicaperprocess as follows: - distributed/test_c10d_nccl test_bf16_compress_wrapper_is_view (main.DistributedDataParallelTest) - distributed/test_c10d_nccl test_bf16_compress_wrapper_nccl (main.DistributedDataParallelTest) - distributed/test_c10d_nccl test_grad_layout_1devicemodule_1replicaperprocess (main.DistributedDataParallelTest) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82020 Approved by: https://github.com/ezyang	2022-08-11 21:16:33 +00:00
Yi Wang	08d54b5cd5	Correct DDP example (#83034 ) remove undefined `pg` from DDP example code Pull Request resolved: https://github.com/pytorch/pytorch/pull/83034 Approved by: https://github.com/mrshenli	2022-08-09 18:58:33 +00:00
ProGamerGov	71d50f4f89	Change docstring type callable to Callable for consistency (#82487 ) ### Description Across PyTorch's docstrings, both `callable` and `Callable` for variable types. The Callable should be capitalized as we are referring to the `Callable` type, and not the Python `callable()` function. ### Testing There shouldn't be any testing required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487 Approved by: https://github.com/albanD	2022-08-01 17:26:09 +00:00
anjali411	3bcc19b29a	Add __all__ to various submodules in torch.fx, distributions, distributed, package (#80367 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/80367 Approved by: https://github.com/albanD	2022-06-27 21:27:30 +00:00
Rohan Varma	e7cb44b6c4	Guard distributed imports (#77727 ) Move distributed import after dist.is_avail check to fix builds with USE_DISTRIBUTED=0. Although, note that this issue is not caught by any CI at the moment. Closes https://github.com/pytorch/pytorch/issues/77704 Pull Request resolved: https://github.com/pytorch/pytorch/pull/77727 Approved by: https://github.com/malfet	2022-05-18 11:27:52 +00:00
Rohan Varma	6f954d7bbb	FSDP parameter sync Pull Request resolved: https://github.com/pytorch/pytorch/pull/77492 Approved by: https://github.com/zhaojuanmao	2022-05-17 19:58:49 +00:00
Rohan Varma	bbb1f106c7	Separate input moving to utils file Pull Request resolved: https://github.com/pytorch/pytorch/pull/77187 Test fix Pull Request resolved: https://github.com/pytorch/pytorch/pull/77235 Lint fix Approved by: https://github.com/awgu	2022-05-11 21:55:38 +00:00
Rohan Varma	ffb0946504	Generalize param verification and broadcast New PR for https://github.com/pytorch/pytorch/pull/75970 to be compatible with GHF. Pull Request resolved: https://github.com/pytorch/pytorch/pull/76374 Approved by: https://github.com/awgu	2022-04-26 22:25:53 +00:00
pritam	b26df43f15	Fix bug where __getstate__ of DDP looks for self._replicated_tensor_module Pull Request resolved: https://github.com/pytorch/pytorch/pull/76349 When we are not using ReplicatedTensor in DDP and try to save a DDP module it will error out since it tries to delete the _replicated_tensor_module attribute. Fixing this by checking if this mode is enabled before triggering the delete. Differential Revision: [D35875167](https://our.internmc.facebook.com/intern/diff/D35875167/) Approved by: https://github.com/mrshenli, https://github.com/zhaojuanmao	2022-04-26 02:49:49 +00:00
pritam	3a38f175dd	Convert DDP parameters to ReplicatedTensor during forward pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/75753 As per the design in https://github.com/pytorch/pytorch/issues/72138, convert DDP parameters to ReplicatedTensor during its forward pass. Concretely, this is done as follows: 1) Create a separate `_replicated_tensor_module` which is a copy of self.module without creating copies of the Tensors themselves. 2) Use `_replicated_tensor_module` instead of `self.module` during the forward pass. 3) Have a context manager `_ddp_replicated_tensor` to enable this, since certain edge cases can fail where self.module is changed out of band resulting in discrepancy between self.module and `_replicated_tensor_module`. Differential Revision: [D35533736](https://our.internmc.facebook.com/intern/diff/D35533736/) Approved by: https://github.com/wanchaol, https://github.com/rohan-varma	2022-04-18 03:27:23 +00:00
Junjie Wang (PyTorch)	0a6ac31797	[PT-D][DDP][BE] Add unit tests for Forward and Backward Hook (#74063 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74063 Address the issue https://github.com/pytorch/pytorch/issues/66229 as part of BE effort. Basically: 1. We remove the stale comment which confuses users. 2. Add more unit tests to test the forward/backward hook working for DDP. ghstack-source-id: 151463380 Test Plan: CI Reviewed By: rohan-varma Differential Revision: D34800830 fbshipit-source-id: 21133209323b2b5eda0dd6472f6309d4fb779b97 (cherry picked from commit b9b165c8305572128395daffafc13fcac8b85e29)	2022-03-16 23:18:28 +00:00
Shihao Xu	bcd0843bec	[torch.distributed][DDP] Disable DDP bucketing for the first iteration (#72843 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72843 # [Debug Story] Training Hanging and DDP Bucketing What are the characteristics of the hanging training instance? The model uses TorchRec `PooledEmbeddingArch` and corresponding sharding solution. The model config difference to trigger this hanging issue is turning on position weighted embedding tables. A feature processor module, `GroupedPositionWeightedModule`, is constructed on all ranks, but `GroupedPositionWeightedModule.foward(...)` is only [called on subset ranks of the whole world](https://fburl.com/code/yqrmtvli). What was the initial manifested error? The training was stuck in the first iteration. What are useful debugging tools this time? After turning off [static_graph in DDP](https://fburl.com/code/4io81p5i), we saw there were sparse feature lengths becoming negative values after all-to-all collectives. Hanging becomes fatal failure. After turning on [torch.distributed DETAIL debugging mode](https://fburl.com/code/cp8e28mm), we saw 2 trainers sent out mismatched collectives, one doing all-to-all, the other doing all-reduce. So we know the negative values comes from all-to-all being matched with all-reduce. the error had happened ahead, which is the wrong timing of either doing all-reduce or all-to-all. With more added loggings inside of DDP, it turned out the DDP decided to do all-reduce at different timings across different ranks. What is DDP bucketing? Once a gradient is ready on a rank, DDP uses all-reduce to synchronize the average of this gradient across all ranks. Say we have 4 tensor ops. A, B, C, D. In the most naive version, we could do one synchronization when all gradients in the full backward graph are ready. The time sequence would be, * D.grad * C.grad * B.grad * A.grad * All reduce on [D.grad, C.grad, B.grad, A.grad]. 
But that would be a huge waste of communication channel bandwidth. With DDP bucketing, we could put ahead some gradient synchronization batch by batch. The above time sequence now becomes, * D.grad * C.grad * All reduce on [D.grad, C.grad]. * B.grad * A.grad * All reduce on [B.grad, A.grad]. With gradient computation overlaps with communication, bucketing technique brings better DDP execution performance. What exactly went wrong in this case? 1. The bucketing doesn’t honor backward graph execution order. 2. There are other collectives comm ops in backward graph. 3. There are unused parameters (i.e unused sub-module) in subset ranks of the whole world. Using the above example again, we have 4 tensor ops. A, B, C, D. Say we have 2 trainers, B is the feature processor module. B only runs on trainer 0 (both forward and backward), but not on trainer1. C is the All-to-all (Pooled embeddings distribution). C sends out all-to-all collective in both its forward and backward pass. Keep assuming all other ops run on both trainers. trainer_0 op sequence is, A, B (feature preproc), C (all-to-all), D \| D.grad, C.grad (reverse all-to-all), B.grad (feature proc grads), A.grad trainer_1 op sequence is, A, C (all-to-all), D \| D.grad, C.grad (reverse all-to-all), A.grad Even though the correct bucketing should be (same bucketing for both ranks), * bucket_0, [D.grad, C.grad] * bucket_1, [B.grad, A.grad] but because of 1), they end up like, * bucket_0, [B.grad, D.grad] * bucket_1, [C.grad, A.grad] Plus 2) and 3), the time sequence could like, (check mark represents the gradient is ready) (bucket is ready to do synchronization if all its enclosing gradients are ready) * trainer_0 * t0, * D.grad * bucket_0, [B.grad, D.grad ✓] * t1, * C.grad all-to-all * C.grad ✓ * bucket_1, [C.grad ✓, A.grad] * t2 * B.grad * bucket_0, [B.grad ✓, D.grad ✓] ✓ * t3 * All-reduce for bucket_0 * t4 * A.grad * bucket_1, [C.grad ✓, A.grad ✓] ✓ * trainer_1 * t0, * D.grad * bucket_0, [B.grad ✓, D.grad ✓] ✓. 
(Because B is not used on trainer_1, DDP marks its gradient as ready immediately.) * t1, * All-reduce for bucket_0 * t2 * C.grad all-to-all * bucket_1, [C.grad ✓, A.grad] * t3 * A.grad * bucket_1, [C.grad ✓, A.grad ✓] ✓ This is why trainer_0 all-to-all is matched up with trainer_1 all-reduce. What is the solution for fixing DDP? Disable DDP bucketing for the first iteration. D34051938 This is because after the first iteration, buckets will be built again based on real backward graph execution order. So the slow gradient synchronization only affects the first iteration. Test Plan: buck build mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn BACKEND=gloo WORLD_SIZE=3 buck-out/gen/caffe2/test/distributed/distributed_gloo_spawn\#binary.par -r test_ddp_logging_data_cpu P484179296 buck build mode/dev-nosan caffe2/test/distributed:distributed_nccl_spawn BACKEND=nccl WORLD_SIZE=2 buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn\#binary.par -r test_ddp_logging_data_cpu -r test_ddp_get_bucket_sizes P484177200 Reviewed By: zhaojuanmao Differential Revision: D34051938 fbshipit-source-id: 0c7f35875687095c3199f19990e73a8349b6e5b9 (cherry picked from commit bb8f11306ea51c2bd3ffd3ab001d62ce369a08ee)	2022-03-04 18:29:36 +00:00
Can Balioglu	e1db2f13ce	Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166 This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started. ghstack-source-id: 149778566 Test Plan: Run the existing unit tests. Reviewed By: rohan-varma Differential Revision: D34371226 fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b (cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)	2022-02-24 02:33:05 +00:00
Andrew Gu	59dd84cab6	[Join][BE] Fix typo; remove obsolete method (#72886 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72886 Test Plan Searching for `_schedule_shadow_all_reduce_for_fwd_pass` shows that it is defined but never used. Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D34255651 Pulled By: awgu fbshipit-source-id: 205a0325c2cdc05e127a183cb86fa2fc2e0db99d (cherry picked from commit `4492f03a3f`)	2022-02-16 15:03:09 +00:00
Yuxin Wu	1ed4653e89	Stop writing logs to root logger (#72649 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/72648 Pull Request resolved: https://github.com/pytorch/pytorch/pull/72649 Reviewed By: soulitzer Differential Revision: D34172113 Pulled By: mrshenli fbshipit-source-id: 98cb4140b978a0d9fa53876e427ea3b8bbe884cf (cherry picked from commit `c14297cee6`)	2022-02-11 21:30:53 +00:00
Rohan Varma	4feef6c970	Log static graph in constructor if it is set (#72456 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72456 It is easier to log if static graph is set at construction time now that it is natively supported in DDP constructor, as opposed to waiting for the first iteration to finish. In some failure cases we're seeing the first iteration does not finish and thus we don't have this data which is valuable to debug. ghstack-source-id: 148840679 Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D34045204 fbshipit-source-id: 72a187c1ce031db217de4b3ad20a64f2a74995bc (cherry picked from commit `1d622c88f3`)	2022-02-11 15:55:09 +00:00
Rohan Varma	37651894f9	[Easy] Small DDP fixes (#72455 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72455 - Improve helper function - Improve/fix some logging ghstack-source-id: 148840678 Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D34044865 fbshipit-source-id: d2ae820effaaaecdd7155ffa8d3a1d8ebbd9f39e (cherry picked from commit `3efbea8f41`)	2022-02-11 15:55:09 +00:00
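
The bucketing hang described in commit bcd0843bec above (#72843) can be reproduced in a few lines of plain Python. This is a toy model under stated assumptions, not DDP's reducer: `collective_sequence`, the event tuples, and the bucket dictionaries are hypothetical names invented for illustration. It assumes a bucket issues its all-reduce as soon as every gradient in it is marked ready, and that gradients of parameters unused on a rank are marked ready up front, as the commit message describes for trainer_1.

```python
# Toy model of the scenario in commit bcd0843bec. It does not use PyTorch
# and is not DDP's actual reducer logic; all names here are hypothetical.

def collective_sequence(buckets, events, unused=()):
    """Return the ordered collectives one rank issues during backward.

    buckets: dict mapping bucket name -> list of gradient names
    events:  backward events in execution order, each ("grad", name) or
             ("all_to_all", name)
    unused:  gradients treated as ready immediately on this rank
    """
    ready = set(unused)
    fired = set()
    issued = []
    for kind, name in events:
        if kind == "all_to_all":
            # A non-DDP collective that is part of the backward graph.
            issued.append("all_to_all")
            continue
        ready.add(name)
        # A bucket all-reduces once all of its gradients are ready.
        for bucket, grads in buckets.items():
            if bucket not in fired and all(g in ready for g in grads):
                fired.add(bucket)
                issued.append(f"all_reduce({bucket})")
    return issued


# Backward order from the commit message: B runs only on trainer_0,
# and C performs an all-to-all before its gradient is ready.
BACKWARD_TRAINER_0 = [("grad", "D"), ("all_to_all", "C"), ("grad", "C"),
                      ("grad", "B"), ("grad", "A")]
BACKWARD_TRAINER_1 = [("grad", "D"), ("all_to_all", "C"), ("grad", "C"),
                      ("grad", "A")]

# Buckets that ignore backward execution order vs. buckets that follow it.
MISORDERED = {"bucket_0": ["B", "D"], "bucket_1": ["C", "A"]}
ORDERED = {"bucket_0": ["D", "C"], "bucket_1": ["B", "A"]}


def sequences(buckets):
    """Collective sequences for (trainer_0, trainer_1) under `buckets`."""
    return (
        collective_sequence(buckets, BACKWARD_TRAINER_0),
        collective_sequence(buckets, BACKWARD_TRAINER_1, unused={"B"}),
    )


for label, buckets in (("misordered", MISORDERED), ("ordered", ORDERED)):
    t0, t1 = sequences(buckets)
    print(label, "match" if t0 == t1 else "MISMATCH")
    print("  trainer_0:", t0)
    print("  trainer_1:", t1)
```

With the misordered buckets, trainer_0 issues its all-to-all first while trainer_1 issues the all-reduce for bucket_0 first, which is the mismatched pairing the commit message reports. With buckets that follow backward execution order, both ranks issue the same sequence.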

385 Commits