https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular if an error occurred during backward, DDP would end up in an incorrect state.
As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
# Motivation
If we would like to reinitialize DDP with a different PG with `torch.compile`, we need to do the following:
```
del old_ddp
del old_pg
pg = init_pg(...)
ddp = DDP(pg)
model = torch.compile(DDP)
```
This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again.
# Proposal
As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580
Approved by: https://github.com/fduwjj
Fixes#112604
Fixes docstring by following `pydocstyle` outputs.
- torch/nn/parallel/distributed.py
Before: 84
```
torch/nn/parallel/distributed.py:1 at module level:
D100: Missing docstring in public module
torch/nn/parallel/distributed.py:92 in private function `_cast_buffers`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`:
D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
torch/nn/parallel/distributed.py:143 in private function `_find_tensors`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:273 in private method `__init__`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:273 in private method `__init__`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/nn/parallel/distributed.py:287 in private method `main_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:287 in private method `main_hook`:
D400: First line should end with a period (not 'd')
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
D400: First line should end with a period (not 'l')
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`:
D400: First line should end with a period (not 'n')
torch/nn/parallel/distributed.py:633 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`:
D401: First line should be in imperative mood (perhaps 'Fire', not 'Fires')
torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`:
D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`:
D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
D400: First line should end with a period (not ':')
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
D401: First line should be in imperative mood (perhaps 'Initialize', not 'Initialization')
torch/nn/parallel/distributed.py:1146 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1154 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
D400: First line should end with a period (not 'o')
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
D401: First line should be in imperative mood (perhaps 'Assign', not 'Assigns')
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
D400: First line should end with a period (not 'P')
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
D401: First line should be in imperative mood; try rephrasing (found 'A')
torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`:
D403: First word of the first line should be properly capitalized ('Torchdynamo', not 'TorchDynamo')
torch/nn/parallel/distributed.py:1517 in public method `forward`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1527 in public method `scatter`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1530 in public method `to_kwargs`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1539 in public method `gather`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1542 in public method `train`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1617 in public method `join`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1617 in public method `join`:
D400: First line should end with a period (not 'f')
torch/nn/parallel/distributed.py:1617 in public method `join`:
D401: First line should be in imperative mood; try rephrasing (found 'A')
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
D400: First line should end with a period (not 'y')
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:1752 in public method `join_device`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1756 in public method `join_process_group`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
D401: First line should be in imperative mood (perhaps 'Allow', not 'Allows')
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
D400: First line should end with a period (not 'a')
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
D400: First line should end with a period (not 'P')
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
D400: First line should end with a period (not 'a')
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:2005 in public method `will_sync_module_buffers`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`:
D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
D400: First line should end with a period (not 'r')
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
D400: First line should end with a period (not 'g')
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
D400: First line should end with a period (not 'l')
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
D401: First line should be in imperative mood; try rephrasing (found 'It')
torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`:
D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
D400: First line should end with a period (not 'd')
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
D401: First line should be in imperative mood (perhaps 'Check', not 'Checks')
84
```
After: 12
```
torch/nn/parallel/distributed.py:1 at module level:
D100: Missing docstring in public module
torch/nn/parallel/distributed.py:618 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/parallel/distributed.py:1133 in public method `__getstate__`:
D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1141 in public method `__setstate__`:
D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1503 in public method `forward`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1513 in public method `scatter`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1516 in public method `to_kwargs`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1525 in public method `gather`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1528 in public method `train`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1734 in public method `join_device`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1738 in public method `join_process_group`:
D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1986 in public method `will_sync_module_buffers`:
D102: Missing docstring in public method
12
```
- torch/nn/utils/_named_member_accessor.py
Before: 23
```
torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`:
D400: First line should end with a period (not 's')
torch/nn/utils/_named_member_accessor.py:115 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/utils/_named_member_accessor.py:122 in public method `get_submodule`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:155 in public method `swap_submodule`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:164 in public method `get_tensor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:185 in public method `set_tensor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:194 in public method `del_tensor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:211 in public method `swap_tensor`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:224 in public method `get_tensors`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:233 in public method `set_tensors`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:249 in public method `set_tensors_dict`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:261 in public method `del_tensors`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:276 in public method `swap_tensors`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:296 in public method `swap_tensors_dict`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:325 in public method `check_keys`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:340 in public method `named_parameters`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:349 in public method `named_buffers`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:358 in public method `named_tensors`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:368 in public method `named_modules`:
D200: One-line docstring should fit on one line with quotes (found 3)
23
```
After: 4
```
torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`:
D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:116 in public method `__init__`:
D107: Missing docstring in __init__
4
```
- torch/nn/utils/_per_sample_grad.py
Before: 3
```
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
D400: First line should end with a period (not ')')
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
D402: First line should not be the function's "signature"
3
```
After: 0
```
0
```
- torch/nn/utils/init.py
Before: 3
```
torch/nn/utils/init.py:1 at module level:
D100: Missing docstring in public module
torch/nn/utils/init.py:6 in public function `skip_init`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/init.py:6 in public function `skip_init`:
D400: First line should end with a period (not 'g')
3
```
After: 1
```
torch/nn/utils/init.py:1 at module level:
D100: Missing docstring in public module
1
```
- torch/nn/utils/memory_format.py
Before: 4
```
torch/nn/utils/memory_format.py:1 at module level:
D100: Missing docstring in public module
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
D400: First line should end with a period (not '`')
4
```
After: 1
```
torch/nn/utils/memory_format.py:1 at module level:
D100: Missing docstring in public module
1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112657
Approved by: https://github.com/fduwjj
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
Adds support for multiple forward before bwd call for
static_graph=True.
There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.
Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103487
Approved by: https://github.com/awgu
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.
As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction.
Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
Summary: Disable buffers sync in _sync_module_states(...) when broadcast_buffers is False. This change will memory usage when a model has huge buffers and does not need broadcast buffers.
Test Plan: .
Differential Revision: D45610709
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100729
Approved by: https://github.com/mrshenli
Summary:
When creating a new DDP instance for the same model when an old DDP instance existed, the autograd hooks from the old DDP instance might not be cleared. Also, relying on python gc to clear out old autograd hooks is fragile and may not work 100% of the time.
As a result, in this PR I'm adding a way to explicitly remove these hooks from DDP
Test Plan:
Unit test added
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96490
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
Summary: Enable the functionality of delaying all reduce in DDP to specify the parameters whose all reduce will be hooked to a specific param. This prevents AllReduce blocking All2All in some recommendation models.
Test Plan: GitHub CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96673
Approved by: https://github.com/zhaojuanmao
Implements native mixed precision support for DDP in a similar fashion to how it is enabled for FSDP. The implementation works as follows:
1. In DDP init, we save `_mp_param` and `_fp_param` variables to manage mixed precision parameter usage. In particular, _mp_param will represent the parameter in the reduced precision, while _fp_param will represent the param in regular precision. During forward/backward, we swap back and forth as needed.
2. The root module gets a root pre-forward hook that kicks off copies to the reduced precision for all submodules. An event is recorded for each submodule to allow for waiting, as we run these asynchronously.
3. Each module gets a pre-forward hook that waits on its corresponding event. note that modules might be reused during training, in this case the wait is only done for the first module. After this wait, the module's parameters are in reduced precision.
4. In the pre-forward hook, we register a backward hook on the lower precision parameters in order to run reduced precision allreduce + parameter upcast. We can't rely on the Reducer's constructor setting up these hooks because the gradient is accumulated on the low precision param, so we need to register them ourselves.
5. In the backward pass, when the hook runs, we first run allreduce + divide in the reduced precision. Next, we upcast parameters and gradients back to fp32 asynchronously. We also queue a callback at the end of backward to wait on these upcasts so that the upcast is complete before optim.step() runs.
6. Parameters that don't require grad are also cast since they may be used in computation, they are upcast back in the final autograd callback.
7. DDP Ignored parameters are not touched.
Follow-ups:
1. Unify comm hooks and make it work with apply optimizer in backward
2. implement keep_low_precision_grads,
3. allow BN, LN, or custom units to run in reduced precision,
4. support for cast_forward_inputs
5. Unify certain APIs / helpers with FSDP where possible, such as for _cast_forward_inputs
6. Integrate this with replicate() API.
7. The order in which we kick off copies and wait for them is set by the iteration order of module.modules(), but this might not be how the modules are used in the actual training. In the worst case, the last module in module.modules() could be used first which would result in waiting for all copies unnecessarily. For static graphs, we should record the module execution order and copy / wait in this order.
8. Entirely unused modules probably don't need to be cast.
Differential Revision: [D42515803](https://our.internmc.facebook.com/intern/diff/D42515803/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92882
Approved by: https://github.com/zhaojuanmao
Applies the remaining flake8-comprehension fixes and checks. This changes replace all remaining unnecessary generator expressions with list/dict/set comprehensions which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as 'set(a for a in b)`, resolving it into just the set call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit_tests and more upgrades yield for loops into yield from generators which are more performance and propagates more information / exceptions from original generator. This is the modern recommended way of forwarding generators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
removes this unused var, the overall buffer comm hook feature is also not being used, we should deprecate / remove it as it is still a private API.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93128
Approved by: https://github.com/awgu
Allow _apply_optim_in_backward to work with DDP.
Example:
```
dist.init_process_group("nccl", rank=rank, world_size=2)
torch.cuda.set_device(rank)
e = enc().cuda(rank)
_apply_optimizer_in_backward(
optimizer_class=torch.optim.SGD,
params=e.parameters(),
optimizer_kwargs={"lr": 0.03},
)
e = DDP(e, device_ids=[rank])
inp = torch.randn(1, 10, device=rank)
e(inp).sum().backward()
```
Constraints:
1. Custom communication hook not yet supported
2. _apply_optim_in_backward needs to be called _before_ wrapping model in DDP.
3. DDP will remove the gradient hooks _apply_optim_in_backward registers, so these gradient hooks will not be fired and cannot be used.
4. All DDP managed parameters have grads set to None by default once optimizer is applied. There is no support for setting only some parameter grads to None, this must be done manually by user (and DDP_OVERLAPPED_OPTIM_SET_GRADS_TO_NONE=0 needs to be set.)
Differential Revision: [D41329694](https://our.internmc.facebook.com/intern/diff/D41329694/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41329694/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89194
Approved by: https://github.com/zhaojuanmao
This is a new version of #15648 based on the latest master branch.
Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.
In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
Fixes https://github.com/pytorch/pytorch/issues/71105
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
### Description
Across PyTorch's docstrings, both `callable` and `Callable` for variable types. The Callable should be capitalized as we are referring to the `Callable` type, and not the Python `callable()` function.
### Testing
There shouldn't be any testing required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75753
As per the design in https://github.com/pytorch/pytorch/issues/72138,
convert DDP parameters to ReplicatedTensor during its forward pass. Concretely,
this is done as follows:
1) Create a separate `_replicated_tensor_module` which is a copy of self.module
without creating copies of the Tensors themselves.
2) Use `_replicated_tensor_module` instead of `self.module` during the forward
pass.
3) Have a context manager `_ddp_replicated_tensor` to enable this, since
certain edge cases can fail where self.module is changed out of band resulting
in discrepancy between self.module and `_replicated_tensor_module`.
Differential Revision: [D35533736](https://our.internmc.facebook.com/intern/diff/D35533736/)
Approved by: https://github.com/wanchaol, https://github.com/rohan-varma
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74063
Address the issue https://github.com/pytorch/pytorch/issues/66229 as part of BE effort.
Basically:
1. We remove the stale comment which confuses users.
2. Add more unit tests to test the forward/backward hook working for DDP.
ghstack-source-id: 151463380
Test Plan: CI
Reviewed By: rohan-varma
Differential Revision: D34800830
fbshipit-source-id: 21133209323b2b5eda0dd6472f6309d4fb779b97
(cherry picked from commit b9b165c8305572128395daffafc13fcac8b85e29)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72843
# [Debug Story] Training Hanging and DDP Bucketing
**What are the characteristics of the hanging training instance?**
The model uses TorchRec `PooledEmbeddingArch` and corresponding sharding solution.
The model config difference to trigger this hanging issue is turning on position weighted embedding tables.
A feature processor module, `GroupedPositionWeightedModule`, is constructed on all ranks, but `GroupedPositionWeightedModule.foward(...)` is only [called on subset ranks of the whole world](https://fburl.com/code/yqrmtvli).
**What was the initial manifested error?**
The training was stuck in the first iteration.
**What are useful debugging tools this time?**
After turning off [static_graph in DDP](https://fburl.com/code/4io81p5i), we saw there were sparse feature lengths becoming negative values after all-to-all collectives. Hanging becomes fatal failure.
After turning on [torch.distributed DETAIL debugging mode](https://fburl.com/code/cp8e28mm), we saw 2 trainers sent out mismatched collectives, one doing all-to-all, the other doing all-reduce. So we know the negative values comes from all-to-all being matched with all-reduce. the error had happened ahead, which is the wrong timing of either doing all-reduce or all-to-all.
With more added loggings inside of DDP, it turned out the DDP decided to do all-reduce at different timings across different ranks.
**What is DDP bucketing?**
Once a gradient is ready on a rank, DDP uses all-reduce to synchronize the average of this gradient across all ranks.
Say we have 4 tensor ops. A, B, C, D.
In the most naive version, we could do one synchronization when all gradients in the full backward graph are ready.
The time sequence would be,
* D.grad
* C.grad
* B.grad
* A.grad
* All reduce on [D.grad, C.grad, B.grad, A.grad].
But that would be a huge waste of communication channel bandwidth.
With DDP bucketing, we could put ahead some gradient synchronization batch by batch. The above time sequence now becomes,
* D.grad
* C.grad
* All reduce on [D.grad, C.grad].
* B.grad
* A.grad
* All reduce on [B.grad, A.grad].
With gradient computation overlaps with communication, bucketing technique brings better DDP execution performance.
**What exactly went wrong in this case?**
1. The bucketing doesn’t honor backward graph execution order.
2. There are other collectives comm ops in backward graph.
3. There are unused parameters (i.e unused sub-module) in subset ranks of the whole world.
Using the above example again, we have 4 tensor ops. A, B, C, D.
Say we have 2 trainers,
B is the feature processor module.
B only runs on trainer 0 (both forward and backward), but not on trainer1.
C is the All-to-all (Pooled embeddings distribution).
C sends out all-to-all collective in both its forward and backward pass.
Keep assuming all other ops run on both trainers.
trainer_0 op sequence is,
A, B (feature preproc), C (all-to-all), D | D.grad, C.grad (reverse all-to-all), B.grad (feature proc grads), A.grad
trainer_1 op sequence is,
A, C (all-to-all), D | D.grad, C.grad (reverse all-to-all), A.grad
Even though the correct bucketing should be (same bucketing for both ranks),
* bucket_0, [D.grad, C.grad]
* bucket_1, [B.grad, A.grad]
but because of 1), they end up like,
* bucket_0, [B.grad, D.grad]
* bucket_1, [C.grad, A.grad]
Plus 2) and 3), the time sequence could like,
(check mark represents the gradient is ready)
(bucket is ready to do synchronization if all its enclosing gradients are ready)
* trainer_0
* t0,
* D.grad
* bucket_0, [B.grad, D.grad ✓]
* t1,
* **C.grad all-to-all**
* C.grad ✓
* bucket_1, [C.grad ✓, A.grad]
* t2
* B.grad
* bucket_0, [B.grad ✓, D.grad ✓] ✓
* t3
* All-reduce for bucket_0
* t4
* A.grad
* bucket_1, [C.grad ✓, A.grad ✓] ✓
* trainer_1
* t0,
* D.grad
* bucket_0, [B.grad ✓, D.grad ✓] ✓. (Because B is not used on trainer_1, DDP marks its gradient as ready immediately.)
* t1,
* **All-reduce for bucket_0**
* t2
* C.grad all-to-all
* bucket_1, [C.grad ✓, A.grad]
* t3
* A.grad
* bucket_1, [C.grad ✓, A.grad ✓] ✓
This is why trainer_0 all-to-all is matched up with trainer_1 all-reduce.
**What is the solution for fixing DDP?**
Disable DDP bucketing for the first iteration. D34051938
This is because after the first iteration, buckets will be built again based on real backward graph execution order.
So the slow gradient synchronization only affects the first iteration.
Test Plan:
buck build mode/dev-nosan caffe2/test/distributed:distributed_gloo_spawn
BACKEND=gloo WORLD_SIZE=3 buck-out/gen/caffe2/test/distributed/distributed_gloo_spawn\#binary.par -r test_ddp_logging_data_cpu
P484179296
buck build mode/dev-nosan caffe2/test/distributed:distributed_nccl_spawn
BACKEND=nccl WORLD_SIZE=2 buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn\#binary.par -r test_ddp_logging_data_cpu -r test_ddp_get_bucket_sizes
P484177200
Reviewed By: zhaojuanmao
Differential Revision: D34051938
fbshipit-source-id: 0c7f35875687095c3199f19990e73a8349b6e5b9
(cherry picked from commit bb8f11306ea51c2bd3ffd3ab001d62ce369a08ee)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72886
**Test Plan**
Searching for `_schedule_shadow_all_reduce_for_fwd_pass` shows that it is defined but never used.
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D34255651
Pulled By: awgu
fbshipit-source-id: 205a0325c2cdc05e127a183cb86fa2fc2e0db99d
(cherry picked from commit 4492f03a3f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72456
It is easier to log if static graph is set at construction time now that it is natively supported in DDP constructor, as opposed to waiting for the first iteration to finish. In some failure cases we're seeing the first iteration does not finish and thus we don't have this data which is vaulable to debug.
ghstack-source-id: 148840679
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D34045204
fbshipit-source-id: 72a187c1ce031db217de4b3ad20a64f2a74995bc
(cherry picked from commit 1d622c88f3)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71608
Per title
ghstack-source-id: 147577178
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33696382
fbshipit-source-id: 5b638d3edf5f03ba476356d61e96ca604de18c8f
(cherry picked from commit 436b547fb0)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71606
Per title
ghstack-source-id: 147577172
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33694037
fbshipit-source-id: a148d5ce6031f0cc20f33785cfe2c27d1fc2d682
(cherry picked from commit ace3261e0c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71483
claify that peak memory saving should be checked after first iteration when using gradient_as_bucket_view
ghstack-source-id: 147271113
Test Plan: unit test
Reviewed By: rohan-varma
Differential Revision: D33662424
fbshipit-source-id: f760da38e166ae85234e526ddf1526269ea25d42
(cherry picked from commit a40dda20da)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71459
1. add static_graph feature to DDP constructor;
2. still keep _set_static_graph() API, so that existing use cases are not affected, also it can be called internally by DDP constructor
3. four cases are covered:
static_graph = False, _set_static_graph() is called;
static_graph = False, _set_static_graph() is not called;
static_graph = True, _set_static_graph() is not called;
static_graph = True, _set_static_graph() is called;
ghstack-source-id: 147263797
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D33646738
fbshipit-source-id: 8c1730591152aab91afce7133d2adf1efd723855
(cherry picked from commit dc246a1129)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68827
Add a note about current checkpoint support with DDP. Note that this
does not include the features enabled with _set_static_graph yet, as it is an
undocumented private API. Once we support static graph as beta feature in OSS
we can add to the note here.
ghstack-source-id: 144285041
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D32624957
fbshipit-source-id: e21d156a1c4744b6e2a807b5b5289ed26701886f
Summary:
`default_collate`, `default_convert`, and `pin_memory` convert sequences into lists. I believe they should keep the original type when possible (e.g., I have a class that inherits from `list`, which comes from a 3rd party library that I can't change, and provides extra functionality).
Note it's easy to do when the type supports an iterable in its creation but it's not always the case (e.g., `range`).
Even though this can be accomplished if using a custom `default_collate`/`default_convert`, 1) this is behavior they should support out-of-the-box IMHO, and 2) `pin_memory` still does it.
cc VitalyFedyunin ejguan NivekT
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68779
Reviewed By: wenleix
Differential Revision: D32651129
Pulled By: ejguan
fbshipit-source-id: 17c390934bacc0e4ead060469cf15dde815550b4
Summary:
Patch bfloat16 support in NCCL, PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is
still not complete to enable bfloat16 for allreduce in end-to-end training.
This patch does the followings:
* fix minimum NCCL version from 2.9.7 to 2.10, NCCL adds bf16 support in
v2.10.3-1 (commit 7e51592)
* update bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
operations like all reduce can use it
* enable unit tests for bfloat16 datatype if possible
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843
Reviewed By: H-Huang
Differential Revision: D32248132
Pulled By: mrshenli
fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680
Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target for perf
optimization.
ghstack-source-id: 140875182
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D31679477
fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66015
Fixes https://github.com/pytorch/pytorch/issues/61982 by clone of
tensors in DDPSink. Only applies once for static_graph and generally for unused
params which already has overhead, so perf hit should not be an issue. Will
verify with benchmark.
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D31346633
fbshipit-source-id: 5b9245ade628565cffe01731f6a0dcbb6126029b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64515
For performance reasons, we would like to ensure that we can await
user collectives as part of custom buffer reduction in parallel to other work.
As a result, add support to return futures from custom buffer hooks and await
those futures at end of backwards pass.
Also added some docs to clarify how to use these APIs.
ghstack-source-id: 138793803
Test Plan: I
Reviewed By: zhaojuanmao
Differential Revision: D30757761
fbshipit-source-id: e1a2ead9ca850cb345fbee079cf0614e91bece44