Commit Graph

321 Commits

Author SHA1 Message Date
Pritam Damania
17e2313dd3 Add an API to DDP for dynamically updating the underlying process group. (#113580)
# Motivation

If we would like to reinitialize DDP with a different PG with `torch.compile`, we need to do the following:

```
del old_ddp
del old_pg
pg = init_pg(...)
ddp = DDP(pg)
model = torch.compile(DDP)
```

This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again.

# Proposal

As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580
Approved by: https://github.com/fduwjj
2023-11-15 09:05:02 +00:00
wz337
f2963642c2 [DDP] Add device_mesh to DDP ctor (#112761)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112761
Approved by: https://github.com/fegin
2023-11-08 03:08:08 +00:00
Aaron Gokaslan
8219bf051b [BE]: Apply RUF015 to torch folder (#113025)
Removes unnecessary allocations of iterators. There is a small chance this may have side effects as the entire iterator is no longer consumed, but this is a way more efficient method for retrieving the first element.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-07 00:48:15 +00:00
NVS Abhilash
db66f15785 docs: fix docstrings in distributed.py and others (fixes #112604) (#112657)
Fixes #112604

Fixes docstring by following `pydocstyle` outputs.

- torch/nn/parallel/distributed.py
Before: 84
```
torch/nn/parallel/distributed.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/parallel/distributed.py:92 in private function `_cast_buffers`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:103 in private function `_setup_mixed_precision_params`:
        D401: First line should be in imperative mood (perhaps 'Create', not 'Creates')
torch/nn/parallel/distributed.py:143 in private function `_find_tensors`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:273 in private method `__init__`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:273 in private method `__init__`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/nn/parallel/distributed.py:287 in private method `main_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:287 in private method `main_hook`:
        D400: First line should end with a period (not 'd')
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
        D400: First line should end with a period (not 'l')
torch/nn/parallel/distributed.py:324 in private method `post_hook`:
        D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:332 in public class `DistributedDataParallel`:
        D400: First line should end with a period (not 'n')
torch/nn/parallel/distributed.py:633 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:960 in private method `_fire_reducer_autograd_hook`:
        D401: First line should be in imperative mood (perhaps 'Fire', not 'Fires')
torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:969 in private method `_root_copy_hook`:
        D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1012 in private method `_module_wait_for_copy_hook`:
        D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
        D400: First line should end with a period (not ':')
torch/nn/parallel/distributed.py:1050 in private method `_ddp_init_helper`:
        D401: First line should be in imperative mood (perhaps 'Initialize', not 'Initialization')
torch/nn/parallel/distributed.py:1146 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1154 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
        D400: First line should end with a period (not 'o')
torch/nn/parallel/distributed.py:1222 in private method `_assign_modules_buffers`:
        D401: First line should be in imperative mood (perhaps 'Assign', not 'Assigns')
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
        D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:1277 in private method `_get_parameters`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
        D400: First line should end with a period (not 'P')
torch/nn/parallel/distributed.py:1312 in public method `no_sync`:
        D401: First line should be in imperative mood; try rephrasing (found 'A')
torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:1340 in private method `_get_active_ddp_module`:
        D403: First word of the first line should be properly capitalized ('Torchdynamo', not 'TorchDynamo')
torch/nn/parallel/distributed.py:1517 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1527 in public method `scatter`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1530 in public method `to_kwargs`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1539 in public method `gather`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1542 in public method `train`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1617 in public method `join`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1617 in public method `join`:
        D400: First line should end with a period (not 'f')
torch/nn/parallel/distributed.py:1617 in public method `join`:
        D401: First line should be in imperative mood; try rephrasing (found 'A')
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
        D400: First line should end with a period (not 'y')
torch/nn/parallel/distributed.py:1723 in public method `join_hook`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:1752 in public method `join_device`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1756 in public method `join_process_group`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
        D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:1765 in private method `_register_buffer_comm_hook`:
        D401: First line should be in imperative mood (perhaps 'Allow', not 'Allows')
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
        D400: First line should end with a period (not 'a')
torch/nn/parallel/distributed.py:1805 in public method `register_comm_hook`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
        D400: First line should end with a period (not 'P')
torch/nn/parallel/distributed.py:1887 in private method `_register_builtin_comm_hook`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
        D400: First line should end with a period (not 'a')
torch/nn/parallel/distributed.py:1914 in private method `_register_fused_optim`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/nn/parallel/distributed.py:2005 in public method `will_sync_module_buffers`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2060 in private method `_default_broadcast_coalesced`:
        D400: First line should end with a period (not 'e')
torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:2128 in private method `_get_data_parallel_params`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
        D400: First line should end with a period (not 'r')
torch/nn/parallel/distributed.py:2141 in private method `_set_params_and_buffers_to_ignore_for_model`:
        D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
        D400: First line should end with a period (not 's')
torch/nn/parallel/distributed.py:2170 in private method `_get_ddp_logging_data`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
        D400: First line should end with a period (not 'g')
torch/nn/parallel/distributed.py:2184 in private method `_set_ddp_runtime_logging_sample_rate`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
        D400: First line should end with a period (not 'l')
torch/nn/parallel/distributed.py:2202 in private method `_set_static_graph`:
        D401: First line should be in imperative mood; try rephrasing (found 'It')
torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/parallel/distributed.py:2227 in private method `_remove_autograd_hooks`:
        D401: First line should be in imperative mood (perhaps 'Remove', not 'Removes')
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
        D400: First line should end with a period (not 'd')
torch/nn/parallel/distributed.py:2233 in private method `_check_reducer_finalized`:
        D401: First line should be in imperative mood (perhaps 'Check', not 'Checks')
84
```

After: 12
```
torch/nn/parallel/distributed.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/parallel/distributed.py:618 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/parallel/distributed.py:1133 in public method `__getstate__`:
        D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1141 in public method `__setstate__`:
        D105: Missing docstring in magic method
torch/nn/parallel/distributed.py:1503 in public method `forward`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1513 in public method `scatter`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1516 in public method `to_kwargs`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1525 in public method `gather`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1528 in public method `train`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1734 in public method `join_device`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1738 in public method `join_process_group`:
        D102: Missing docstring in public method
torch/nn/parallel/distributed.py:1986 in public method `will_sync_module_buffers`:
        D102: Missing docstring in public method
12
```

- torch/nn/utils/_named_member_accessor.py
Before: 23
```
torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:109 in public class `NamedMemberAccessor`:
        D400: First line should end with a period (not 's')
torch/nn/utils/_named_member_accessor.py:115 in public method `__init__`:
        D107: Missing docstring in __init__
torch/nn/utils/_named_member_accessor.py:122 in public method `get_submodule`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:155 in public method `swap_submodule`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:164 in public method `get_tensor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:185 in public method `set_tensor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:194 in public method `del_tensor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:211 in public method `swap_tensor`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:224 in public method `get_tensors`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:233 in public method `set_tensors`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:249 in public method `set_tensors_dict`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:261 in public method `del_tensors`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:276 in public method `swap_tensors`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:296 in public method `swap_tensors_dict`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_named_member_accessor.py:325 in public method `check_keys`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:340 in public method `named_parameters`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:349 in public method `named_buffers`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:358 in public method `named_tensors`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/nn/utils/_named_member_accessor.py:368 in public method `named_modules`:
        D200: One-line docstring should fit on one line with quotes (found 3)
23
```

After: 4
```
torch/nn/utils/_named_member_accessor.py:12 in public function `set_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:29 in public function `swap_tensor`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:85 in public function `swap_submodule`:
        D103: Missing docstring in public function
torch/nn/utils/_named_member_accessor.py:116 in public method `__init__`:
        D107: Missing docstring in __init__
4
```

- torch/nn/utils/_per_sample_grad.py
Before: 3
```
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
        D400: First line should end with a period (not ')')
torch/nn/utils/_per_sample_grad.py:12 in public function `call_for_per_sample_grads`:
        D402: First line should not be the function's "signature"
3
```
After: 0
```
0
```

- torch/nn/utils/init.py
Before: 3
```
torch/nn/utils/init.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/utils/init.py:6 in public function `skip_init`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/init.py:6 in public function `skip_init`:
        D400: First line should end with a period (not 'g')
3
```
After: 1
```
torch/nn/utils/init.py:1 at module level:
        D100: Missing docstring in public module
1
```

- torch/nn/utils/memory_format.py
Before: 4
```
torch/nn/utils/memory_format.py:1 at module level:
        D100: Missing docstring in public module
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
        D202: No blank lines allowed after function docstring (found 1)
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
        D205: 1 blank line required between summary line and description (found 0)
torch/nn/utils/memory_format.py:5 in public function `convert_conv2d_weight_memory_format`:
        D400: First line should end with a period (not '`')
4
```
After: 1
```
torch/nn/utils/memory_format.py:1 at module level:
        D100: Missing docstring in public module
1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112657
Approved by: https://github.com/fduwjj
2023-11-02 05:52:47 +00:00
Oleg Bulatov
192477b5ba Enable flake8-bugbear B020 lint (#110823)
Fixes part of https://github.com/pytorch/pytorch/issues/106571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110823
Approved by: https://github.com/Skylion007
2023-10-24 22:43:47 +00:00
Rohan Varma
24e5d61af8 Log usage of optimizer in backward (#110206)
This will allow us to inspect and aggregate jobs that use optimizer in
backward

Differential Revision: [D48674740](https://our.internmc.facebook.com/intern/diff/D48674740/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110206
Approved by: https://github.com/awgu
2023-09-29 11:00:07 +00:00
Andrei Gheorghe
6275f91654 Improved DDP checkpoint documentation (#106985)
Amended the documentation for the specified case.

Fixes #84589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106985
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-09-25 22:54:24 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
Rohan Varma
c11412b4a8 [DDP] Support optim in backward after DDP init (#105995)
This allows in backward optimizers to be configured after DDP init, in
addition to before as was previously supported.

Differential Revision: [D47783347](https://our.internmc.facebook.com/intern/diff/D47783347/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105995
Approved by: https://github.com/fegin
2023-07-29 01:36:25 +00:00
Justin Chu
79c5e33349 [BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ (#105436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436
Approved by: https://github.com/malfet, https://github.com/albanD
2023-07-21 07:38:46 +00:00
Animesh Jain
0444f9f85b [dynamo] Reland #104317 - Lazy disable_dynamo API out-of-dynamo (#104664)
Internal failed because of torch.deploy issues with disable_dynamo in fx/* and _jit/* files. Removing disable_dynamo for both. Added a comment in the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104664
Approved by: https://github.com/wconstab
2023-07-06 00:48:02 +00:00
PyTorch MergeBot
54e320d4d1 Revert "[dynamo] Lazy disable_dynamo API out-of-dynamo (#104317)"
This reverts commit 5c12a810ac.

Reverted https://github.com/pytorch/pytorch/pull/104317 on behalf of https://github.com/huydhn due to This has been reverted internally by D47166892, so I need to also revert it on OSS to keep them in sync ([comment](https://github.com/pytorch/pytorch/pull/104317#issuecomment-1621099151))
2023-07-05 06:21:48 +00:00
Animesh Jain
5c12a810ac [dynamo] Lazy disable_dynamo API out-of-dynamo (#104317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104317
Approved by: https://github.com/jansel, https://github.com/wconstab, https://github.com/mlazos
2023-06-29 13:30:17 +00:00
Howard Huang
9165d46b89 DDP + C10D sparse all_reduce changes (#103916) (#104256)
Summary:

reland of https://github.com/pytorch/pytorch/pull/103916

## Changes

prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function.

prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D47056695

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104256
Approved by: https://github.com/rohan-varma
2023-06-28 00:37:52 +00:00
PyTorch MergeBot
436d035dc7 Revert "DDP + C10D sparse all_reduce changes (#103916)"
This reverts commit fed5fba6e4.

Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))
2023-06-26 22:37:58 +00:00
Howard Huang
fed5fba6e4 DDP + C10D sparse all_reduce changes (#103916)
Summary:
## Changes

prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function.

prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D46724856

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916
Approved by: https://github.com/rohan-varma
2023-06-26 20:42:17 +00:00
Rohan Varma
f044613f78 Back out "Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)" (#103938)
Differential Revision: [D46883396](https://our.internmc.facebook.com/intern/diff/D46883396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103938
Approved by: https://github.com/awgu, https://github.com/fegin
2023-06-22 21:55:58 +00:00
Huy Do
b1ddd5a293 Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)
Per the discussion in https://github.com/pytorch/pytorch/pull/103629#issuecomment-1598001313, I preemptively create this revert PR to revert all commits in the stack.  This seems like a safer option than using the bot as the commit has already been in trunk since last week.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103873
Approved by: https://github.com/rohan-varma
2023-06-20 16:25:00 +00:00
Rohan Varma
80139fc2db [DDP] multiple forward support for static graph (#103487)
Adds support for multiple forward before bwd call for
static_graph=True.

There are 2 changes:
1) Change tracking of accounting of when to populate static grap related maps
from relying on forward iteration to backward calls
2) In DDP python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead use a flag.

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103487
Approved by: https://github.com/awgu
2023-06-14 16:14:52 +00:00
Rohan Varma
780b24b27c [DDP] Refactor _DDPSink to take DDP weakref (#103304)
This will make future PRs to support DDP static graph multi forward
cleaner.

Differential Revision: [D46584545](https://our.internmc.facebook.com/intern/diff/D46584545/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103304
Approved by: https://github.com/awgu
2023-06-14 16:14:52 +00:00
Rohan Varma
a3a32c1be0 [DDP] Rename num_iterations -> num_forward_calls (#103283)
This more accurately represents what we're counting. At iteration is a
forward + backward call, but here we're just counting forward calls. This makes
things less confusing in future diffs where we support DDP static graph
multiple forwards.

Differential Revision: [D46580601](https://our.internmc.facebook.com/intern/diff/D46580601/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103283
Approved by: https://github.com/awgu
2023-06-14 16:14:50 +00:00
Rohan Varma
2076a2ffa7 [DDP] Rename state_dict var to ddp_state (#103282)
This name is confusing in the context that it is just a dictionary
used to pass state to DDP backward pass.

Differential Revision: [D46580516](https://our.internmc.facebook.com/intern/diff/D46580516/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103282
Approved by: https://github.com/awgu
2023-06-14 16:14:49 +00:00
Rohan Varma
88ce6215f5 [FSDP/DDP] Unify _cast_forward_inputs (#102680)
Closes https://github.com/pytorch/pytorch/issues/96380

Differential Revision: [D46342814](https://our.internmc.facebook.com/intern/diff/D46342814/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102680
Approved by: https://github.com/awgu
2023-06-04 18:31:21 +00:00
Pritam Damania
9a2df0a5af [RFC] Add method to DDP to check for backward finalization. (#100773)
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.

As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction.

Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
2023-05-31 20:43:06 +00:00
Matthew Hoffman
c28f8e314d Add type hints in torch/distributed/utils.py (#102262)
Fixes #77190

Pretty similar to the typing in `torch/nn/parallel`, which was also improved recently: #102194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102262
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze
2023-05-30 19:57:45 +00:00
Aaron Gokaslan
3e2ea32dab [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
Removes useless try statements and unreachable code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874
Approved by: https://github.com/malfet
2023-05-19 17:30:52 +00:00
Xing Liu
0731420645 [PyTorch/Distributed]Only sync buffers when broadcast_buffers is True (#100729)
Summary: Disable buffers sync in _sync_module_states(...) when broadcast_buffers is False. This change will memory usage when a model has huge buffers and does not need broadcast buffers.

Test Plan: .

Differential Revision: D45610709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100729
Approved by: https://github.com/mrshenli
2023-05-08 16:34:29 +00:00
Rohan Varma
87db02ea38 [DDP] Perform input casting in pre forward (#100131)
This is so that replicate can also have the feature to cast its
inputs, which it currently does not. Next diff will change replicate pre hook
to support this.

Differential Revision: [D45335179](https://our.internmc.facebook.com/intern/diff/D45335179/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100131
Approved by: https://github.com/zhaojuanmao
2023-04-27 17:34:46 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Rohan Varma
bba2090831 Enable fused optimizer for DP (#98270)
Differential Revision: [D42714482](https://our.internmc.facebook.com/intern/diff/D42714482/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D42714482/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98270
Approved by: https://github.com/awgu
2023-04-13 20:16:32 +00:00
Kazuaki Ishizaki
a531a464fd Fix typos under torch/nn directory (#97594)
This PR fixes typos in comments of `.py` files under `torch/nn` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97594
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-04-10 22:07:15 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
feifan
d95ee64b58 ddp forward support custom backend. (#98283)
Currently DDP only considers CUDA backend,DDP forward will transfer tensor to CUDA. We want ddp to run on custom backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98283
Approved by: https://github.com/ezyang
2023-04-09 01:30:42 +00:00
Sergii Dymchenko
477f3f555f Simplify by using yield from (#97831)
The issues were found by SIM104 flake8-simplify in a local run.

I'll take a look on adding the check to the CI separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97831
Approved by: https://github.com/Skylion007
2023-03-29 19:15:24 +00:00
Charlie Yan
44e73db3c2 [2/n] Consolidate replicate and DDP: split forward function (#96658)
Split `forward` function into `pre_forward` and `post_forward`, so that they can be reused in the composable API of `replicate`.

Differential Revision: [D44377456](https://our.internmc.facebook.com/intern/diff/D44377456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96658
Approved by: https://github.com/rohan-varma
2023-03-29 13:57:16 +00:00
Pritam Damania
e20e5f5578 [RFC] Add an API to remove autograd hooks from DDP (#96490)
Summary:
When creating a new DDP instance for the same model when an old DDP instance existed, the autograd hooks from the old DDP instance might not be cleared. Also, relying on python gc to clear out old autograd hooks is fragile and may not work 100% of the time.

As a result, in this PR I'm adding a way to explicitly remove these hooks from DDP

Test Plan:
Unit test added

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96490
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
2023-03-21 02:56:16 +00:00
Charlie Yan
13538c88b3 [1/n] Consolidate replicate and DDP: setup ufmt for distributed.py (#96597)
As we already enabled ufmt for composable APIs in https://github.com/pytorch/pytorch/pull/90873, it seems a good idea to enable ufmt for other distributed APIs as well. This change setup ufmt for DDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96597
Approved by: https://github.com/rohan-varma
2023-03-17 06:25:11 +00:00
Rohan Varma
71adb32ddc [DDP] API to get data parallel parameters (#95097)
Add a private API to retrieve data parallel parameters. This is
useful for example for apply_optimizer_in_backward in the case user wishes to
ensure it is applied only on DDP managed parameters.

Differential Revision: [D43383878](https://our.internmc.facebook.com/intern/diff/D43383878/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95097
Approved by: https://github.com/zhaojuanmao, https://github.com/fegin
2023-03-16 00:30:37 +00:00
Xinfeng
906a1952c6 [DDP] Enable delayed all reduce in DDP (#96673)
Summary: Enable the functionality of delaying all reduce in DDP to specify the parameters whose all reduce will be hooked to a specific param. This prevents AllReduce blocking All2All in some recommendation models.

Test Plan: GitHub CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96673
Approved by: https://github.com/zhaojuanmao
2023-03-14 04:25:25 +00:00
Rohan Varma
32f11f58c9 DDP native mixed precision (#92882)
Implements native mixed precision support for DDP in a similar fashion to how it is enabled for FSDP. The implementation works as follows:

1. In DDP init, we save `_mp_param` and `_fp_param` variables to manage mixed precision parameter usage. In particular, _mp_param will represent the parameter in the reduced precision, while _fp_param will represent the param in regular precision. During forward/backward, we swap back and forth as needed.
2. The root module gets a root pre-forward hook that kicks off copies to the reduced precision for all submodules. An event is recorded for each submodule to allow for waiting, as we run these asynchronously.
3. Each module gets a pre-forward hook that waits on its corresponding event. note that modules might be reused during training, in this case the wait is only done for the first module. After this wait, the module's parameters are in reduced precision.
4. In the pre-forward hook, we register a backward hook on the lower precision parameters in order to run reduced precision allreduce + parameter upcast. We can't rely on the Reducer's constructor setting up these hooks because the gradient is accumulated on the low precision param, so we need to register them ourselves.
5. In the backward pass, when the hook runs, we first run allreduce + divide in the reduced precision. Next, we upcast parameters and gradients back to fp32 asynchronously. We also queue a callback at the end of backward to wait on these upcasts so that the upcast is complete before optim.step() runs.
6. Parameters that don't require grad are also cast since they may be used in computation, they are upcast back in the final autograd callback.
7. DDP Ignored parameters are not touched.

Follow-ups:

1. Unify comm hooks and make it work with apply optimizer in backward
2. implement keep_low_precision_grads,
3. allow BN, LN, or custom units to run in reduced precision,
4. support for cast_forward_inputs
5. Unify certain APIs / helpers with FSDP where possible, such as for _cast_forward_inputs
6. Integrate this with replicate() API.
7. The order in which we kick off copies and wait for them is set by the iteration order of module.modules(), but this might not be how the modules are used in the actual training. In the worst case, the last module in module.modules() could be used first which would result in waiting for all copies unnecessarily. For static graphs, we should record the module execution order and copy / wait in this order.
8. Entirely unused modules probably don't need to be cast.

Differential Revision: [D42515803](https://our.internmc.facebook.com/intern/diff/D42515803/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92882
Approved by: https://github.com/zhaojuanmao
2023-03-13 14:10:31 +00:00
fduwjj
a88bfc60c7 [2/N][ST deprecate][BE] Remove Replicate Tensor convert from DDP and PTD (#95450)
No use is found for this ST/Replicated Tensor based DDP. As part of ShardedTensor migration, let's remove this logic. Trying to undo everything in https://github.com/pytorch/pytorch/pull/75753.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95450
Approved by: https://github.com/wanchaol
2023-02-26 03:03:37 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Aaron Gokaslan
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehension fixes and checks. This changes replace all remaining unnecessary generator expressions with list/dict/set comprehensions which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as 'set(a for a in b)`, resolving it into just the set call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
Aaron Gokaslan
1e2d82b8e4 [BE] Merge isinstance calls together (#94419)
Simplify and speeds up isinstance calls by checking for multiple types at the same time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94419
Approved by: https://github.com/ezyang
2023-02-09 00:47:26 +00:00
Aaron Gokaslan
748bac8757 [BE]: Apply pyupgrade yield from and unit test alias upgrades (#94309)
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit_tests and more upgrades yield for loops into yield from generators which are more performance and propagates more information / exceptions from original generator. This is the modern recommended way of forwarding generators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
2023-02-07 20:08:58 +00:00
Rohan Varma
264c89658b Move in backward opt setup to helper (#92059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92059
Approved by: https://github.com/awgu
2023-02-02 23:57:14 +00:00
Rohan Varma
975feb606e [DDP][Easy] Remove unused var (#93128)
removes this unused var, the overall buffer comm hook feature is also not being used, we should deprecate / remove it as it is still a private API.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93128
Approved by: https://github.com/awgu
2023-01-27 18:08:29 +00:00