Summary:
Rename _device_mesh.py to device_mesh.py, update all call sites, and add documentation.
Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was reverted because it was failing a public module binding test on macOS, due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since the original import still works, we remove the changes to this file.
Test Plan: CI.
Differential Revision: D51825114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
Fixes #112639
```txt
torch/utils/_sympy/value_ranges.py
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:129 in public method `increasing_map`:
D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:135 in public method `decreasing_map`:
D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:141 in public method `monotone_map`:
D400: First line should end with a period (not 'g')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
D400: First line should end with a period (not '0')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
D403: First word of the first line should be properly capitalized ('Fn', not 'fn')
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
D400: First line should end with a period (not ':')
torch/utils/_sympy/value_ranges.py:171 in public method `coordinatewise_monotone_map`:
D400: First line should end with a period (not 'e')
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:488 in public class `ValueRangeAnalysis`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:489 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:501 in public method `bool_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:506 in public method `default_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:511 in public method `load`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:514 in public method `store`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:517 in public method `reduction`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:520 in public method `index_expr`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:525 in public method `to_dtype`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:558 in public method `square`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:562 in public method `neg`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:566 in public method `truncdiv`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:577 in public method `sub`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:580 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:585 in public function `bound_sympy`:
D103: Missing docstring in public function
36
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:490 in public class `ValueRangeAnalysis`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:491 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:503 in public method `bool_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:508 in public method `default_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:513 in public method `load`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:516 in public method `store`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:519 in public method `reduction`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:522 in public method `index_expr`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:527 in public method `to_dtype`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:560 in public method `square`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:564 in public method `neg`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:568 in public method `truncdiv`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:579 in public method `sub`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:582 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:587 in public function `bound_sympy`:
D103: Missing docstring in public function
28
torch/utils/viz/_cycles.py
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
D400: First line should end with a period (not 'g')
torch/utils/viz/_cycles.py:256 in public class `Node`:
D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D400: First line should end with a period (not 'p')
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D401: First line should be in imperative mood; try rephrasing (found 'Reference')
14
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:256 in public class `Node`:
D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
D103: Missing docstring in public function
9
torch/distributed/argparse_util.py
torch/distributed/argparse_util.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/argparse_util.py:13 in public class `env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:13 in public class `env`:
D400: First line should end with a period (not 'g')
torch/distributed/argparse_util.py:13 in public class `env`:
D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:43 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
D102: Missing docstring in public method
torch/distributed/argparse_util.py:61 in public class `check_env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:61 in public class `check_env`:
D400: First line should end with a period (not 's')
torch/distributed/argparse_util.py:61 in public class `check_env`:
D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
D102: Missing docstring in public method
11
torch/distributed/argparse_util.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/argparse_util.py:43 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
D102: Missing docstring in public method
torch/distributed/argparse_util.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
D102: Missing docstring in public method
5
torch/distributed/_composable_state.py
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D202: No blank lines allowed after function docstring (found 1)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D400: First line should end with a period (not '`')
3
0
torch/distributed/launch.py
torch/distributed/launch.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/launch.py:1 at module level:
D400: First line should end with a period (not 'd')
torch/distributed/launch.py:156 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/launch.py:171 in public function `launch`:
D103: Missing docstring in public function
torch/distributed/launch.py:180 in public function `main`:
D103: Missing docstring in public function
5
torch/distributed/launch.py:157 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/launch.py:172 in public function `launch`:
D103: Missing docstring in public function
torch/distributed/launch.py:181 in public function `main`:
D103: Missing docstring in public function
3
torch/distributed/remote_device.py
torch/distributed/remote_device.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/remote_device.py:81 in private method `worker_name`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:81 in private method `worker_name`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:88 in private method `rank`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:88 in private method `rank`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:95 in private method `device`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/remote_device.py:95 in private method `device`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
7
torch/distributed/remote_device.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/remote_device.py:85 in private method `rank`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:85 in private method `rank`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
3
torch/distributed/rendezvous.py
torch/distributed/rendezvous.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/rendezvous.py:23 in public function `register_rendezvous_handler`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/distributed/rendezvous.py:88 in public function `rendezvous`:
D103: Missing docstring in public function
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
D400: First line should end with a period (not 'r')
5
torch/distributed/rendezvous.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/rendezvous.py:89 in public function `rendezvous`:
D103: Missing docstring in public function
2
torch/distributed/run.py
torch/distributed/run.py:9 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:9 at module level:
D400: First line should end with a period (not '`')
torch/distributed/run.py:393 in public function `get_args_parser`:
D202: No blank lines allowed after function docstring (found 1)
torch/distributed/run.py:393 in public function `get_args_parser`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/distributed/run.py:610 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/run.py:615 in public function `parse_min_max_nnodes`:
D103: Missing docstring in public function
torch/distributed/run.py:629 in public function `determine_local_world_size`:
D103: Missing docstring in public function
torch/distributed/run.py:670 in public function `get_rdzv_endpoint`:
D103: Missing docstring in public function
torch/distributed/run.py:677 in public function `get_use_env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:677 in public function `get_use_env`:
D401: First line should be in imperative mood (perhaps 'Retrieve', not 'Retrieves')
torch/distributed/run.py:689 in public function `config_from_args`:
D103: Missing docstring in public function
torch/distributed/run.py:770 in public function `run_script_path`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:770 in public function `run_script_path`:
D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
torch/distributed/run.py:781 in public function `run`:
D103: Missing docstring in public function
torch/distributed/run.py:804 in public function `main`:
D103: Missing docstring in public function
15
torch/distributed/run.py:611 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/run.py:616 in public function `parse_min_max_nnodes`:
D103: Missing docstring in public function
torch/distributed/run.py:630 in public function `determine_local_world_size`:
D103: Missing docstring in public function
torch/distributed/run.py:671 in public function `get_rdzv_endpoint`:
D103: Missing docstring in public function
torch/distributed/run.py:691 in public function `config_from_args`:
D103: Missing docstring in public function
torch/distributed/run.py:784 in public function `run`:
D103: Missing docstring in public function
torch/distributed/run.py:807 in public function `main`:
D103: Missing docstring in public function
7
torch/distributed/__init__.py
torch/distributed/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/distributed/__init__.py:8 in public function `is_available`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/__init__.py:8 in public function `is_available`:
D400: First line should end with a period (not ',')
torch/distributed/__init__.py:8 in public function `is_available`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
4
torch/distributed/__init__.py:1 at module level:
D104: Missing docstring in public package
1
torch/distributed/utils.py
torch/distributed/utils.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
D400: First line should end with a period (not ')')
torch/distributed/utils.py:47 in private function `_cast_forward_inputs`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:88 in private function `_recursive_to`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/utils.py:141 in private function `_p_assert`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:141 in private function `_p_assert`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/distributed/utils.py:141 in private function `_p_assert`:
D400: First line should end with a period (not 't')
torch/distributed/utils.py:141 in private function `_p_assert`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D400: First line should end with a period (not 'n')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D400: First line should end with a period (not 'y')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D401: First line should be in imperative mood (perhaps 'Synchronize', not 'Synchronizes')
15
torch/distributed/utils.py:1 at module level:
D100: Missing docstring in public module
1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112953
Approved by: https://github.com/weifengpy
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy error-handling code like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this, this PR adds the following error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
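A minimal pure-Python sketch of what this hierarchy enables (class names from the list above; the real classes live in the C++ extension, and reparenting `DistBackendError` under `DistError` is an assumption of this sketch):

```python
class DistError(RuntimeError):
    """Base type of all distributed errors."""

class DistBackendError(DistError):
    """Error raised by a process group backend (e.g. NCCL)."""

class DistStoreError(DistError):
    """Error originating from the store."""

class DistNetworkError(DistError):
    """General network error coming from the socket library."""

def classify(exc):
    # Error handling becomes a type check instead of message string matching.
    if isinstance(exc, DistStoreError):
        return "store"
    if isinstance(exc, DistNetworkError):
        return "network"
    if isinstance(exc, DistBackendError):
        return "backend"
    if isinstance(exc, DistError):
        return "distributed"
    return "unknown"
```

With this, the string-matching chains above collapse into `except DistNetworkError:` style handlers.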
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy error-handling code like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this, this PR adds the following error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
1. Record time spent for init_process_group, new_group, _store_based_barrier
2. Rename c10d_error_logger to c10d_logger for generalization.
3. Refactor to move the logger wrappers in distributed_c10d.py to c10d_logger.py.
4. Rename the logger wrappers (BC-breaking): exception_handler is renamed to exception_logger to avoid confusion with logging handlers.
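The two wrapper styles can be sketched as plain decorators (names and logger setup are illustrative, not the actual c10d_logger.py implementation):

```python
import functools
import logging
import time

logger = logging.getLogger("c10d")

def time_logger(fn):
    # Records how long an API call took (item 1: init_process_group etc.).
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        logger.debug("%s took %.1f us", fn.__name__,
                     (time.perf_counter() - start) * 1e6)
        return result
    return wrapper

def exception_logger(fn):
    # Logs the exception before re-raising (the renamed wrapper from item 4).
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            logger.exception("error in %s", fn.__name__)
            raise
    return wrapper

@time_logger
@exception_logger
def init_process_group_stub(backend="gloo"):
    return backend
```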
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912
Approved by: https://github.com/fduwjj
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.
By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
for i in range(num_coll):
dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
for i in range(num_coll):
dist.all_reduce(tensors[i])
cm.wait()
```
In the previous implementation, each collective in the `num_coll` loop calls into the C++ backend, accumulating pybind overhead.
In the new implementation, we capture the collectives at the Python level and only fire toward C++ at the exit of the coalescing manager.
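The capture-then-flush idea can be sketched in plain Python (names are hypothetical; the real manager batches the captured collectives into a single backend call):

```python
import contextlib

class _CoalescingManager:
    def __init__(self):
        self.pending = []  # collectives captured at the Python level

@contextlib.contextmanager
def coalescing_manager(flush):
    cm = _CoalescingManager()
    yield cm           # ops inside the block only append to cm.pending
    flush(cm.pending)  # a single batched call fires toward the backend on exit

batched_calls = []

with coalescing_manager(flush=lambda ops: batched_calls.append(list(ops))) as cm:
    for i in range(3):
        cm.pending.append(("all_reduce", i))  # stand-in for dist.all_reduce
```

Only `flush` crosses the (simulated) Python/C++ boundary, once per `with` block instead of once per collective.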
### Tests
In the current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us
Hence a 4x reduction in CPU overhead (dependent on `num_coll`).
Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330
### Implementation
Move backend-specific (NCCL, Gloo, etc.) collective implementations to the corresponding `Backend` class. Update ProcessGroup to support multiple backends and use the dispatcher to call backends based on tensor device type.
### Changes
#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed issue #85122.
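The dispatch-by-device-type idea can be sketched as follows (a simplified stand-in under assumed names; the real dispatch happens in C++ via `Ops.cpp` and queries the tensors' device):

```python
class Backend:
    """Base class for backend-specific collective implementations."""
    def allreduce(self, tensors):
        raise NotImplementedError

class GlooBackend(Backend):
    def allreduce(self, tensors):
        return f"gloo:{len(tensors)}"  # placeholder for the real collective

class NCCLBackend(Backend):
    def allreduce(self, tensors):
        return f"nccl:{len(tensors)}"  # placeholder for the real collective

class ProcessGroup:
    """Holds one backend per device type and routes collectives to it."""
    def __init__(self):
        self._backends = {}  # device type -> Backend

    def register_backend(self, device_type, backend):
        self._backends[device_type] = backend

    def allreduce(self, tensors, device_type):
        # The real code derives device_type from the tensors themselves.
        return self._backends[device_type].allreduce(tensors)
```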
#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it dispatches operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`: update `DistributedDataParallelTest` to use `init_process_group`, update `ReducerTest`, and update `test_broadcast_coalesced_gloo` to move off the PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`
### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig
### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)
# Example
This is a basic script (using 2 backends within a process group)
```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os
if __name__ == "__main__":
rank = os.environ.get("RANK")
# initialize with both gloo and nccl
dist.init_process_group()
# with gloo
dist.all_reduce(torch.tensor([1.0]))
print(f"Rank {rank} finished")
# with nccl
dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```
Test Plan: Imported from OSS
Differential Revision: D42069829
Pulled By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro which throws a generic `RuntimeError`. This change introduced a new error type `DistBackendError` which derives from `RuntimeError` to signify there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8
Changes:
- introduce new error type
- Update `C10D_NCCL_CHECK`
Sample script to demonstrate new error type
```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py
import torch
import torch.distributed as dist
if __name__ == "__main__":
dist.init_process_group("nccl")
dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```
Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134
Approved by: https://github.com/rohan-varma
Summary: Fix the typing stub of `ProcessGroup` in "torch/distributed/__init__.py", so that it won't confuse pyre, and we can remove a lot of pyre suppression comments.
Test Plan: pyre check
Differential Revision: D40921667
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88281
Approved by: https://github.com/wanchaol
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643
1. Add a decorator function exception_handlers to c10d collectives.
2. Update test(torch/distributed/distributed_c10d.py) to include mp tests for exception_handler.
```
python3 test/distributed/test_c10d_error_logger.py
```
Test Plan: Test in OSS.
Reviewed By: H-Huang
Differential Revision: D40281632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988
Approved by: https://github.com/H-Huang
Move a bunch of globals to instance members and replace all uses of them.
We move all PG related globals under World and use a singleton instance under _world.
This creates an undocumented extension point to inject full control of how c10d state behaves.
One simple hack is to change _world to an implementation that uses a thread-local, enabling per-thread PGs.
It almost gets DDP working; the PG is only missing an implementation of all_reduce.
This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68
This change ensures BC by keeping the global variables around and have the default _World wrap it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
Summary:
The flow logic around torch.dist imports results in a large number of Pyre errors (hundreds); it would be preferable to raise on import rather than fail silently.
Con: some percentage (macOS?) of users may have notebooks that import PT-D; that group is likely small, since any attempt to call parts of the library would just fail anyway...
TODO: assuming this is OK, remove the tens to hundreds of now-unneeded pyre ignores.
Test Plan: existing unit tests
Differential Revision: D39842273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85781
Approved by: https://github.com/mrshenli
- Support storing SymFloat in IValue
- Add SymFloat to JIT type system (erases to float)
- Printing support for SymFloat
- add/sub/mul/truediv operator support for SymFloat
- Support truediv on integers, it returns a SymFloat
- Support parsing SymFloat from Python object
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85411
Approved by: https://github.com/albanD
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73372
This PR allows an optional `world_size` argument in init_rpc. It changes rendezvous to accept `None` for world_size and creates a new code path when initializing the TensorPipe agent for init_rpc. TensorPipe agent creation is protected by a critical section enforced via the store, so that only one node can create a TPAgent at a time.
This PR does not yet enable RPC commands between ranks.
Previously:
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", world_size=1, rank=0)
```
Now (only rank is needed):
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", rank=0)
```
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D34621651
Pulled By: H-Huang
fbshipit-source-id: 09dbb511d5a00c219a6ce0a35501ff2e388998b0
(cherry picked from commit 834aedc3256167399c323169ef2f0c9b3cf98dff)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166
This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
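The behavior of those three APIs can be sketched in pure Python (the real implementation lives in C++ bindings; the default-to-OFF handling of unknown values here is a simplification):

```python
import os
from enum import Enum

class DebugLevel(Enum):
    OFF = 0
    INFO = 1
    DETAIL = 2

# Process-wide debug level, mutable after startup.
_debug_level = DebugLevel.OFF

def get_debug_level():
    """Return the current distributed debug level."""
    return _debug_level

def set_debug_level(level):
    """Override the debug level after the process has started."""
    global _debug_level
    _debug_level = level

def set_debug_level_from_env():
    """Re-read TORCH_DISTRIBUTED_DEBUG from the environment."""
    name = os.environ.get("TORCH_DISTRIBUTED_DEBUG", "OFF").upper()
    if name in DebugLevel.__members__:
        set_debug_level(DebugLevel[name])
    else:
        set_debug_level(DebugLevel.OFF)
```

Being able to call `set_debug_level` mid-run is the new part; previously the level was fixed by the environment variable at startup.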
ghstack-source-id: 149778566
Test Plan: Run the existing unit tests.
Reviewed By: rohan-varma
Differential Revision: D34371226
fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66338
This commit exposes the c10d extension API to Python. Users can
now override c10d communication behaviors in pure Python, and no
longer need to go through the C++ extension steps.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D31514351
Pulled By: mrshenli
fbshipit-source-id: a8b94af0af7960c078e1006c29b25f7f3bd86c81
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113
Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.
Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`
Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30615965
fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62927
As part of the ShardedTensor work, we realized we do need some sort of
_RemoteDevice structure that deals with our format of "workername/device" so
that users don't have to worry about parsing this string directly.
Right now this structure is just the bare minimum and is mostly a container for
describing a remote device. It is currently only used in ShardedTensor,
ShardingSpec and RemoteModule.
Once we actually have a consolidated remote device proposal, this class can be
extended appropriately if needed.
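The core of the "workername/device" handling can be sketched like this (a simplification: the real `_remote_device` class also accepts rank-based forms such as "rank:0/cuda:0" and validates the device string):

```python
def parse_remote_device(remote_device: str):
    """Split a "workername/device" string into (worker, device).

    A bare worker name with no "/" defaults the device to "cpu".
    """
    if "/" in remote_device:
        worker, _, device = remote_device.partition("/")
    else:
        worker, device = remote_device, "cpu"
    if not worker or not device:
        raise ValueError(f"Invalid remote device: {remote_device!r}")
    return worker, device
```

Centralizing this parsing is exactly the point of the container: ShardedTensor, ShardingSpec, and RemoteModule all consume the parsed parts instead of re-splitting the string themselves.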
ghstack-source-id: 135534086
Test Plan:
1) unit tests
2) waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D30170689
fbshipit-source-id: 1ac2e81c7a597dc40bf3fbf2c1168c382c66649f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111
Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.
Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.
Note that this API does not accept an overall main group. Like the APEX API `create_syncbn_process_group` [here](https://nvidia.github.io/apex/_modules/apex/parallel.html), it always uses the global world size and should only be applied when CUDA is available.
Closes: https://github.com/pytorch/pytorch/issues/53962
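By default the util partitions ranks into consecutive, equally sized blocks (one per machine). A small sketch of that partitioning logic, a simplification of what the util computes before creating the actual process groups, could be:

```python
def subgroup_ranks(world_size: int, group_size: int):
    """Partition ranks [0, world_size) into consecutive blocks of group_size.

    Mirrors the default grouping described above, where each subgroup is
    the set of ranks within one machine.
    """
    if group_size > world_size:
        raise ValueError("group_size must not exceed world_size")
    if world_size % group_size != 0:
        raise ValueError("world_size must be divisible by group_size")
    return [list(range(start, start + group_size))
            for start in range(0, world_size, group_size)]
```

The divisibility and size checks correspond to the `group_size_exceeds_world_size` and `world_size_not_divisible_by_group_size` cases exercised in the test plan below.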
ghstack-source-id: 130975027
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed
Reviewed By: rohan-varma
Differential Revision: D28495672
fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224
Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.
Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (e.g. allreduce on one rank and allgather on another).
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, where normally NCCL would just hang. Shape verification is omitted for allgather/allreduce_coalesced because they legitimately accept different-shaped tensors and don't error out.
This is done through an abstraction called `CollectiveFingerPrint`, which uses a helper process group to do the above verification. Concretely, we gather the data we need for each check into tensors, allgather them, and verify their equivalence.
Once all of this passes we simply dispatch the collective to the underlying pg.
Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
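Conceptually, the fingerprint check reduces to comparing a small per-rank record after the allgather. A sketch (the real `CollectiveFingerPrint` is C++ and serializes into tensors; the class and function names here are illustrative):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class CollectiveFingerprint:
    """Simplified per-rank description of an attempted collective."""
    op_type: str
    shapes: Tuple[Tuple[int, ...], ...]  # one shape per input tensor

def verify_fingerprints(fingerprints: List[CollectiveFingerprint]):
    """Check that the allgathered fingerprints agree across ranks.

    Raises a descriptive error instead of letting the backend hang
    on a mismatched collective.
    """
    first = fingerprints[0]
    for rank, fp in enumerate(fingerprints[1:], start=1):
        if fp.op_type != first.op_type:
            raise RuntimeError(
                f"Mismatched collective op: rank 0 ran {first.op_type}, "
                f"rank {rank} ran {fp.op_type}")
        if fp.shapes != first.shapes:
            raise RuntimeError(
                f"Mismatched tensor shapes on rank {rank}: "
                f"{fp.shapes} vs {first.shapes}")
```

Only after this check passes does the wrapper dispatch the collective to the underlying process group.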
ghstack-source-id: 129735687
Test Plan: ci
Reviewed By: zhaojuanmao
Differential Revision: D28023981
fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687
The diff makes sure that users can pass the following parameters:
* master_addr
* master_port
* node_rank
* use_env
The diff implements StaticTCPRendezvous, which creates a store with a listener on the rank 0 agent.
The diff modifies caffe2/rendezvous: if the worker process is launched with the torchelastic agent, the worker processes will create a PrefixStore("worker/") from a TCPStore without a listener.
The diff adds macro functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27643206
fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
Summary:
Generally wildcard imports are bad for the reasons described here: https://www.flake8rules.com/rules/F403.html
This PR replaces wildcard imports with an explicit list of imported items where possible, and adds a `# noqa: F403` comment in the other cases (mostly re-exports in `__init__.py` files).
This is a prerequisite for https://github.com/pytorch/pytorch/issues/55816, because currently [`tools/codegen/dest/register_dispatch_key.py` simply fails if you sort its imports](https://github.com/pytorch/pytorch/actions/runs/742505908).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55838
Test Plan: CI. You can also run `flake8` locally.
Reviewed By: jbschlosser
Differential Revision: D27724232
Pulled By: samestep
fbshipit-source-id: 269fb09cb4168f8a51fd65bfaacc6cda7fb87c34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353
Remove all the code branches that will only be executed when `device_ids > 1`.
Some helper functions are also removed:
1. `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`
The next step is deprecating `_module_copies` field.
ghstack-source-id: 126201121
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27552201
fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52887
This diff changes the way DDP does its model consistency check (i.e. `_verify_replicas_across_processes`).
There were a few things that could be improved with the way we verify models across processes in DDP initialization:
1. We should do this check before syncing module states in DDP init; otherwise with the Gloo backend the sync will throw first, whereas we would like to throw the error that corresponds to different models on different ranks. To do this, we move the methods to standalone C++ functions (not part of the reducer) and move the check to before parameter synchronization.
2. Refactor DDP init in the following ways:
- Run the model consistency check before creating the reducer
- Add helper functions to build the params passed into the reducer
- Add a helper function to call `_verify_model_across_ranks`
- Move `def parameters` to a helper function `_get_parameters` to be used more broadly within DDP
In follow up changes we will add the ability to detect which rank had inconsistent model (https://github.com/pytorch/pytorch/issues/52876 would be useful for this to determine which ranks(s) had errors).
ghstack-source-id: 123171877
Test Plan:
CI/unittest
buck test mode/dev-nosan //caffe2/test/distributed:c10d
BACKEND="nccl" WORLD_SIZE="2" ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_model_diff_across_ranks
Reviewed By: zhaojuanmao
Differential Revision: D26565290
fbshipit-source-id: f0e1709585b53730e86915e768448f5b8817a608
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52481
Adds an API `get_debug_mode` that the distributed package and users can call to retrieve the debug mode. There are no functionality changes yet, but we wanted to land the bare-bones function and add relevant debug-mode logging in follow-up diffs.
ghstack-source-id: 122471216
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D26508972
fbshipit-source-id: d1153774f8697bc925a05db177d71c0566d25344