Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70165
Implements activation offload support in the checkpoint_wrapper API via
save_on_cpu hooks. We avoid modifying the torch.utils.checkpoint implementation
and instead compose offloading with checkpointing, using the save_on_cpu hook for
the former.
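For illustration, a minimal sketch of the composition described above (not the code from this PR; the wrapper class name is hypothetical) that layers the save_on_cpu saved-tensor hook on top of reentrant checkpointing:
```
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class _OffloadCheckpointWrapper(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, module: nn.Module, offload_to_cpu: bool = False):
        super().__init__()
        self.module = module
        self.offload_to_cpu = offload_to_cpu

    def forward(self, *args, **kwargs):
        if self.offload_to_cpu:
            # save_on_cpu moves saved activations to (pinned) CPU memory during
            # forward and copies them back to the original device in backward.
            with torch.autograd.graph.save_on_cpu(pin_memory=True):
                return checkpoint(self.module, *args, **kwargs)
        return checkpoint(self.module, *args, **kwargs)
```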
ghstack-source-id: 146078900
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D33228820
fbshipit-source-id: 98b4da0828462c41c381689ee07360ad014e808a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70164
Implement Alban's suggestion to make checkpoint_wrapper an nn.Module
instead of patching the forward pass, which is too hacky.
ghstack-source-id: 146011215
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D33214696
fbshipit-source-id: dc4b3e928d66fbde828ab60d90b314a8048ff7a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69955
Implements a checkpoint_wrapper function, which wraps an nn.Module with checkpointing so users won't have to call checkpoint() every time they want to checkpoint the module.
Currently, only support for reentrant-based checkpointing is added, and it is only tested with FSDP to unblock a use case.
Future work is to add support for the new checkpointing API, add more tests, and upstream to torch.utils.checkpoint.
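A brief usage sketch of the wrapper described above (the module path follows the code added around these PRs and may differ across releases):
```
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import checkpoint_wrapper

# Wrap individual submodules so their activations are recomputed in backward
# instead of being stored during forward.
model = nn.Sequential(
    checkpoint_wrapper(nn.Linear(128, 128)),
    nn.ReLU(),
    checkpoint_wrapper(nn.Linear(128, 10)),
)
```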
ghstack-source-id: 145811242
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D33107276
fbshipit-source-id: c4a1c68d71d65713a929994940a8750f73fbdbdb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67401
Some minor changes to distributed quantization: mainly changes the namespace and adds some notes for future code dedup.
ghstack-source-id: 143910067
Test Plan: wait for CI
Reviewed By: mrshenli
Differential Revision: D31979269
fbshipit-source-id: 85a2f395e6a3487dd0b9d1fde886eccab106e289
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66649
Some minor changes to distributed quantization: mainly changes the namespace and adds some notes for future code dedup.
ghstack-source-id: 141336191
Test Plan: wait for CI
Reviewed By: cbalioglu
Differential Revision: D31663043
fbshipit-source-id: 2f96b7346e9c90df5ab2536767f8301eb86a9c79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65197
1. The constructor accepts a local optimizer instance instead of the local optimizer's constructor inputs and class type.
2. The parameters are read from the local optimizer's param_groups instead of a separate input.
Proposal: https://github.com/pytorch/pytorch/issues/59699
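A short sketch of the revised construction described above (class locations follow torch.distributed at the time and are best treated as assumptions):
```
import torch
import torch.nn as nn
from torch.distributed.optim import PostLocalSGDOptimizer
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager

# Assumes the default process group has already been initialized.
model = nn.Linear(16, 16)

# Pass a fully constructed local optimizer instead of its class + ctor inputs;
# the parameters are taken from local_optim.param_groups.
local_optim = torch.optim.SGD(model.parameters(), lr=0.1)
averager = PeriodicModelAverager(period=4, warmup_steps=100)
opt = PostLocalSGDOptimizer(optim=local_optim, averager=averager)
```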
ghstack-source-id: 138307226
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity
Reviewed By: rohan-varma
Differential Revision: D31007439
fbshipit-source-id: bbb0526e6763ef76775b85088571506b3942c722
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63154
The collective quantization API now supports all_to_all, all_to_all_single, and all_gather. The test is also included.
ghstack-source-id: 136856877
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/algorithms/quantization:DistQuantizationTests_nccl -- test_all_to_all_single_bfp16
Reviewed By: wanchaol
Differential Revision: D30255251
fbshipit-source-id: 856f4fa12de104689a03a0c8dc9e3ecfd41cad29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63059
Adds the BFP16 quantization method to OSS. Currently only CPU is supported.
ghstack-source-id: 136639528
Test Plan: Imported from OSS
Reviewed By: wanchaol
Differential Revision: D30194538
fbshipit-source-id: ac248567ad8028457c2a91b77ef2ce81709fce53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63895
When updating model parameters, updating `parameter.data` is no longer recommended, because the `data` field will be deprecated in the future.
The replacement is `tensor.set_`.
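A minimal sketch of the pattern described above, assuming the update happens under `torch.no_grad()` (as model averaging does):
```
import torch

with torch.no_grad():
    param = torch.nn.Parameter(torch.randn(4))
    averaged = torch.randn(4)
    # Old pattern (discouraged): param.data = averaged
    # New pattern: swap in the averaged tensor's storage in-place.
    param.set_(averaged)
```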
ghstack-source-id: 136593433
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_post_localSGD_optimizer_parity
Reviewed By: SciPioneer
Differential Revision: D30526178
fbshipit-source-id: a1ac0ec3665d8623edd5bf94f01c1132daff5c00
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63867
When updating model parameters, updating `parameter.data` is no longer recommended, because the `data` field will be deprecated in the future.
The replacement is `tensor.set_`.
ghstack-source-id: 136531233
Test Plan: buck test mode/dev-nosan //caffe2/test/distributed:distributed_nccl_spawn -- test_periodic_model_averager
Reviewed By: SciPioneer
Differential Revision: D30513613
fbshipit-source-id: 402efb9c30fafc3f285bebc631639f656ceae585
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260
Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.
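A simplified sketch of such a hook (the built-in version lives among the DDP communication hooks; the details here are illustrative rather than the exact implementation):
```
import torch
import torch.distributed as dist


def bf16_compress_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()
    # Cast the bucket's flattened gradients to bfloat16 before the allreduce.
    compressed = bucket.buffer().to(torch.bfloat16).div_(world_size)
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        # Cast back to fp32 by copying into the bucket's original buffer.
        buf = bucket.buffer()
        buf.copy_(fut.value()[0])
        return buf

    return fut.then(decompress)

# Registration: ddp_model.register_comm_hook(state=None, hook=bf16_compress_hook)
```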
Reviewed By: SciPioneer
Differential Revision: D30238317
fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63277
`PostLocalSGDState` requires a subgroup. To initialize this subgroup, a global process group must be initialized. However, this imposes a restriction that a hook state can only be provided after distributed environment initialization, which is not compatible with the Lightning DDP plugin setup, where the hook state should be provided before distributed environment initialization.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 135848575
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD
Reviewed By: cbalioglu
Differential Revision: D30325041
fbshipit-source-id: 7b870166d096d306c3f2f7c69816a705cec0bebd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62142
Created a wrapper that takes the collective op and a quantization type as arguments. It quantizes the input, performs the collective op, and then dequantizes the output.
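An illustrative sketch of the quantize -> collective -> dequantize pattern described above (this is not the actual wrapper added by this diff; the helper name and the BF16 cast stand in for the real quantization routines):
```
import functools

import torch
import torch.distributed as dist


def _quantized_collective(collective_fn, quantize, dequantize):
    """Hypothetical helper: quantize the input, run the collective on the
    quantized tensors, then dequantize into the caller's output tensor."""
    @functools.wraps(collective_fn)
    def wrapper(output, input, *args, **kwargs):
        q_input = quantize(input)
        q_output = torch.empty_like(q_input)
        collective_fn(q_output, q_input, *args, **kwargs)
        output.copy_(dequantize(q_output))
    return wrapper


# Example: all_to_all_single with a bfloat16 cast standing in for quantization.
bf16_all_to_all_single = _quantized_collective(
    dist.all_to_all_single,
    quantize=lambda t: t.to(torch.bfloat16),
    dequantize=lambda t: t.to(torch.float32),
)
```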
Test Plan:
Tested through distributed_gloo_fork.
e.g., buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_all_to_all_quantized
Reviewed By: wanchaol
Differential Revision: D29682812
fbshipit-source-id: 79c39105ff11270008caa9f566361452fe82a92e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods set_tensor(.) and get_tensor() in the Python-exposed API (backed by the C++ logic) with buffer() and set_buffer(.) for a cleaner interface.
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605
Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.
`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```
Reviewed By: mrshenli
Differential Revision: D30055544
Pulled By: andwgu
fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592
Reland #62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
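For illustration, a comm hook written against the renamed accessors might look like the following sketch (old names noted in comments; `buffer()` assumes the companion get_tensor() -> buffer() rename):
```
import torch.distributed as dist


def logging_allreduce_hook(state, bucket):
    idx = bucket.index()          # was bucket.get_index()
    last = bucket.is_last()       # was bucket.is_the_last_bucket_to_allreduce()
    grads = bucket.gradients()    # was bucket.get_per_parameter_tensors()
    params = bucket.parameters()  # was bucket.get_model_params_for_bucket()
    print(f"bucket {idx}: {len(grads)} grads over {len(params)} params, last={last}")
    tensor = bucket.buffer().div_(dist.get_world_size())
    return dist.all_reduce(tensor, async_op=True).get_future().then(
        lambda fut: fut.value()[0]
    )
```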
ghstack-source-id: 134848352
Test Plan: unit test
Reviewed By: andwgu
Differential Revision: D30049431
fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62532
This method is not stable at this time, so avoid releasing it when the DDP communication hook feature is released as stable.
ghstack-source-id: 134787831
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_hook_with_optimizer_parity
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_hook_then_optimizer_nccl
Reviewed By: rohan-varma
Differential Revision: D30031222
fbshipit-source-id: e03a8e13fee5116a5ddd724eb76316ee98f2a676
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510
`GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity:
1) get_index -> index
2) is_the_last_bucket_to_allreduce -> is_last
3) get_per_parameter_tensors -> gradients
4) get_model_params_for_bucket -> parameters
Test Plan:
Local run comprehensive test with following results:
https://pxl.cl/1Ml8b
The two timeout failures are most likely environment related and only fail on my devserver.
Reviewed By: SciPioneer
Differential Revision: D30024161
fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.
Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
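A hypothetical end-to-end sketch of the usage described above (module paths and exact signatures are assumptions based on this summary; assumes the process group is already initialized with one GPU per rank):
```
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

rank = dist.get_rank()
model = nn.Linear(32, 32).to(rank)
ddp_model = DDP(model, device_ids=[rank])
zero_optim = ZeroRedundancyOptimizer(
    ddp_model.parameters(),
    optimizer_class=torch.optim.SGD,
    lr=0.01,
    overlap_with_ddp=True,
)
ddp_model.register_comm_hook(
    None, hook_with_zero_step(allreduce_hook, ddp_model, zero_optim)
)

for _ in range(5):
    # The first two iterations are vacuous (no parameter updates) while the
    # DDP bucketing is finalized and the local optimizer is initialized.
    ddp_model(torch.randn(8, 32, device=rank)).sum().backward()
    zero_optim.step()  # the optimizer work actually runs from the comm hook
```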
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157
Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`
These were tested on the AI AWS cluster.
An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.
Both approaches have been verified using an internal accuracy benchmark.
Reviewed By: mrshenli
Differential Revision: D29971046
Pulled By: andwgu
fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457
Specify `Future[torch.Tensor]` as DDP communication hook return type, which should be explicitly a single tensor. The previous API takes a list that has a single tensor.
Note that now the typing info no longer accepts the internal type of `torch._C.Future`, which does not support torchscript and hence cannot support `Future[torch.Tensor]`.
ghstack-source-id: 134771419
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type
Reviewed By: rohan-varma
Differential Revision: D30007390
fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62389
Simplify the implementation of `parseHookResult` of `PythonCommHook`, by not directly accepting the output of allreduce, which is a tensor list.
Address the comment on https://github.com/pytorch/pytorch/pull/62074#discussion_r675303280
Additionally, the formatter is also applied to `OptimizerHookState` and `hook_then_optimizer`.
ghstack-source-id: 134626246
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork
Reviewed By: rohan-varma
Differential Revision: D29982485
fbshipit-source-id: 5b27cc5ef09d2f87c1ade4c0feef7eacc1af3a9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62392
The constructor of `PeriodicModelAverager` does not need to accept parameters.
ghstack-source-id: 134626245
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29986446
fbshipit-source-id: 6a8b709e4383a3c44b9e60955fbb067cd2868e76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62111
This base class will be passed to the post-localSGD optimizer in the next PR. This way, the same post-localSGD optimizer can choose different model averaging algorithms.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134489187
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: rohan-varma
Differential Revision: D29884954
fbshipit-source-id: 1dc5e35c58895902991567f633afd621c7108938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62079
Adds support for kwarg arguments in the functional optimizer running as a hook.
ghstack-source-id: 134330379
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29838127
fbshipit-source-id: 2ab051ef5f0dff19c145ebe2260668b927ba47b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62177
Reland of https://github.com/pytorch/pytorch/pull/61678
Fix CI failure by gating including torchvision model on whether torchvision is available or not.
ghstack-source-id: 134282165
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29904101
fbshipit-source-id: 47e799eb4a90acbbda91c5857ea00de3045d49f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61678
This diff makes the following changes:
- Add a `step_param` method to the `_FunctionalSGD` class, written similarly to `step` but for a single param.
- Implement a communication hook wrapper that runs a given comm. hook and then applies the functional SGD step.
- Verify that this is equal to regular allreduce + SGD optimizer.
ghstack-source-id: 133567598
ghstack-source-id: 134263399
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29701447
fbshipit-source-id: 183954593b82a092414623292f9b10e675fef96e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62105
This is for the preparation of wrapping the averager as an optimizer, which can only accept parameters rather than a module.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 134213572
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_average_parameters
Reviewed By: rohan-varma
Differential Revision: D29883693
fbshipit-source-id: 474ba924a0b05068b12f163fb74582bccf314964
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62074
Since SPMD mode is retired, the comm hook result will always be a single tensor.
This can improve the comm hook developer experience, as there is no need to add an extra `[0]` when accessing the preceding future's result.
Closes: https://github.com/pytorch/pytorch/issues/61914
ghstack-source-id: 134164593
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork
Reviewed By: rohan-varma
Differential Revision: D29864732
fbshipit-source-id: 59fe6dd78b66214b1788514ad4d236039d9bda31
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.
**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)
The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.
This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_early_termination=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.
Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process has not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.
**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).
**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.
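Following the recap above, a minimal sketch of a custom `Joinable` (non-underscored names shown for readability; at the time of this PR the classes are still prefixed with `_`, and the module path is an assumption):
```
import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join, Joinable, JoinHook


class CounterJoinHook(JoinHook):
    def __init__(self, counter):
        self.counter = counter

    def main_hook(self):
        # Shadow the per-iteration all-reduce on behalf of the joined process.
        t = torch.zeros(1, device=self.counter.device)
        dist.all_reduce(t, group=self.counter.process_group)

    def post_hook(self, is_last_joiner: bool):
        pass


class Counter(Joinable):
    def __init__(self, device, process_group):
        super().__init__()
        self.device = device
        self.process_group = process_group

    def __call__(self):
        # The one-line addition before the real per-iteration collective.
        Join.notify_join_context(self)
        t = torch.ones(1, device=self.device)
        dist.all_reduce(t, group=self.process_group)

    def join_hook(self, **kwargs) -> JoinHook:
        return CounterJoinHook(self)

    @property
    def join_device(self) -> torch.device:
        return self.device

    @property
    def join_process_group(self):
        return self.process_group
```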
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555
Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```
Reviewed By: zou3519
Differential Revision: D29690359
Pulled By: andwgu
fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61207
The model averager now must be combined with the post-localSGD DDP communication hook. It will skip model averaging for the first K steps, because the post-localSGD communication hook will run global gradient averaging during this phase.
Proposal: https://github.com/pytorch/pytorch/issues/59699
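A rough sketch of the combination described above (module paths and default-subgroup behavior are assumptions; assumes the default process group is initialized with one GPU per rank):
```
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState, post_localSGD_hook,
)
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager
from torch.nn.parallel import DistributedDataParallel as DDP

rank = dist.get_rank()
model = nn.Linear(32, 32).to(rank)
ddp_model = DDP(model, device_ids=[rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

# Global gradient allreduce for the first 100 steps, then gradient allreduce
# within the (here intra-node) subgroup only.
state = PostLocalSGDState(process_group=None, subgroup=None, start_localSGD_iter=100)
ddp_model.register_comm_hook(state, post_localSGD_hook)

# Model averaging kicks in after the same warm-up period.
averager = PeriodicModelAverager(period=4, warmup_steps=100)

for _ in range(200):
    loss = ddp_model(torch.randn(8, 32, device=rank)).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    averager.average_parameters(ddp_model.parameters())
```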
ghstack-source-id: 133371335
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_periodic_model_averager
Reviewed By: pritamdamania87
Differential Revision: D29523738
fbshipit-source-id: 3fa9611046e1c0afa4bda78aa3ba200fa2a5fa4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61206
Create a communication hook to run post-local SGD. This will be combined with model averager component to better support local SGD.
In contrast to the previous approach, which ran local gradient averaging + global model averaging at each step for the first K steps, we now plan to run only global gradient averaging at each step for the first K steps, just like normal DDP. This can give us two advantages:
1) For some optimizers, model averaging can cause discrepancy in optimizer states. If we still do global gradient averaging for the first K steps, we can defer such discrepancy until we actually start local SGD.
2) Gradient averaging during the first K steps runs only one allreduce, which overlaps with the backward pass, so it should also be more efficient.
Proposal: https://github.com/pytorch/pytorch/issues/59699
ghstack-source-id: 133371322
Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_hook_parity_post_localSGD
Reviewed By: pritamdamania87
Differential Revision: D29523292
fbshipit-source-id: 3f215f7150f2917c2781278fad759530c685ea2c
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.
**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)
There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.
The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all processes have joined, with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.
The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.
**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757
Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```
Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.
Reviewed By: iramazanli, mrshenli
Differential Revision: D29624636
Pulled By: andwgu
fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60891
This fix is particularly useful for local SGD when the averaging period is very small, which may cause the conflict between gradient allreduce within per-machine subgroup and the global parameter allreduce by the communication world.
ghstack-source-id: 132564252
Test Plan:
f281873295 (#Try1) failed due to the conflict between global process group and subgroup.
```
<Thread(configerator-monitor-singleton, started 139839806633728)>
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/jetter.gson7tr3/configerator/client.py", line 348, in _monitor_loop
    self._parent_thread.join(self._interval_ms / 1000)
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1015, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/local/fbcode/platform009/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
```
Fixed after adding an explicit sync: f282044866, f282241800
Reviewed By: rohan-varma
Differential Revision: D29434597
fbshipit-source-id: a4f777fc26f379639f85fda32de425cd3b337b33