Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457
Specify `Future[torch.Tensor]` as the DDP communication hook return type: the hook must now explicitly return a single tensor, whereas the previous API returned a list containing a single tensor.
Note that the typing info no longer accepts the internal type `torch._C.Future`, which does not support TorchScript and hence cannot express `Future[torch.Tensor]`.
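For illustration, a minimal sketch of a hook under the new contract; this assumes the future-returning work handle of the NCCL/Gloo backends and the `get_tensor()` bucket accessor in use at the time of this stack (renamed in later releases):
```
import torch
import torch.distributed as dist

def allreduce_hook(process_group, bucket) -> torch.futures.Future[torch.Tensor]:
    group = process_group if process_group is not None else dist.group.WORLD
    # The bucket exposes a single flattened gradient tensor.
    tensor = bucket.get_tensor().div_(group.size())
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    # New contract: resolve to the single tensor, not a one-element list.
    return fut.then(lambda f: f.value()[0])
```
It would be installed as usual via `ddp_model.register_comm_hook(state=None, hook=allreduce_hook)`.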
ghstack-source-id: 134771419
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type
Reviewed By: rohan-varma
Differential Revision: D30007390
fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232
Logs the bucket sizes in DDP logging so that we know which workflow ran with which bucket size config. This will be used to verify how changing bucket sizes in DDP affects perf.
Based on the test, we can see an inconsistency in where the "first" bucket actually sits: it is last before buckets are rebuilt and first afterwards.
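A hedged sketch of reading the new field back through the private logging API (`net` and `rank` assumed; the exact key name is an assumption):
```
import torch

ddp_model = torch.nn.parallel.DistributedDataParallel(net.to(rank), device_ids=[rank])
ddp_logging_data = ddp_model._get_ddp_logging_data()
# "bucket_sizes" is the assumed key under which the sizes are logged.
print(ddp_logging_data.get("bucket_sizes"))
```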
ghstack-source-id: 134663867
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29922299
fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62231
`compute_bucket_assignment_by_size` is responsible for setting per-bucket size limits; return this information from the function so that callers are aware of the size limit for each bucket.
This is currently not being consumed, but will be in the next diff when we log bucket size limits to DDP logging. This will help us run experiments under different bucket size configs and analyze the impact.
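A hedged sketch of how a caller could consume the extra return value through the private Python binding, assuming the post-change signature returns the assignment together with the per-bucket limits:
```
import torch
import torch.distributed as dist

tensors = [torch.empty(n) for n in (2 ** 20, 2 ** 18, 2 ** 22)]
bucket_indices, per_bucket_size_limits = dist._compute_bucket_assignment_by_size(
    tensors, [1024 * 1024, 25 * 1024 * 1024]  # first-bucket and default caps, in bytes
)
print(per_bucket_size_limits)  # the size limit applied to each bucket
```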
ghstack-source-id: 134480575
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D29919056
fbshipit-source-id: dd5a096fa23d22e5d9dc1602899270a110db4a19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61992
This test was previously not enabled for static graph. To ensure the
feature is supported with DDPSink, enable it for static graph, which
currently passes outputs to DDPSink.
ghstack-source-id: 134471406
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29830887
fbshipit-source-id: 2d3f750d9eb4289558ed21acccd172d83d9b82cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61497
Reverts "[DDP] Support not all outputs used in loss calculation".
ghstack-source-id: 133589153
Test Plan: CI, ping authors to run their workflow on this diff
Reviewed By: zhaojuanmao
Differential Revision: D29642892
fbshipit-source-id: 81a15b9ab3329602f34d3758bb0799005a053d4f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61401
Reverts https://github.com/pytorch/pytorch/pull/59359, which is causing a few internal issues in DDP training. We will evaluate the internal use cases and reland it after reconsidering the design.
Also moves `prepare_for_backward` back into the forward pass instead of the DDP sink for `find_unused_parameters`. This ensures that hooks will always fire in the backwards pass, which is behavior that internal training workloads rely on. Calling `prepare_for_backward` in the DDPSink autograd function is not the best solution, since other autograd threads may be executing concurrently, which can cause races.
ghstack-source-id: 133589152
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29608948
fbshipit-source-id: f060f41cd103573ddff8da50cdbb6c56768dab46
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.
**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)
The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.
This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("Dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager: (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_early_termination=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.
Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process has not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.
**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).
**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting the prefix `_` from names); a minimal sketch follows the list:
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):`.
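A minimal sketch of those steps, with a hypothetical class `C` and placeholder collectives; `Join`, `Joinable`, and `JoinHook` are the classes from this PR (underscore-prefixed in the actual code), and `notify_join_context()` is assumed to take the `Joinable` instance:
```
import torch
import torch.distributed as dist

class CJoinHook(JoinHook):
    def __init__(self, c):
        self.c = c

    def main_hook(self):
        # Called repeatedly while some process has not yet joined: shadow
        # C's per-iteration collective so non-joined processes don't hang.
        dist.all_reduce(torch.zeros(1, device=self.c.device), group=self.c.pg)

    def post_hook(self, is_last_joiner: bool):
        # Called once after all processes have joined.
        pass

class C(Joinable):
    def __init__(self, device, process_group):
        Joinable.__init__(self)  # reserves the join_config field
        self.device = device
        self.pg = process_group

    def join_hook(self, **kwargs) -> JoinHook:
        return CJoinHook(self)

    @property
    def join_device(self) -> torch.device:
        return self.device

    @property
    def join_process_group(self):
        return self.pg

    def step(self):
        # Placed before C's per-iteration collective communications.
        Join.notify_join_context(self)
        dist.all_reduce(torch.ones(1, device=self.device), group=self.pg)
```
Inside the training loop, `c.step()` would then be called under `with Join([c]):`, exactly as in the examples above.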
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555
Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```
Reviewed By: zou3519
Differential Revision: D29690359
Pulled By: andwgu
fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.
**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change once design discussions conclude.)
There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.
The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all processes have joined, with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide the process group and device information to be used by the join context manager implementation for collective communications.
The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.
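For example, a hedged usage sketch under this API (`ddp_model` and `loader` assumed; DDP is assumed to expose its hook via `_join_hook()`, per the Notes below):
```
join_hooks = [ddp_model._join_hook()]
with _Join(join_hooks, enable=True, throw_on_early_termination=False):
    for batch in loader:  # per-rank loaders may have uneven lengths
        ddp_model(batch).sum().backward()
```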
**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757
Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```
Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.
Reviewed By: iramazanli, mrshenli
Differential Revision: D29624636
Pulled By: andwgu
fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61376
Now that SPMD is retired, the `get_tensors` API has become `get_tensor`. Fix some comments that refer to the obsolete API.
Also, the `allreduce` hook example does not divide by the world size internally, which is incorrect.
ghstack-source-id: 133174272
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D29596857
fbshipit-source-id: 2046b185225cd6d1d104907b5f9b4009b6e87c99
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61020
Makes uneven input support with the `join` context manager work with custom communication hooks, ensuring that the two features work well together. Added relevant unit tests covering the allreduce and PowerSGD hooks.
Instead of calling `allreduce`, the join manager now calls into `_run_reduction_hook` which will automatically run whatever hook is installed.
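A hedged sketch of the now-supported combination, using the public comm-hook and join APIs (`net`, `rank`, and `loader` assumed):
```
import torch
import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default_hooks

ddp_model = torch.nn.parallel.DistributedDataParallel(net.to(rank), device_ids=[rank])
ddp_model.register_comm_hook(state=None, hook=default_hooks.allreduce_hook)
# With uneven per-rank inputs, the join manager now runs the installed
# hook (via _run_reduction_hook) instead of a hard-coded allreduce.
with ddp_model.join():
    for batch in loader:
        ddp_model(batch).sum().backward()
```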
ghstack-source-id: 132950108
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29480028
fbshipit-source-id: c91dc467a62c5f1e0ec702a2944ae3deb10f93f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61019
Changes the uneven input logic from running allreduce directly to using the `GradBucket` structure. This enables support for comm hooks with join in the next diff.
ghstack-source-id: 132950107
Test Plan: ci
Reviewed By: SciPioneer
Differential Revision: D29480027
fbshipit-source-id: 7c42c53653052f71b86a75e14a5fc7ae656433f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61017
Removes the SPMD nested vector logic from this codepath. This is mostly in preparation for the next diffs in this stack, which enable support for join with comm hooks.
ghstack-source-id: 132924223
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29477360
fbshipit-source-id: f8132a94b1abfe28586aa78ac47e13a7ce6bb137
Summary:
We recently landed a change ensuring that, when running under ``find_unused_parameters=True``, not all module outputs have to be used in the loss computation for DDP to work as expected. Mention this update in the documentation and add some additional clarification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60275
Reviewed By: SciPioneer
Differential Revision: D29502609
Pulled By: rohan-varma
fbshipit-source-id: ddb3129cff9492018e61813413b30711af212309
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60882
Fixes https://github.com/pytorch/pytorch/issues/60733, which
identified an issue with a previous PR: DDP no longer supported cases where
newly created tensors without a grad_fn are returned. In those cases the
grad_fn was set to that of the `DDPSink` custom backward, which resulted in
errors during the backwards pass.
This PR fixes the issue by ensuring we don't touch the `grad_fn` of the tensors
if it is `None`. Added relevant tests as well.
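A hedged sketch of the kind of forward this fixes (hypothetical module):
```
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        # The second output is freshly created and has grad_fn=None;
        # DDPSink must now leave its (absent) grad_fn untouched.
        return self.lin(x), torch.zeros(1)
```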
ghstack-source-id: 132632515
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29423822
fbshipit-source-id: a9e01046c7be50aa43ffb955f6e0f48fef4bc881
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59359
Move `prepare_for_backward` into `_DDPSink` backward instead of calling it in DDP forward pass so that we can run multiple backwards in DDP with `retain_graph=True`.
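A minimal sketch of the pattern this enables (`ddp_model` and `inputs` assumed):
```
loss = ddp_model(inputs).sum()
loss.backward(retain_graph=True)  # graph kept alive for another pass
loss.backward()                   # second backward over the same graph
```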
ghstack-source-id: 131774159
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28855226
fbshipit-source-id: 6b7b25d75b7696f5b5629078233433f97663d61c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60118
Pipe + DDP has a few issues:
1) With static graph, gradients are not synchronized on the first backward pass (i.e. the delayed allreduce is not run). Broken since https://github.com/pytorch/pytorch/pull/55248.
2) With find_unused_parameters=True, gradient synchronization also does not happen. Broken since https://github.com/pytorch/pytorch/pull/57081.
The reason for both cases is that calling `DDPSink.apply(output_tensor)` does not call the custom `backward` of `DDPSink` when the `output_tensor` is actually an `OwnerRRef`, which is the case when running DDP in `Pipe`. This is because we do `backward` on the `rref.local_value()` which does not have this autograd recording.
To fix, we unwrap the RRef and reconstruct it as needed, similar to the fix in https://github.com/pytorch/pytorch/pull/49908.
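A heavily hedged sketch of the unwrap-and-rewrap idea (the call shape of `_DDPSink.apply` is hypothetical; the actual fix lives in DDP's forward):
```
from torch.distributed.rpc import RRef

if isinstance(output, RRef):
    # Record DDPSink on the underlying tensor so its custom backward is
    # part of the autograd graph, then rewrap into an owner RRef.
    local = output.local_value()
    local = _DDPSink.apply(reducer_state, local)  # hypothetical arguments
    output = RRef(local)
```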
To test: all tests in pipe_with_ddp_test pass.
The reason these tests did not catch the errors earlier is that all ranks received the same model inputs, so if gradient synchronization did not occur, the grads would still match because the model is the same on all ranks (guaranteed by DDP). Fixed the tests to use different inputs across ranks.
ghstack-source-id: 131688187
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29167283
fbshipit-source-id: fe62310db2dc6de8519eb361b1df8ae4dfce3ab8
Summary: We updated the training scripts and re-trained the ResNeXt model with the msuru_suru_union and ig_msuru_suru_union datasets.
Test Plan:
Main command line to run:
*./deeplearning/projects/classy_vision/fb/projects/msuru_suru/scripts/train_cluster.sh*
Config we used is *msuru_suru_config.json*, which is "Normal ResNeXt101 with finetunable head".
Experiments:
- msuru_suru_union f279939874
  - Train/test split
    - msuru_suru_union_dataset_train_w_shard: 143,632,674 rows
    - msuru_suru_union_dataset_test_w_shard: 1,831,236 rows
  - Results
    {F625232741}
    {F625232819}
- ig_msuru_suru_union f279964200
  - Train/test split
    - ig_msuru_suru_union_dataset_train_w_shard: 241,884,760 rows
    - ig_msuru_suru_union_dataset_test_w_shard: 3,477,181 rows
  - Results
    {F625234126}
    {F625234457}
Differential Revision: D29154971
fbshipit-source-id: d534d830020f4f8e596bb6b941966eb84a1e8adb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284
Logs a few python-side errors to DDP logging.
TODO: Most Python errors actually have to do with user input correctness, so they throw before the reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in Scuba?
ghstack-source-id: 130412973
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28820290
fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281
Adds the ability to log when the reducer/DDP encounters an error. We add the fields "has_error" and "error" to indicate that an error has occurred in this iteration and that the other fields (performance stats) are not guaranteed to be updated.
Errors encountered in python-side DDP will be added in the next diff.
ghstack-source-id: 130412974
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28652717
fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59065
Cleaner to use work.result() instead of sending back the tensor from
this function.
ghstack-source-id: 130338813
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28551203
fbshipit-source-id: d871fed78be91f0647687ea9d6fc86e576dc53a6
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57878.
This adds `NCCL_ASYNC_ERROR_HANDLING` as a DDP relevant environment variable and includes a check for that variable in the test `test_dump_DDP_relevant_env_vars()`. Notably, the modified test now checks for the new variable but does not check for any of the other previously-existing relevant environment variables that were not already tested for (e.g. `NCCL_BLOCKING_WAIT`).
The change was tested via the following on an AI AWS cluster:
`WORLD_SIZE=2 BACKEND=nccl gpurun pytest test/distributed/test_distributed_spawn.py -k test_dump_DDP_relevant_env_vars -vs`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59109
Reviewed By: H-Huang, SciPioneer
Differential Revision: D28761148
Pulled By: andwgu
fbshipit-source-id: 7be4820e61a670b001408d0dd273f65029b1d2fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58595
No longer needed since this list is always of size 1.
ghstack-source-id: 129498229
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28548426
fbshipit-source-id: 7d6dba92fff685ec7f52ba7a3d350e36405e2578
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57081
Changes in this diff:
1) Enable the passthrough autograd function when find_unused_parameters=True.
2) With the above, move prepare_for_backward, which does the unused-parameter checking logic, to the beginning of the backwards pass, only when find_unused_parameters=True.
3) Enhance the unused-parameter checking to account for outputs not being used in the loss.
The way (3) is implemented is by triggering the autograd hook corresponding to parameters that did not participate in loss computation; see the sketch below. Since they did not participate, the autograd hook is triggered with a gradient of None, and the reducer handles this appropriately to ensure that the gradient is not touched.
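A short sketch of the newly supported pattern (hypothetical two-output module `TwoHeadNet`; `rank` and `inputs` assumed):
```
import torch

ddp = torch.nn.parallel.DistributedDataParallel(
    TwoHeadNet().to(rank), device_ids=[rank], find_unused_parameters=True
)
out_used, out_unused = ddp(inputs)
loss = out_used.sum()  # out_unused does not contribute to the loss
loss.backward()        # hooks for out_unused's params fire with grad None
```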
Tested by ensuring that when a model output is not used in the loss, the corresponding grad is not modified. Also verified that the grads are the same in the local vs. DDP training case, and that gradients are genuinely untouched, i.e. a grad that is originally None stays None afterwards rather than becoming zero.
Note that in this diff we are not enabling the pass through autograd function for regular case find_unused_parameters=False because that has a much bigger blast radius and needs additional careful analysis especially with regard to the performance.
ghstack-source-id: 129425139
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28048628
fbshipit-source-id: 71d7b6af8626804710017a4edd753787aa9bba61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58089
Make the DDP logging API private.
ghstack-source-id: 128796419
Test Plan: unit test
Reviewed By: rohan-varma
Differential Revision: D28365412
fbshipit-source-id: 374c01d443ffb47a3706f59e296d6e47eb5f4c85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58105
When find_unused_parameters=True but static_graph is also set, static graph handles the unused-parameter accounting, so this code path is not needed.
ghstack-source-id: 128736289
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28371954
fbshipit-source-id: 0b42a9c0fd2bba26a0de288436e9c7139e292578
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57073
Enhances use of DDPSink to work for all output types DDP supports as per https://github.com/pytorch/pytorch/issues/55876.
TODO: Add additional testing for tuple, list, dict return types
ghstack-source-id: 128726768
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D27756985
fbshipit-source-id: 2e0408649fb2d6a46d6c33155a24c4c1723dd799
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57999
Make the DDP logging API private.
ghstack-source-id: 128607185
Test Plan: unit test
Reviewed By: rohan-varma
Differential Revision: D28338485
fbshipit-source-id: bd2ae7c78904e93eed88be91876f5a832b5b7886
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55248
This PR enables static graph training when users call _set_static_graph(); see the sketch after the list. This can help support more use cases in DDP without performance regression, and can potentially improve performance when there are unused parameters in the graph.
1. The first iteration records graph states, such as how many times a grad is calculated and whether the grad is used or not. It then queues a delay_all_reduce callback to all-reduce the grads.
2. Since an autograd callback is associated with the current target graph task, the delay_all_reduce callback should be associated with the outermost backward graph task. A DDP sink layer is added in the DDP forward loop so that we can queue the delay_all_reduce callback in the sink layer.
3. After the first iteration, DDP uses the saved graph states to determine whether a grad is used or not and whether a grad is ready for communication.
4. Buckets are rebuilt in the second iteration, after graph states are recorded in the first iteration.
5. If the graph states change, DDP will throw errors.
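A minimal sketch of opting in via the private API (`net` and `rank` assumed):
```
import torch

ddp_model = torch.nn.parallel.DistributedDataParallel(net.to(rank), device_ids=[rank])
ddp_model._set_static_graph()  # call once, before the first iteration
```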
ghstack-source-id: 128599464
Test Plan: unit tests. adding more tests
Reviewed By: rohan-varma
Differential Revision: D27539964
fbshipit-source-id: 74de1ad2719465be67bab8688d6e293cd6e3a246
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57771
This mapping didn't work properly when certain parameters didn't
require grad. Fixed that and added a test.
ghstack-source-id: 128527537
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D28265636
fbshipit-source-id: 7b342ce012b2b7e33058b4c619ffb98992ed05b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56755
Rehash of https://github.com/pytorch/pytorch/pull/47488
Adds a flag to DDP's join() context manager that enables throwing a StopIteration across all ranks.
To do this, we implement the design in #47250. When running with this flag, we schedule an additional allreduce in case a joined rank needs to throw a StopIteration. In the non-joined ranks' forward pass, we match this allreduce, and if at least one rank tells us to throw, we raise a StopIteration.
Tested by modifying existing tests, as well as adding additional tests validating that this works with SyncBatchNorm models and a model with custom collectives in the forward pass.
Perf benchmarks are currently running and will be posted when available, but we expect a small (~2%) perf reduction when enabling this feature due to the blocking allreduce. Hence we will only recommend it for models with collective comm in the forward pass.
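A hedged usage sketch; the flag name is assumed to be `throw_on_early_termination`, matching the rest of this stack (`ddp_model` and `loader` assumed):
```
try:
    with ddp_model.join(throw_on_early_termination=True):
        for batch in loader:
            ddp_model(batch).sum().backward()
except StopIteration:
    pass  # all ranks stop together once any rank exhausts its data
```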
ghstack-source-id: 127883115
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D27958369
fbshipit-source-id: c26f7d315d95f17bbdc28b4a0561916fcbafb7ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54995
Provide a private DDP API to explicitly set that the training graph is static, and also set this flag in the logger.
ghstack-source-id: 127755713
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D27444965
fbshipit-source-id: 06ef1c372296815944b2adb33fbdf4e1217c1359
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56730
Add a test to verify that DDP with torch.amp produces the same results when using gradient_as_bucket_view=True and False.
The torch.amp scale factor does not depend on old gradients, so it is not affected by gradient_as_bucket_view=True or False; see how torch.amp is implemented here: https://github.com/pytorch/pytorch/pull/33366/files.
This diff verified that DDP works as expected with amp.GradScaler and amp.autocast when using gradient_as_bucket_view=True and False; a sketch of the pattern follows.
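A minimal sketch of the verified pattern (`net`, `rank`, `loader`, and `loss_fn` assumed):
```
import torch

scaler = torch.cuda.amp.GradScaler()
ddp = torch.nn.parallel.DistributedDataParallel(
    net.to(rank), device_ids=[rank], gradient_as_bucket_view=True
)
optimizer = torch.optim.SGD(ddp.parameters(), lr=0.01)
for inputs, target in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(ddp(inputs), target)
    scaler.scale(loss).backward()  # grads land in (views of) the buckets
    scaler.step(optimizer)
    scaler.update()
```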
ghstack-source-id: 127526358
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D27950132
fbshipit-source-id: 8ed26935fdcb4514fccf01bb510e31bf6aedac69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56641
Currently ddpLoggingData is a flat struct, which requires internal DDP developers and external users to know the struct field names. This is not flexible for deleting or adding fields in the future, and it also makes ddpLoggingData hard to access.
With maps/dicts, developers and users can easily access the fields without knowing the field names, and it is easier to add or remove fields.
Since C++ does not support maps whose values have different types, ddpLoggingData now contains two maps.
ghstack-source-id: 127482694
Test Plan: unit tests
Reviewed By: SciPioneer
Differential Revision: D27923723
fbshipit-source-id: c90199c14925fc50ef219000e2f809dc7601cce1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55946
As `ddp_gpu_size` field of `SyncBatchNorm` will always be 1 for GPU modules, remove this field and the relevant code.
ghstack-source-id: 126883498
Test Plan: waitforbuildbot
Reviewed By: zhaojuanmao
Differential Revision: D27746021
fbshipit-source-id: b4518c07e6f0c6943fbd7a7548500a7d4337126c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55075
Constructs a mapping of parameter names and passes it into Reducer, so that error messages about unused parameters/not all parameters getting gradient can report which parameters were involved.
Use case:
1) User runs DDP forward + bwd, and it has some unused parameters that will result in ddp error in next iteration
2) Next forward pass calls `Reducer::ensure_prior_reduction_finished()` where we check all params got gradient from the previous bwd pass. DDP would throw here in this case.
3) Reducer maintains mapping and tracks used parameters, and computes which parameters did not get gradient and logs this as part of the error.
Implementation details:
0) The following is only enabled for debug modes of INFO or DETAIL.
1) To save memory, we don't map param -> param name so that we don't have to copy the entire tensor, instead we map param_index -> param_name and use the existing concept of variable_index in Reducer to look up parameter names.
2) DDP constructs the param index -> param name mapping, where the name is the fully qualified name f"{module_name}:{param_name}", and passes it into Reducer; see the sketch after this list.
3) Reducer maintains per-iteration std::set<int> of variable indices that have had `mark_variable_ready` called.
4) When some params go unused, we take a set difference to detect the unused params.
5) Unittests to test the logged unused params, as well as for nested modules, are added
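A hedged Python-side sketch of the mapping in (2); the iteration order and requires-grad filtering are assumptions (`net` assumed):
```
param_index_to_name = {
    i: f"{module_name}:{param_name}"  # fully qualified name, per (2)
    for i, (module_name, param_name) in enumerate(
        (m_name, p_name)
        for m_name, module in net.named_modules()
        for p_name, param in module.named_parameters(recurse=False)
        if param.requires_grad
    )
}
```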
ghstack-source-id: 126581051
Test Plan: CI, UT
Reviewed By: zhaojuanmao
Differential Revision: D27356394
fbshipit-source-id: 89f436af4e74145b0a8eda92b3c4e2af8e747332
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55353
Remove all the code branches that will only be executed when `device_ids > 1`.
Some helper functions are also removed:
1. `_verify_replicas_within_process` and `verify_replicas_within_process`
2. `_replicate_modules_within_process`
3. `parallel_apply`
The next step is deprecating the `_module_copies` field.
ghstack-source-id: 126201121
Test Plan: waitforbuildbot
Reviewed By: rohan-varma
Differential Revision: D27552201
fbshipit-source-id: 128d0216a202f5b1ba4279517d68c3badba92a6c