Summary: Following the implementation of the updated ATen backend for MTIA and the diffs enabling in-tree view ops (D75266206, D75385411), we can remove the custom logic in the reducer that handled MTIA view operations.
Test Plan:
CI
Reviewed By: egienvalue
Differential Revision: D77843212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157882
Approved by: https://github.com/albanD, https://github.com/andyanwang
Differential Revision: D72437251
Enable rebuilding of the bucket order when find_unused_parameters=True.
It should never be worse than not rebuilding the bucket order when find_unused_parameters=True:
1. for cases where the bucket order in the first iteration is the same as the parameter order, rebuilding the bucket order changes nothing
2. for cases where the bucket order in the first iteration differs from the parameter order, there are two sub-cases:
a. the bucket order does not change after the 1st iteration even though the graph is dynamic and there are unused parameters; in this case, rebuilding the bucket order yields a performance gain
b. the bucket order changes after the 1st iteration due to the dynamic graph; in this case, neither the parameter order nor the 1st-iteration bucket order is ideal, so whether we rebuild the bucket order does not matter
Enabling bucket order rebuilding when find_unused_parameters=True therefore helps case 2.a without hurting cases 1 and 2.b; a minimal setup sketch follows.
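As a concrete reference, here is a minimal sketch of the DDP setup this flag affects; the toy model and torchrun-based rendezvous are illustrative assumptions, and only `find_unused_parameters=True` is from this PR's scope.
```
# Minimal DDP sketch; the toy model and rendezvous are illustrative
# assumptions, not part of this PR. Run with torchrun on GPU machines.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")  # torchrun provides rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    model = torch.nn.Linear(16, 16).cuda(local_rank)
    # With this change, the reducer can rebuild its bucket order after the
    # first iteration even when find_unused_parameters=True.
    ddp = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
    ddp(torch.randn(8, 16, device=f"cuda:{local_rank}")).sum().backward()

if __name__ == "__main__":
    main()
```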
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404
Approved by: https://github.com/rohan-varma, https://github.com/fegin
The C++ Reducer is silently incorrect under compiled autograd (CA): its implementation no-ops the collective. I'm guessing it was no-op'd because, with DDP + the python reducer, the C++ reducer is still being initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152735
Approved by: https://github.com/fegin
ghstack dependencies: #153300, #152689
Summary: Add an option that allows skipping the allreduce of unused parameters. This can significantly improve training throughput when the model has a large number of unused parameters.
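As a hedged illustration of the new option (the keyword name `skip_all_reduce_unused_params` below is an assumption inferred from this summary, not confirmed by the commit message):
```
# Hedged sketch: `skip_all_reduce_unused_params` is an assumed keyword
# name, not confirmed by this commit message.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")  # torchrun provides rank/world size
ddp = DDP(
    torch.nn.Linear(4, 4),
    find_unused_parameters=True,
    skip_all_reduce_unused_params=True,  # assumed flag: skip allreduce of unused params
)
```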
Test Plan: unit tests, CI
Differential Revision: D72282069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151503
Approved by: https://github.com/mrshenli
This gives DDP applications NVLink SHARP with zero-copy on H100+ platforms.
It uses fewer SMs and reduces memory contention between the NCCL kernels and compute kernels.
Added env `DDP_DISABLE_COMM_MEM` as a back-out option:
```
An environment variable to disable comm-optimized memory pool.
Default is 0, which means comm-optimized memory pool is enabled.
Users can set it to 1 in case of seeing regression or OOM (because this
comm MemPool may not share space with regular compute MemPool).
```
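A minimal back-out sketch; the assumption here is that the variable should be set before DDP construction so the reducer reads the intended value:
```
# Back-out sketch: disable the comm-optimized memory pool. Assumption:
# the variable is read at DDP construction time, so set it early.
import os
os.environ["DDP_DISABLE_COMM_MEM"] = "1"  # 1 = disabled; default 0 = enabled
```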
Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589
Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj
Fix a bug in the `_update_process_group` DDP API where we didn't correctly reset `local_used_map_` and a few other variables. This resulted in errors like `Encountered gradient which is undefined, but still allreduced by...`
Added a unit test as well that reproduced the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128092
Approved by: https://github.com/awgu, https://github.com/fegin
Fixes #69031, #42793
This PR fixes the bug introduced in #54981 where parameters used within a `no_sync` scope are not respected when `find_unused_parameters` is set to `True`. The `local_used_map_` and `numGradHooksTriggeredMap_` variables should be updated regardless of the `no_sync` state.
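For context, a minimal sketch of the gradient-accumulation pattern this fix concerns (the toy model and step counts are illustrative assumptions):
```
# Sketch of the no_sync pattern affected by this fix: gradients produced
# inside no_sync() must still update local_used_map_ and
# numGradHooksTriggeredMap_ when find_unused_parameters=True.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")  # torchrun provides rank/world size
ddp = DDP(torch.nn.Linear(8, 8), find_unused_parameters=True)

for step in range(4):
    inputs = torch.randn(2, 8)
    if step % 2 == 0:
        with ddp.no_sync():               # accumulate locally, no allreduce
            ddp(inputs).sum().backward()
    else:
        ddp(inputs).sum().backward()      # gradients sync on this step
```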
Tested and verified with fairseq2 and wav2vec2 ASR finetuning recipe. All gradients are correctly synced across workers as expected after applying this fix.
Co-authored-by: Kaushik Ram Sadagopan <kaushikram2811@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124193
Approved by: https://github.com/rohan-varma
**Summary**
The reducer of `DistributedDataParallel` is implemented in C++, and it is not easy to trace the allreduces launched by the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The change allows us to use `compiled_autograd` to trace the allreduces, which can later be optimized (fused) in Inductor.
**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook`s are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook`s will be compiled by `compiled_autograd`.
3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter (a simplified sketch follows).
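A simplified, hedged sketch of one allreduce per gradient, using the public `register_post_accumulate_grad_hook` API for illustration; the actual `compiled_accum_grad_hook` registration lives inside `DistributedDataParallel`, not in user code.
```
# Illustrative only: launch one allreduce per parameter gradient via a
# post-accumulate-grad hook, mirroring the behavior described above.
import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # torchrun provides rank/world size
model = torch.nn.Linear(8, 8)

def allreduce_grad(param: torch.Tensor) -> None:
    # Average this parameter's freshly accumulated gradient across ranks.
    dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
    param.grad.div_(dist.get_world_size())

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(allreduce_grad)

model(torch.randn(2, 8)).sum().backward()  # hooks fire once per parameter
```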
**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.
The bucketing is done in a separate PR.
Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular, if an error occurred during backward, DDP would end up in an incorrect state.
To address this, this PR enhances the unit test to cover this case and fixes the resetting of the Reducer state.
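For context, a hedged sketch of the API's intended use; the recovery flow and new-group construction below are illustrative assumptions, and `_update_process_group` is a private API.
```
# Hedged sketch: swap the process group underneath an existing DDP
# instance, e.g., after replacing a failed rank. Illustrative only.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")    # torchrun provides rank/world size
ddp = DDP(torch.nn.Linear(4, 4))

new_pg = dist.new_group()          # illustrative: rebuild the group somehow
ddp._update_process_group(new_pg)  # reducer state (e.g., local_used_map_) must reset
```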
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
Summary:
The getCvar* functions allow us to accept multiple environment variables for the same value. This lets us deprecate some variables in favor of others while still allowing users to rely on the old variables for a transition period.
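A Python rendition of the lookup pattern described above; the real getCvar* helpers are C++ in c10d, and the newest-name-first precedence and deprecation warning below are assumptions for illustration.
```
# Illustrative Python version of the multi-env-var lookup; assumption:
# names are ordered newest-first and the first one set wins, with a
# warning when a deprecated (older) name is used.
import os
import warnings

def get_cvar_string(names, default):
    for i, name in enumerate(names):
        value = os.environ.get(name)
        if value is not None:
            if i > 0:  # an older, deprecated name was used
                warnings.warn(f"{name} is deprecated; use {names[0]} instead")
            return value
    return default

nccl_debug = get_cvar_string(["TORCH_NCCL_DEBUG", "NCCL_DEBUG"], "")
```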
Test Plan: OSS CI
Reviewed By: fduwjj, XilunWu
Differential Revision: D51225487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113797
Approved by: https://github.com/fduwjj