**Summary**

The reducer of `DistributedDataParallel` is implemented in C++, which makes it hard to trace the allreduces it launches. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled, so that `compiled_autograd` can trace the allreduces and Inductor can later optimize (fuse) them.

**Key Logic**

1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used, and `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first `forward()` call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` hooks are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` hooks are compiled by `compiled_autograd`.
3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of its parameter.

**Bucketing**

The compiled backward is slow because the allreduces are not bucketed; we rely on Inductor to bucket them. The bucketing is done in a separate PR.

Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
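
Below is a minimal sketch of the per-parameter hook idea, not the code from this PR: it assumes `torch.distributed` is already initialized and uses the public `Tensor.register_post_accumulate_grad_hook` API to fire once each gradient has been accumulated. The helper name `_register_per_param_allreduce_hooks` and the hook body are illustrative only.

```python
import torch
import torch.distributed as dist


def _register_per_param_allreduce_hooks(module: torch.nn.Module):
    """Illustrative sketch (not the actual DDP code): register one hook per
    parameter that all-reduces its gradient as soon as it is accumulated,
    so compiled_autograd can trace each allreduce individually."""
    # Assumes the default process group has already been initialized.
    world_size = dist.get_world_size()
    handles = []
    for param in module.parameters():
        if not param.requires_grad:
            continue

        def _compiled_accum_grad_hook(p: torch.Tensor) -> None:
            # Average the local gradient across ranks: one allreduce per
            # parameter, with no bucketing (fusion is left to Inductor).
            p.grad.div_(world_size)
            dist.all_reduce(p.grad)

        # Fires after the gradient has been accumulated into param.grad
        # during backward().
        handles.append(
            param.register_post_accumulate_grad_hook(_compiled_accum_grad_hook)
        )
    # Keep the handles so the hooks can be removed later, e.g. if the
    # module turns out not to be compiled (step 2 of Key Logic).
    return handles
```

Returning the hook handles mirrors the deregistration step above: if the first `forward()` finds that the module is not compiled, the hooks can be removed and the C++ reducer path used instead.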