pytorch/torch/distributed
Andrew Gu c932b39739 [FSDP2] Added _set_unshard_async_op (#135523)
This PR adds a private API, `_set_unshard_async_op`, that runs the pre-forward and pre-backward all-gathers through the `async_op=True` path. All-gather allocations then happen in the default stream, which avoids inter-stream fragmentation.

With this option enabled, forward requires explicit prefetching (e.g., via the `unshard(async_op=True)` API) to overlap all-gathers with compute. The fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute.
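
The explicit forward prefetching described above can be sketched roughly as follows. This is a minimal illustration rather than the PR's own code: the helper `forward_with_prefetch` and the `RecordingLayer` stand-in are hypothetical names introduced here, and the only thing assumed of each layer is that it is callable and exposes `unshard(async_op=True)`, as named in the commit message. The stand-in only logs call order so the sketch runs without a process group or GPU.

```python
from typing import Any, List, Sequence, Tuple


def forward_with_prefetch(layers: Sequence[Any], x: Any) -> Any:
    """Run layers in order, issuing the next layer's all-gather early.

    Assumption: every layer is callable and exposes
    ``unshard(async_op=True)`` (the API named in the commit message).
    """
    if not layers:
        return x
    # Issue the first all-gather; there is no earlier compute to overlap with.
    layers[0].unshard(async_op=True)
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            # Issue the next layer's all-gather before this layer's compute
            # so the two have a chance to overlap.
            layers[i + 1].unshard(async_op=True)
        x = layer(x)
    return x


class RecordingLayer:
    """Hypothetical stand-in (not an FSDP module) that logs call order."""

    def __init__(self, name: str, log: List[Tuple]) -> None:
        self.name = name
        self.log = log

    def unshard(self, async_op: bool = False) -> None:
        self.log.append(("unshard", self.name, async_op))

    def __call__(self, x: int) -> int:
        self.log.append(("forward", self.name))
        return x + 1


log: List[Tuple] = []
out = forward_with_prefetch([RecordingLayer(n, log) for n in "abc"], 0)
```

In a real FSDP2 setup, the layers would presumably be the FSDP-wrapped submodules, with `_set_unshard_async_op(True)` called on the model beforehand; whether that call applies only to the module it is invoked on or also to its FSDP submodules should be checked against the source.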

Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523
Approved by: https://github.com/weifengpy
2024-09-10 19:28:02 +00:00
_composable [FSDP2] Added _set_unshard_async_op (#135523) 2024-09-10 19:28:02 +00:00
_shard [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200) 2024-08-15 15:50:19 +00:00
_sharded_tensor
_sharding_spec
_symmetric_memory [micro_pipeline_tp] support all _scaled_mm args (#131984) 2024-08-05 21:44:37 +00:00
_tensor [reland][dtensor] move DTensor to public namespace (#134203) 2024-09-08 17:08:40 +00:00
_tools Runtime Estimator for estimating GPU compute time (#134243) 2024-08-28 20:06:54 +00:00
algorithms [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200) 2024-08-15 15:50:19 +00:00
autograd
benchmarks [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200) 2024-08-15 15:50:19 +00:00
checkpoint [DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535) 2024-09-10 03:10:00 +00:00
elastic [elastic] support local_addr across all rendezvous impls (#135262) 2024-09-06 17:55:43 +00:00
examples
fsdp [reland][dtensor] move DTensor to public namespace (#134203) 2024-09-08 17:08:40 +00:00
launcher
nn Revert "added persistent option to buffers and namedbuffers (#132994)" 2024-08-09 18:14:53 +00:00
optim Revert "[BE] typing for decorators - _jit_internal (#131573)" 2024-07-28 03:29:32 +00:00
pipelining [PP] Fix zero bubble composability with DP (#134052) 2024-09-04 23:46:29 +00:00
rpc [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200) 2024-08-15 15:50:19 +00:00
tensor [CP] Extend CP to support load-balancing shards (#132442) 2024-09-09 18:04:38 +00:00
__init__.py Remove ProcessGroupRoundRobin (#132888) 2024-08-08 01:07:40 +00:00
_checkpointable.py
_composable_state.py
_functional_collectives_impl.py
_functional_collectives.py [reland][dtensor] move DTensor to public namespace (#134203) 2024-09-08 17:08:40 +00:00
_state_dict_utils.py [reland][dtensor] move DTensor to public namespace (#134203) 2024-09-08 17:08:40 +00:00
argparse_util.py
c10d_logger.py [DCP] Fix duplicated logging messages when enable both c10d and dcp l… (#130423) 2024-07-11 13:43:39 +00:00
collective_utils.py
constants.py
CONTRIBUTING.md
device_mesh.py [DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271) 2024-09-06 06:23:20 +00:00
distributed_c10d.py Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)" 2024-08-30 16:27:40 +00:00
launch.py
logging_handlers.py
remote_device.py [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206) 2024-07-14 08:17:52 +00:00
rendezvous.py
run.py fix torchrun log message (#131652) 2024-07-25 14:50:10 +00:00
utils.py [FSDP] casting input args with dataclass(frozen=True) (#135067) 2024-09-05 01:19:53 +00:00