Covers the case where the output of one collective feeds the input of another collective.
e.g. TP + FSDP: all_gather(TP+DP-sharded param on the TP dim) -> all_gather(DP-sharded buffer on the DP dim)
Fixes a bug where the reordering pass specifically exempted wait nodes from dependencies.
Note: this exemption was incorrect, so it should be removed. But it was also put there for a reason: to help move collectives past wait nodes that are not related to that collective. After this fix, reordering performance may be worse, and we need to find a smarter way to decide whether a particular wait node is a blocker for a given collective.
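A minimal sketch of such a check, assuming Inductor-scheduler-like node objects with `get_name`, `get_buffer_names`, and `read_writes.reads` (illustrative only, not the actual pass):
```
def wait_blocks_collective(wait_node, collective_node, name_to_producer):
    """Return True if collective_node transitively reads a buffer produced by wait_node."""
    produced = set(wait_node.get_buffer_names())
    frontier, seen = [collective_node], set()
    while frontier:
        node = frontier.pop()
        if node.get_name() in seen:
            continue
        seen.add(node.get_name())
        for dep in node.read_writes.reads:
            if dep.name in produced:
                return True  # the collective depends on the wait's output
            producer = name_to_producer.get(dep.name)
            if producer is not None:
                frontier.append(producer)
    return False  # unrelated wait: safe to move the collective past it
```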
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157489
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #156879
The reason for the inner/outer method split is to keep the outer method
conforming to the typedef for a comms graph pass, which returns a single
object, while allowing unit tests to call the inner method, which returns
extra metadata useful for testing the pass. The logging lives in the inner
part so that it also works during unit testing.
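A minimal sketch of the split, with illustrative names (not the actual pass): the outer function keeps the one-return-value signature expected of a comms graph pass, while the inner function returns extra stats and owns the logging, so tests exercise the same code path.
```
import logging
from typing import Any

overlap_log = logging.getLogger(__name__)

def reorder_comms_pass(snodes: list) -> list:
    """Outer entry point: conforms to the comms-graph-pass typedef (one return value)."""
    reordered, _stats = _reorder_comms_pass_inner(snodes)
    return reordered

def _reorder_comms_pass_inner(snodes: list) -> tuple[list, dict[str, Any]]:
    """Inner implementation: also returns metadata that unit tests can assert on."""
    stats: dict[str, Any] = {"moves": 0, "exposed_comm_ns": 0.0}
    reordered = list(snodes)  # ... the actual reordering would happen here ...
    # Logging lives here so it also runs when tests call the inner function directly.
    overlap_log.info("reorder summary: %s", stats)
    return reordered, stats
```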
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156879
Approved by: https://github.com/IvanKobzarev
Summary:
This change introduces two comm override APIs, `set_custom_all_gather` and `set_custom_reduce_scatter`, to allow custom all-gather and reduce-scatter behavior respectively.
This lets users control how the comm buffers are allocated and the exact comm implementation, for flexibility.
For details, see docstring in `Comm` in `_fsdp_api.py`
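The sketch below is only a hypothetical illustration of what such an override can control (buffer allocation and the collective call itself); the class shape, method names, and wiring are assumptions, and the authoritative interface is the `Comm` docstring referenced above.
```
import torch
import torch.distributed as dist

class MyAllGatherComm:
    """Hypothetical custom all-gather comm: owns buffer allocation and the collective call."""

    def allocate(self, shape, dtype, device):
        # Control where the all-gather I/O buffers come from
        # (e.g. a registered memory pool instead of the default allocator).
        return torch.empty(shape, dtype=dtype, device=device)

    def all_gather(self, output_tensor, input_tensor, group):
        # Control the exact collective implementation.
        dist.all_gather_into_tensor(output_tensor, input_tensor, group=group)

# Hypothetical wiring; the real method name/signature is defined in `_fsdp_api.py`:
# fsdp_module.set_custom_all_gather(MyAllGatherComm())
```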
Related PR:
https://github.com/pytorch/pytorch/pull/150564
Test Plan: CI
Differential Revision: D75714362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155189
Approved by: https://github.com/weifengpy
Recent versions of NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers, which accelerates communication and reduces the resource footprint of NCCL's kernels (which in turn reduces contention).
This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool that allows caching and reuse.
FSDP2 is also particularly well suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it needs to (de)interleave the parameters to/from these private buffers anyway.
This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
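A rough sketch of the mechanism this builds on, assuming the `MemPool` plumbing exposed by the NCCL process group (the `mem_allocator` attribute name is a best-effort assumption, and the FSDP2 flag itself is not shown):
```
import torch
import torch.distributed as dist

# Run under torchrun with one CUDA device per rank.
dist.init_process_group("nccl")
backend = dist.group.WORLD._get_backend(torch.device("cuda"))

# Allocations made from this pool are registered with NCCL ("user buffer
# registration"), so collectives on them can take the zero-copy path.
pool = torch.cuda.MemPool(backend.mem_allocator)  # assumed attribute name
with torch.cuda.use_mem_pool(pool):
    staging = torch.empty(1 << 20, device="cuda")

dist.all_reduce(staging)  # zero-copy when the NCCL version and hardware support it
```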
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed
Summary:
For SimpleFSDP, this `visualize_overlap` function is throwing errors.
This looks like an oversight: `visualize_overlap` is called twice here, once inside a try-except and once without, so this wraps both call sites the same way.
Test Plan:
:)
Reviewed By: Microve
Differential Revision: D75985733
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155222
Approved by: https://github.com/yf225
This is a new pass to replace the pre-existing passes. It has the same
basic goal, to achieve communication overlap (latency hiding), but also
constrains the solution to not increase peak memory.
The principles of operation are detailed in code comments, but
summarized here:
- never reorder collectives relative to each other (TBD if we should
relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, as long as the move does
not increase current memory beyond the estimated peak memory (a sketch of this
greedy step follows the list)
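A minimal sketch of that greedy step, using toy data structures rather than the real scheduler nodes:
```
# Toy sketch of the "move forward one position" step: swap the collective with
# the node just before it only if no data dependency blocks the swap and the
# recomputed running memory never exceeds the previously estimated peak.
def try_move_collective_forward(order, idx, mem_delta, peak_memory, depends_on):
    """order: list of nodes; idx: position of the collective to move;
    mem_delta[node]: net memory change attributed to scheduling that node;
    depends_on(a, b): True if node a reads a buffer produced by node b."""
    if idx == 0:
        return idx
    prev, coll = order[idx - 1], order[idx]
    if depends_on(coll, prev):  # data dependency: the collective cannot hop over prev
        return idx
    trial = order[:idx - 1] + [coll, prev] + order[idx + 1:]
    running = trial_peak = 0
    for node in trial:
        running += mem_delta[node]
        trial_peak = max(trial_peak, running)
    if trial_peak > peak_memory:  # the move would raise peak memory: stop here
        return idx
    order[idx - 1], order[idx] = coll, prev
    return idx - 1
```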
The pass logs a summary table for each graph to TORCH_LOGS=overlap.
e.g. (the exact format may have since been tweaked, but this shows the idea):
```
rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node initial exposed final exposed improvement limiting factor moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ----------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------- --------------- ------------- ------------------- -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns) 12141.6 6514.53 5627.08 prefetch limit 75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns) 32265.8 28429.2 3836.61 data dependency 78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns) 10800.6 10732.3 68.254 peak memory 1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns) 10809.5 10809.5 0 data dependency 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
This PR moves `decide_global_ordering_of_comms` to run first, before all other Inductor scheduler passes, so that downstream passes have correct dependency-tracking info. It also moves the peak-memory pass and the overlap pass to the end of all passes, because they need to be the final decision makers on node order to achieve the desired peak memory and overlap.
This PR fixes hard-to-debug peak memory pass errors caused by incorrect tracking in `.unmet_dependencies` during the enablement of SimpleFSDP on internal models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142822
Approved by: https://github.com/eellison
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting the 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
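A short usage sketch with the new import path (the model is a placeholder, and the process group is assumed to be initialized already, e.g. via torchrun):
```
import torch
from torch.distributed.fsdp import fully_shard

# Assumes torch.distributed has been initialized with a CUDA backend.
model = torch.nn.Transformer(d_model=256, nhead=8).cuda()
for layer in model.encoder.layers:
    fully_shard(layer)   # shard each transformer block first
fully_shard(model)       # then the root module picks up the remaining parameters
```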
**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`
Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting the 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Using `fsdp.set_` for the unsharded_param in-place update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics: if the user explicitly stores an alias of the unsharded_param during execution of their module code, that alias gets updated correctly when `copy_` writes into the unsharded_param, whereas if we just swap out the unsharded_param's storage via `set_`, that user-saved alias does not get updated.
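A plain-tensor illustration of the aliasing argument (ordinary `torch.Tensor` ops, not the FSDP-specific `fsdp.copy_`/`fsdp.set_` variants):
```
import torch

param = torch.zeros(4)
alias = param                    # e.g. an alias the user's module code stashed away
param.copy_(torch.ones(4))       # in-place data update
print(alias)                     # tensor([1., 1., 1., 1.]) -- the alias sees the update

param2 = torch.zeros(4)
alias2 = param2
param2.set_(torch.full((4,), 2.0))  # storage swap instead of in-place copy
print(alias2)                    # tensor([0., 0., 0., 0.]) -- the alias still sees the old storage
```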
This PR also implements a graph pass that removes the resizes and the copy when a resize_(full) -> copy_ -> resize_(0) pattern is found.
------
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
This PR applies the Inductor compute/comm reordering passes to Traceable FSDP2 to achieve compute/comm overlap. Note that the overlap is not maximally optimized yet; follow-up work will be done in subsequent PRs.
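For reference, enabling the reordering passes for a compiled run looks roughly like this (config knob and pass names as of around this PR; they may have changed since):
```
import torch._inductor.config as inductor_config

inductor_config.reorder_for_compute_comm_overlap = True
inductor_config.reorder_for_compute_comm_overlap_passes = [
    "sink_waits",                   # push wait nodes as late as possible
    "raise_comms",                  # issue comms as early as possible
    "reorder_compute_for_overlap",  # slot compute between comm and its wait
]
```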
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614
Approved by: https://github.com/yifuwang
ghstack dependencies: #131510
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)
This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify dependencies, e.g. from the `i+1`-th all-gather group node to the `i`-th all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms"); see the toy sketch below.
The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
Resubmit of #128979
`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.
Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`
So we are first over-aggressive in removing `WeakDep`s, and then add an ad-hoc fixup.
This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operations to fuse.
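A simplified illustration of the rule the new check encodes, using toy types rather than Inductor's `WeakDep`: the weak dependency is fusable only when the read and the write cover the same buffer with the same indexing, i.e. a genuine in-place update.
```
from dataclasses import dataclass

@dataclass(frozen=True)
class MemAccess:
    buffer: str
    index: str     # symbolic index expression, e.g. "c0"
    size: tuple

def fusable_weak_dep(read: MemAccess, write: MemAccess) -> bool:
    return (
        read.buffer == write.buffer
        and read.index == write.index
        and read.size == write.size
    )

# Reading buf0[c0] while writing buf0[c0] is a plain in-place mutation and may
# fuse; reading buf0[c0 + 1] while writing buf0[c0] may not.
assert fusable_weak_dep(MemAccess("buf0", "c0", (128,)), MemAccess("buf0", "c0", (128,)))
assert not fusable_weak_dep(MemAccess("buf0", "c0 + 1", (128,)), MemAccess("buf0", "c0", (128,)))
```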
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835
Approved by: https://github.com/lezcano
Resubmit of #129325
Previously each mutation was represented by a `MutationOutput` operation, which
was a new scheduler node that had to be scheduled immediately afterwards.
Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.
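A toy before/after picture of the representation change (own dataclasses, not the real scheduler types): the mutating op is now a single node whose outputs include one `MutationOutput` buffer per mutated input, instead of one trailing node per mutation.
```
from dataclasses import dataclass, field

@dataclass
class MutationOutput:
    name: str            # e.g. "buf3", aliasing the mutated input
    mutated_input: str   # e.g. "arg0_1"

@dataclass
class SchedulerNodeSketch:
    name: str
    outputs: list = field(default_factory=list)

node = SchedulerNodeSketch("op_copy_")
node.outputs.append(MutationOutput("buf3", "arg0_1"))
node.outputs.append(MutationOutput("buf4", "arg1_1"))  # multiple mutations, one node
```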
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano