With fsdp, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation, and make the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline:
```
When an operation mutates a buffer in-place, the scheduler creates a new buffer name
to track the "before" and "after" states, even though they share the same memory.
The mutated buffer represents a rename with zero allocation and deallocation cost.
During dependency tracking, we transfer dependencies from the mutated name back to
the original buffer, ensuring the original memory is only freed when all aliases
are done.
This handles cases where a buffer has multiple non-overlapping aliases - rather than
trying to assign free costs to individual aliases, we forward all alias dependencies
to the original buffer.
Consider:
buf0 = op0()
buf1 = mutation_op_(buf0)
del buf0
...
op(buf1)
del buf1
The only memory events are the creation prior to op0, and the deletion following buf1.
```
As @IvanKobzarev 's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can a bit of a pain to pinpoint which part of our memory calculation is incorrect.
This pr also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices
### CHANGES
- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py
Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.
- torch/testing/_internal/common_distributed.py
extended common utility functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)
This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predicable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").
The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
Since the raise_comms and sink_waits passes are also scheduling-based, we can now implement reorder_compute_for_overlap as an optional step in the same pass. Merging them into the same pass greatly simplifies the logic and makes it easier to reason about the synergy between different passes.
- The unit tests are now fixed and re-enabled.
- Verified that the pass produces good schedulling w/ Llama3 70B in torchtitan (the scheduling was sub-optimal before this PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130573
Approved by: https://github.com/Chillee
ghstack dependencies: #129980
The tests for `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes.
Correctness issues:
- The original passes did not take mutation into consideration and may yield semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput).
Effectiveness issues:
- The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all.
- The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node.
This PR:
- Make the passes handle mutation correctly.
- Instead of moving individual comm/wait nodes, schedule all node using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues.
- The horizontal fusion of prolofues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never out-perform scheduling the comm node earlier. Also in most cases, this clone is eliminated via in-place reuse. Therefore, we tell the scheduler to not fuse it.
- Enable the tests in CI.
Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980
Approved by: https://github.com/yf225
As discussed with @mlazos and @Chillee in the Inductor group chat, we need the concept of `GroupedSchedulerNode` to be able to express nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).
This is particularly important for comm reordering and fine-grained control of peak memory. For Traceable FSDP2, there are two very important requirements:
- At any time, there must be only one AllGather in flight. However, our existing comm reordering pass will naturally raise **all** of AllGather ops to the beginning of the graph, which will clearly blow up memory usage. Instead, we leverage GroupedScheduleNode which provides simple connection points to build the "chaining" on. i.e. we use it to express the schedule `(copyin + AllGather1) -> (AllGather1Wait+copyout) -> (copyin + AllGather2) -> (AllGather2Wait+copyout) ...` by setting up fake dep between the GroupedScheduleNode, which is a very clean and easy-to-understand way to express this schedule.
- The "comms" in FSDP2 are not just comms, but a combination of compute and comm. We must prevent other nodes from being scheduled in-between that set of nodes, otherwise we are artificially delaying the release of comm buffer memory which makes the peak memory usage quite bad. This is particularly pronounced for `AllGatherWait+copyout`.
From these two requirements, we derive the behavior of `GroupedSchedulerNode`: it contains nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).
----
Q: Can we leverage `ir.Subgraph`?
A: I looked into the possibility of using `ir.Subgraph` to implement this, but realized that:
1. `ir.Subgraph` requires defining the subgraph in FX IR.
2. There is no guarantee that the Inductor IR nodes that we want to group together will all have a corresponding FX IR node, because some of those Inductor IR nodes can potentially be dynamically generated by a custom pass in the scheduler (e.g. for merging multiple all-gathers into one big all-gather, and later we want to group that big all-gather with some other op). Dynamically generated Inductor IR node doesn't have a corresponding upstream FX IR node.
3. For the above reasons, we can't use the `ir.Subgraph`, and need to define a new (and more lightweight) concept of `GroupedSchedulerNode` to achieve the behavior we need (this PR).
----
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc::test_grouped_scheduler_node`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128568
Approved by: https://github.com/eellison, https://github.com/mlazos
This PR completely removes the Inductor IR for legacy functional collectives:
- Removed the `CollectiveKernel` hiearchy and `Wait`, as well as the corresponding lowerings. These IRs are target (i.e. Python) specific and don't model node dependencies propoerly (e.g. they rely on `never_reuse_buffers` for correct behavior). They've been superceded by `ir._CollectiveKernel`.
- Removed `InPlaceHint` and the scheduler logic for handling it. `InPlaceHint` is a codegen-time buffer reuse mechanism controlled by the IR's codegen. It's a bit hacky and overlaps with the default buffer reuse mechanism. Removing it since it is only used by legacy functional collectives.
- Removed `OutputBuffer` and `MultiOutputNoSizeAssert` which are designed for and only used by legacy functional collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124992
Approved by: https://github.com/Chillee, https://github.com/wanchaol
This PR implements intra-graph communication reordering pass on Inductor scheduler IR, based on Horace's previous PR #100762.
Main algorithm:
1. Greedily moves waits as late as possible (i.e. until we reach a use)
2. Greedily moves comms as early as possible (i.e. until we reach an input)
3. Move computes following simple heuristics to improve overlap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol