Commit Graph

21 Commits

Author SHA1 Message Date
Yuanyuan Chen
fc8ac1216c [4/N] Remove unused loop variables in tests (#166690)
This PR removes unused loop variables in tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166690
Approved by: https://github.com/justinchuby, https://github.com/mlazos
2025-10-31 10:20:48 +00:00
Xuan Zhang
6e680ae8de add more restriction to fusion with large accumulate reads (#163163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163163
Approved by: https://github.com/yf225
2025-09-19 01:20:30 +00:00
Xuan Zhang
da669d51bf fusion of large accumulated reads only at ir level (#161978)
This is to revert some of the changes in https://github.com/pytorch/pytorch/pull/158667

In particular, we only disallow fusion of large accumulate read at IR level and not at scheduler level, as users can create their own custom fusion logics for the scheduler level.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161978
Approved by: https://github.com/yf225
2025-09-13 04:07:25 +00:00
Boyuan Feng
5f1010fbb3 [Graph Partition] Pass all OSS unit tests (#154667)
Graph partition leads to 6.2% speedup on vision_maskrcnn, 5.8% speedup on yolov3. [P1819700563](https://www.internalfb.com/phabricator/paste/view/P1819700563), 39.5% speedup on speech_transformer inference [P1830602200](https://www.internalfb.com/phabricator/paste/view/P1830602200), 85% speedup on speech_transformer training [P1831115315](https://www.internalfb.com/phabricator/paste/view/P1831115315).

Run the same diff on two days and both show speedup on average.

[first TorchInductor Benchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2021%20Jul%202025%2016%3A37%3A55%20GMT&stopTime=Mon%2C%2028%20Jul%202025%2016%3A37%3A55%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=75ef90fe89b82c967362a2d40fdf1af047202bc2&rBranch=main&rCommit=abcb24f4de11f8fedf2c2c9ff53b6092ef42306d)
<img width="1885" height="752" alt="image" src="https://github.com/user-attachments/assets/13bba9fc-5dbf-42ad-8558-d54f7e367b41" />

[second TorchInductorBenchmark ci run](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2023%20Jul%202025%2016%3A38%3A27%20GMT&stopTime=Wed%2C%2030%20Jul%202025%2016%3A38%3A27%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=bf/partition-turn-on&lCommit=66de27e29338c26b1be94733049868cb0309ea52&rBranch=main&rCommit=70d2e9ba455c3c910f6f95b24171c8eee7bc00bf)
<img width="2513" height="1030" alt="image" src="https://github.com/user-attachments/assets/3a413dcb-2314-4292-919a-7ca181f9eeac" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154667
Approved by: https://github.com/eellison
2025-08-12 04:37:58 +00:00
xinan.lin
8047421fbb [Linter] Expanding the scope of detecting device-bias code. (#159949)
Currently, the device-bias linter only targets functions decorated with @requires_gpu. This PR adds support for two new detection scenarios:
1. Detect device-bias code in functions decorated with @requires_triton.
2. Detect device-bias code for entire test suites that are defined as shared across GPUs. For example:
```
if __name__ == "__main__":
    if HAS_GPU:
        run_tests()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159949
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-08-09 09:41:16 +00:00
eellison
eb25a95a6e Fix inductor memory estimation when a single buf has multiple mutations. Add runtime verification of mem tracking (#159569)
With fsdp, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation, and make the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline:

```
    When an operation mutates a buffer in-place, the scheduler creates a new buffer name
    to track the "before" and "after" states, even though they share the same memory.
    The mutated buffer represents a rename with zero allocation and deallocation cost.
    During dependency tracking, we transfer dependencies from the mutated name back to
    the original buffer, ensuring the original memory is only freed when all aliases
    are done.
    This handles cases where a buffer has multiple non-overlapping aliases - rather than
    trying to assign free costs to individual aliases, we forward all alias dependencies
    to the original buffer.
    Consider:
        buf0 = op0()
        buf1 = mutation_op_(buf0)
        del buf0
        ...
        op(buf1)
        del buf1
    The only memory events are the creation prior to op0, and the deletion following buf1.
```

As @IvanKobzarev 's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can a bit of a pain to pinpoint which part of our memory calculation is incorrect.

This pr also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
2025-08-05 19:58:11 +00:00
Xuan Zhang
6b0526a2c4 ban fusion of large amount of reads (#158667)
This is an reland attempt of https://github.com/pytorch/pytorch/pull/157563, but insteading of introducing the `realize_acc_reads_size_threshold` config and setting to a default value, we set it to `None` for now to unblock an internal use case. Will deep dive into the issue and harden the logic in later PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158667
Approved by: https://github.com/yf225
2025-07-21 21:00:40 +00:00
Jack Taylor
7ebbf2cae7 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563) (#158550)
This reverts commit 8554c8007d #157563 due to causing a few breakages on ROCm

Reverted expected_results.csv to 26807dcf27

> @xuanzhang816 Sorry, but I have to revert this PR yet again because it clearly reintroduced failures on ROCm after the remerge: f4d8bc46c7/2
and the failures are still showing up on tip-of-tree on HUD

Context
https://github.com/pytorch/pytorch/pull/157563#issuecomment-3083350857

Needs to be relanded in non bc-breaking way, or sanity checked for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158550
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-07-17 19:47:41 +00:00
Xuan Zhang
8554c8007d [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-16 01:05:25 +00:00
PyTorch MergeBot
26807dcf27 Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit c062550a35.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/clee2000 due to broke test_linear_and_cel on main c062550a35, caused OOM? Also broken on PR, Dr. CI classification is wrong (claims the test is disabled by an issue but the issue is for a different test).  Also I'm pretty sure the expected results json is supposed to have a ton of empty lines, its to prevent merge conflicts, I will add it to the linter ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3074355331))
2025-07-15 16:35:55 +00:00
Xuan Zhang
c062550a35 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-14 22:27:21 +00:00
PyTorch MergeBot
e90148c91d Revert "[PT2][fusion] ban fusions with large accumulated reads (#157563)"
This reverts commit 4b9a6f7211.

Reverted https://github.com/pytorch/pytorch/pull/157563 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it might contribute to a string of OOM error in trunk ([comment](https://github.com/pytorch/pytorch/pull/157563#issuecomment-3064678929))
2025-07-12 04:52:11 +00:00
Xuan Zhang
4b9a6f7211 [PT2][fusion] ban fusions with large accumulated reads (#157563)
**Problem:**
Fusion can accumulate large amount of reads, which leads to significant increase in peak memory utilization. Imagine we have the following code snippet
```
total = torch.rand(N, N)
for _ in range(r):
    x = torch.rand(N, N)
    total = total + x
```
The default execution is memory efficient as only two tensors of size N-by-N is in memory at any given time. However, with fusion, the additions are fused into a single operation and the execution becomes something like:
```
x_1 = torch.rand(N, N)
x_2 =  torch.rand(N, N)
...
x_r = torch.rand(N, N)
total = x_1 + x_2 + ... + x_r
```
Though this is run-time efficient, in the case of large `N` and/or large `r`, this is not memory efficient.

[internal only] see [post](https://fb.workplace.com/groups/1075192433118967/permalink/1703374333634104/) for additional details

**Solution:**
Our proposed solution is to ban fusions in case where a large amount of reads are accumulated. This is in addition to some existing logics during torch compile.
* During lowering (i.e., `ir.py`), the config `realize_acc_reads_threshold`, which is default to be 8, controls _the number of_ buffers can be accumulated for a single operator. However, this is oblivious to the size of the buffers. Hence, we additionally introduce a config `realize_acc_reads_size_threshold` to control _the amount of buffers_ in size that can be accumulated.
* During scheduling (i.e., `scheduler.py`), additional fusion will be performed and thus we also need to capture such pattern there. The decisions are implemented under `choices.py`.

**Results:**
For a small example similar to be one in the test case (but with larger `N` and higher number of loop repeats), the memory snapshot before and after are shown below. Note the snapshot on the right is zoomed out so that the y-axis of the two snapshots match.

<img width="1328" alt="image" src="https://github.com/user-attachments/assets/670b5961-8454-4379-ae0f-62d4e7946c64" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157563
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-07-11 21:07:57 +00:00
Xuan Zhang
39456edbba [PT2][memory] mutation size correctness (#157562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157562
Approved by: https://github.com/yf225
2025-07-09 22:14:20 +00:00
PyTorch MergeBot
6defd5084e Revert "[PT2][memory] mutation size correctness (#157562)"
This reverts commit 86670b39fa.

Reverted https://github.com/pytorch/pytorch/pull/157562 on behalf of https://github.com/xuanzhang816 due to internal_test_failure ([comment](https://github.com/pytorch/pytorch/pull/157562#issuecomment-3053115025))
2025-07-09 15:38:29 +00:00
Xuan Zhang
86670b39fa [PT2][memory] mutation size correctness (#157562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157562
Approved by: https://github.com/yf225
2025-07-08 14:02:20 +00:00
Shunting Zhang
888110841c [inductor] don't fuse two nodes if likely increase peak memory (#138756)
Partially fixing https://github.com/pytorch/pytorch/issues/138685

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138756
Approved by: https://github.com/jansel
ghstack dependencies: #139136
2024-11-04 20:49:29 +00:00
Xuan Zhang
2980aed65b [inductor][memory] restructuring memory.py and turn on the flag (#137205)
Addressing additional comments given in PR https://github.com/pytorch/pytorch/pull/134874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137205
Approved by: https://github.com/eellison
2024-10-25 17:19:34 +00:00
Shunting Zhang
0a38c0ec89 [inductor] add a threshold for membw saving during fusion (#136782)
Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore that, by default, we can not fuse. But if loop ordering after fusion get kicked in, we can reorder loops and fuse those 2 nodes. We get 33% memory bandwidth savings .

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-22 00:50:00 +00:00
xinan.lin
190e09d8b6 [Inductor UT] Generalize device-bias code introduced from #134874 and (#136596)
[Inductor UT] Generalize device-bias code introduced from #134874 and fix unexpected success test cases.
Fix #136595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136596
Approved by: https://github.com/EikanWang, https://github.com/jansel

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-09-26 02:56:59 +00:00
Xuan Zhang
03957efa5d [inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874)
**Motivations**:
A topological order of the scheduler nodes that optimize the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf).

**Solutions**:
1. implement a peak memory estimator via liveness analysis
2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory

**Results**:
On some models we can reduce the peak memory significantly:
|             model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:-----------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| alexnet                       | 128        |         1.17         |       0.99      | 1.19  |
| vgg16                         | 64         |         4.10         |       3.57      | 1.15  |
| DebertaV2ForQuestionAnswering | 1          |         11.60        |      10.56      | 1.10  |

In the presence of compiler based AC, peak memory can be further reduced:
|              model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:------------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| AlbertForMaskedLM              | 4          |         6.87         |       6.43      | 1.07  |
| AlbertForQuestionAnswering     | 4          |         8.69         |       7.76      | 1.12  |
| MobileBertForQuestionAnswering | 128        |         4.67         |       3.90      | 1.20  |

[Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case.

**Other infos:**
* neutral model runtime, because the the reordering happens after fusion. So memory saving is _for free_.
* minimal compile time overhead as the algorithm is linear in the number of edges of the inductor graph. For all hugglingface benchmark models, the additional compile time is less than 1 second.
* no peak memory regression since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memories, but for large models, the working memory should be negligible. We haven't observed any significant regressions on all of our tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874
Approved by: https://github.com/yf225
2024-09-21 16:28:38 +00:00