Maggie Moss
eb83c3ca23
Clean up unused Pyrefly suppressions ( #166178 )
...
Cleaning up ignores that are no longer needed in the repo and adding select suppressions so the main branch is clean.
Test plan:
`lintrunner -a`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166178
Approved by: https://github.com/oulgen
2025-10-25 05:32:21 +00:00
Yuanyuan Chen
e595136187
Enable PLC1802 on ruff ( #165813 )
...
This PR enables the ruff check `PLC1802`, which detects unnecessary `len` calls on sequences in a boolean test context.
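For illustration (not taken from the diff), the pattern the rule flags and its fix:
```python
items = [1, 2, 3]

# Flagged by PLC1802: len() used only for its truthiness
if len(items):
    print("non-empty")

# Preferred: sequences are already falsy when empty
if items:
    print("non-empty")
```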
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165813
Approved by: https://github.com/ezyang
2025-10-18 05:44:14 +00:00
Maggie Moss
9944cac6e6
Add suppressions to torch/_inductor ( #165062 )
...
Adds suppressions so pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283
This directory is split across two PRs to keep each from being too large.
Test plan:
`dmypy restart && python3 scripts/lintrunner.py -a`
`pyrefly check`
Step 1: delete this directory's lines from the `project-excludes` field in `pyrefly.toml`
Step 2: run `pyrefly check`
Step 3: add suppressions and clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199
after:
INFO 0 errors (6,884 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062
Approved by: https://github.com/oulgen , https://github.com/mlazos
2025-10-09 20:34:20 +00:00
Yuanyuan Chen
35c4130fd1
[2/N] Fix ruff warnings ( #164460 )
...
Apply ruff `SIM` rules.
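As an illustration (not from the diff), one of the most common `SIM` fixes, SIM108, collapses an if/else assignment into a ternary:
```python
condition = True

# Before (flagged by SIM108):
if condition:
    x = 1
else:
    x = 2

# After:
x = 1 if condition else 2
```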
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460
Approved by: https://github.com/ezyang
2025-10-04 03:40:32 +00:00
IvanKobzarev
25c170b72e
[inductor] Runtime estimations: use nccl estimator; mm only benchmark mode ( #161405 )
...
During comms reordering, sink-waits-iterative showed that the previous runtime estimations were noticeably off for collectives and matmuls.
Adding optional usage of:
- `c10d.time_estimator` for collectives, which is based on the NCCL estimator
- Benchmark mode for matmuls only, as they are highly dependent on the mm backend; the logic is mostly copied from Ruisi's PRs for inductor simple_fsdp https://github.com/pytorch/pytorch/pull/157572
These estimation corrections are applied in the default `BaseSchedulerNode.estimate_runtime()`.
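A minimal sketch of this dispatch; the helper bodies and the numbers are placeholders rather than the actual inductor internals:
```python
def nccl_time_estimate_ns(snode) -> float:
    return 32_000.0  # stand-in for the NCCL-based c10d time estimator

def benchmark_mm_ns(snode) -> float:
    return 12_000.0  # stand-in for actually benchmarking the matmul

def analytic_estimate_ns(snode) -> float:
    return 1_000.0   # stand-in for the default analytic model

def estimate_runtime_ns(snode, kind: str) -> float:
    """Dispatch per node kind: NCCL estimator for collectives,
    benchmarking for matmuls, the analytic model otherwise."""
    if kind == "collective":
        return nccl_time_estimate_ns(snode)
    if kind == "mm":
        return benchmark_mm_ns(snode)
    return analytic_estimate_ns(snode)

print(estimate_runtime_ns(None, "mm"))  # 12000.0
```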
Differential Revision: [D81152294](https://our.internmc.facebook.com/intern/diff/D81152294 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161405
Approved by: https://github.com/eellison
2025-09-08 14:33:19 +00:00
IvanKobzarev
db44de4c0d
[inductor] Estimate peak memory allocfree and applying to reordering collectives ( #160113 )
...
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for `estimate_peak_memory` (a sketch follows this list):
```
"""
Alternative version of estimate_peak_memory that respects the fact
that every SchedulerNode has multiple phases:
1. alloc (outputs)
2. run_kernel
3. dealloc last_use buffers
estimate_peak_memory collapses memory into one value: size_alloc - size_free,
while peak memory happens right after alloc.
Duplicating the code so we don't have to migrate all callsites at once;
future usages of estimate_peak_memory will migrate to this version.
"""
```
- Applying this in the `reorder_communication_preserving_peak_memory` pass.
2. Buffers can change their deallocation point during reordering, if the candidate and the group to swap are both users of the f_input_buf and the group contains the last_use_snode.
- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory to respect the change in size_free (after reordering, the group node is no longer the last user of the buffer, so its size_free -= buffer_size, while the candidate becomes the last user and candidate.size_free += buffer_size).
3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` to limit the number of collectives to reorder, for ablation.
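A minimal sketch of the phase-aware estimate described in the docstring above, assuming nodes carry only `size_alloc` and `size_free` fields (illustrative, not the actual inductor code):
```python
from dataclasses import dataclass

@dataclass
class Node:
    size_alloc: int  # bytes allocated for outputs before the kernel runs
    size_free: int   # bytes of last-use buffers freed after the kernel

def estimate_peak_memory(nodes: list[Node]) -> int:
    curr = peak = 0
    for n in nodes:
        curr += n.size_alloc    # phase 1: alloc outputs
        peak = max(peak, curr)  # the peak is observed right after alloc...
        curr -= n.size_free     # phase 3: ...before last-use buffers are freed
    return peak

# Collapsing alloc/free into a single per-node delta would report 0 here,
# missing the transient high-water mark between phases 1 and 3.
print(estimate_peak_memory([Node(100, 100), Node(50, 50)]))  # 100
```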
Status after this PR:
Iterative recomputation of memory estimations matches the full memory estimations.
Active memory is not regressing much, but reserved memory regresses significantly.
Investigation and a fix for the "reserved" memory regression will follow in subsequent PRs.
BASELINE (bucketing AG and RS): active: 32GiB, reserved: 34GiB
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step: 1 loss: 12.2722 grad_norm: 4.2192 active_memory: 24.66GiB(25.96%) reserved_memory: 25.38GiB(26.72%) tps: 99 tflops: 5.71 mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step: 2 loss: 13.1738 grad_norm: 50.5566 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 4,448 tflops: 257.63 mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step: 3 loss: 15.6866 grad_norm: 80.0862 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,900 tflops: 341.72 mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step: 4 loss: 13.4853 grad_norm: 7.8538 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,881 tflops: 340.57 mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step: 5 loss: 16.1191 grad_norm: 53.2481 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,867 tflops: 339.77 mfu: 34.35%
```
REORDER: active: 32GiB, reserved: 36GiB
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step: 1 loss: 12.2490 grad_norm: 4.1944 active_memory: 24.66GiB(25.96%) reserved_memory: 26.81GiB(28.22%) tps: 85 tflops: 4.90 mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step: 2 loss: 13.1427 grad_norm: 39.5942 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 3,205 tflops: 185.61 mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step: 3 loss: 14.6084 grad_norm: 51.0743 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,688 tflops: 329.44 mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step: 4 loss: 13.6181 grad_norm: 8.1122 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,744 tflops: 332.68 mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step: 5 loss: 15.8913 grad_norm: 59.8510 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,046 tflops: 292.22 mfu: 29.55%
```
REORDER + SINK_WAITS_ITERATIVE: active: 35GiB, reserved: 41GiB
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step: 1 loss: 12.2646 grad_norm: 4.1282 active_memory: 27.60GiB(29.05%) reserved_memory: 32.49GiB(34.20%) tps: 173 tflops: 10.00 mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step: 2 loss: 13.2353 grad_norm: 42.4234 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,152 tflops: 356.26 mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step: 3 loss: 13.8205 grad_norm: 24.0156 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,169 tflops: 357.29 mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step: 4 loss: 13.1033 grad_norm: 9.1167 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,183 tflops: 358.10 mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step: 5 loss: 16.3530 grad_norm: 51.8118 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,130 tflops: 355.03 mfu: 35.90%
```
Differential Revision: [D80718143](https://our.internmc.facebook.com/intern/diff/D80718143 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab , https://github.com/eellison
Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-22 14:19:57 +00:00
PyTorch MergeBot
7006fd0c88
Revert "[inductor] Estimate peak memory allocfree and applying to reordering collectives ( #160113 )"
...
This reverts commit 517d38d340 .
Reverted https://github.com/pytorch/pytorch/pull/160113 on behalf of https://github.com/IvanKobzarev due to segment tree tests failing on trunk even though ciflows/trunk passed on the PR ([comment](https://github.com/pytorch/pytorch/pull/160113#issuecomment-3211286092 ))
2025-08-21 16:22:44 +00:00
IvanKobzarev
517d38d340
[inductor] Estimate peak memory allocfree and applying to reordering collectives ( #160113 )
...
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for `estimate_peak_memory`:
```
"""
Alternative version of estimate_peak_memory that respects the fact
that every SchedulerNode has multiple phases:
1. alloc (outputs)
2. run_kernel
3. dealloc last_use buffers
estimate_peak_memory collapses memory into one value: size_alloc - size_free,
while peak memory happens right after alloc.
Duplicating the code so we don't have to migrate all callsites at once;
future usages of estimate_peak_memory will migrate to this version.
"""
```
- Applying this in the `reorder_communication_preserving_peak_memory` pass.
2. Buffers can change their deallocation point during reordering, if the candidate and the group to swap are both users of the f_input_buf and the group contains the last_use_snode.
- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory to respect the change in size_free (after reordering, the group node is no longer the last user of the buffer, so its size_free -= buffer_size, while the candidate becomes the last user and candidate.size_free += buffer_size).
3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` to limit the number of collectives to reorder, for ablation.
Status after this PR:
Iterative recomputation of memory estimations matches the full memory estimations.
Active memory is not regressing much, but reserved memory regresses significantly.
Investigation and a fix for the "reserved" memory regression will follow in subsequent PRs.
BASELINE (bucketing AG and RS): active: 32GiB, reserved: 34GiB
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step: 1 loss: 12.2722 grad_norm: 4.2192 active_memory: 24.66GiB(25.96%) reserved_memory: 25.38GiB(26.72%) tps: 99 tflops: 5.71 mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step: 2 loss: 13.1738 grad_norm: 50.5566 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 4,448 tflops: 257.63 mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step: 3 loss: 15.6866 grad_norm: 80.0862 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,900 tflops: 341.72 mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step: 4 loss: 13.4853 grad_norm: 7.8538 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,881 tflops: 340.57 mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step: 5 loss: 16.1191 grad_norm: 53.2481 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,867 tflops: 339.77 mfu: 34.35%
```
REORDER: active: 32GiB, reserved: 36GiB
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step: 1 loss: 12.2490 grad_norm: 4.1944 active_memory: 24.66GiB(25.96%) reserved_memory: 26.81GiB(28.22%) tps: 85 tflops: 4.90 mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step: 2 loss: 13.1427 grad_norm: 39.5942 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 3,205 tflops: 185.61 mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step: 3 loss: 14.6084 grad_norm: 51.0743 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,688 tflops: 329.44 mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step: 4 loss: 13.6181 grad_norm: 8.1122 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,744 tflops: 332.68 mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step: 5 loss: 15.8913 grad_norm: 59.8510 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,046 tflops: 292.22 mfu: 29.55%
```
REORDER + SINK_WAITS_ITERATIVE: active: 35GiB, reserved: 41GiB
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step: 1 loss: 12.2646 grad_norm: 4.1282 active_memory: 27.60GiB(29.05%) reserved_memory: 32.49GiB(34.20%) tps: 173 tflops: 10.00 mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step: 2 loss: 13.2353 grad_norm: 42.4234 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,152 tflops: 356.26 mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step: 3 loss: 13.8205 grad_norm: 24.0156 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,169 tflops: 357.29 mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step: 4 loss: 13.1033 grad_norm: 9.1167 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,183 tflops: 358.10 mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step: 5 loss: 16.3530 grad_norm: 51.8118 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,130 tflops: 355.03 mfu: 35.90%
```
Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab , https://github.com/eellison
Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-21 15:45:06 +00:00
PyTorch MergeBot
bd5857a1d6
Revert "[inductor] Estimate peak memory allocfree and applying to reordering collectives ( #160113 )"
...
This reverts commit 9d18bf01b1 .
Reverted https://github.com/pytorch/pytorch/pull/160113 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but lots of failures showing up after this lands ([comment](https://github.com/pytorch/pytorch/pull/160113#issuecomment-3209487237 ))
2025-08-21 08:13:33 +00:00
IvanKobzarev
9d18bf01b1
[inductor] Estimate peak memory allocfree and applying to reordering collectives ( #160113 )
...
1. Applying @eellison's idea from https://github.com/pytorch/pytorch/pull/146562#discussion_r2059363672 for `estimate_peak_memory`:
```
"""
Alternative version of estimate_peak_memory that respects the fact
that every SchedulerNode has multiple phases:
1. alloc (outputs)
2. run_kernel
3. dealloc last_use buffers
estimate_peak_memory collapses memory into one value: size_alloc - size_free,
while peak memory happens right after alloc.
Duplicating the code so we don't have to migrate all callsites at once;
future usages of estimate_peak_memory will migrate to this version.
"""
```
- Applying this in the `reorder_communication_preserving_peak_memory` pass.
2. Buffers can change their deallocation point during reordering, if the candidate and the group to swap are both users of the f_input_buf and the group contains the last_use_snode.
- Addressing this by tracking the last_use_snode for each buffer and recomputing current memory to respect the change in size_free (after reordering, the group node is no longer the last user of the buffer, so its size_free -= buffer_size, while the candidate becomes the last user and candidate.size_free += buffer_size).
3. Adding env var `PYTORCH_REORDER_COLLECTIVES_LIMIT` to limit the number of collectives to reorder, for ablation.
Status after this PR:
Iterative recomputation of memory estimations matches the full memory estimations.
Active memory is not regressing much, but reserved memory regresses significantly.
Investigation and a fix for the "reserved" memory regression will follow in subsequent PRs.
BASELINE (bucketing AG and RS): active: 32GiB, reserved: 34GiB
```
[rank0]:[titan] 2025-08-11 11:28:36,798 - root - INFO - step: 1 loss: 12.2722 grad_norm: 4.2192 active_memory: 24.66GiB(25.96%) reserved_memory: 25.38GiB(26.72%) tps: 99 tflops: 5.71 mfu: 0.58%
[rank0]:[titan] 2025-08-11 11:28:38,640 - root - INFO - step: 2 loss: 13.1738 grad_norm: 50.5566 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 4,448 tflops: 257.63 mfu: 26.05%
[rank0]:[titan] 2025-08-11 11:28:40,029 - root - INFO - step: 3 loss: 15.6866 grad_norm: 80.0862 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,900 tflops: 341.72 mfu: 34.55%
[rank0]:[titan] 2025-08-11 11:28:41,423 - root - INFO - step: 4 loss: 13.4853 grad_norm: 7.8538 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,881 tflops: 340.57 mfu: 34.44%
[rank0]:[titan] 2025-08-11 11:28:42,820 - root - INFO - step: 5 loss: 16.1191 grad_norm: 53.2481 active_memory: 32.14GiB(33.83%) reserved_memory: 34.21GiB(36.01%) tps: 5,867 tflops: 339.77 mfu: 34.35%
```
REORDER: active: 32GiB, reserved: 36GiB
```
[rank0]:[titan] 2025-08-11 11:34:32,772 - root - INFO - step: 1 loss: 12.2490 grad_norm: 4.1944 active_memory: 24.66GiB(25.96%) reserved_memory: 26.81GiB(28.22%) tps: 85 tflops: 4.90 mfu: 0.50%
[rank0]:[titan] 2025-08-11 11:34:35,329 - root - INFO - step: 2 loss: 13.1427 grad_norm: 39.5942 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 3,205 tflops: 185.61 mfu: 18.77%
[rank0]:[titan] 2025-08-11 11:34:36,770 - root - INFO - step: 3 loss: 14.6084 grad_norm: 51.0743 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,688 tflops: 329.44 mfu: 33.31%
[rank0]:[titan] 2025-08-11 11:34:38,197 - root - INFO - step: 4 loss: 13.6181 grad_norm: 8.1122 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,744 tflops: 332.68 mfu: 33.64%
[rank0]:[titan] 2025-08-11 11:34:39,821 - root - INFO - step: 5 loss: 15.8913 grad_norm: 59.8510 active_memory: 32.14GiB(33.83%) reserved_memory: 36.40GiB(38.31%) tps: 5,046 tflops: 292.22 mfu: 29.55%
```
REORDER + SINK_WAITS_ITERATIVE: active: 35GiB, reserved: 41GiB
```
[rank0]:[titan] 2025-08-11 11:31:36,119 - root - INFO - step: 1 loss: 12.2646 grad_norm: 4.1282 active_memory: 27.60GiB(29.05%) reserved_memory: 32.49GiB(34.20%) tps: 173 tflops: 10.00 mfu: 1.01%
[rank0]:[titan] 2025-08-11 11:31:37,452 - root - INFO - step: 2 loss: 13.2353 grad_norm: 42.4234 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,152 tflops: 356.26 mfu: 36.02%
[rank0]:[titan] 2025-08-11 11:31:38,780 - root - INFO - step: 3 loss: 13.8205 grad_norm: 24.0156 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,169 tflops: 357.29 mfu: 36.13%
[rank0]:[titan] 2025-08-11 11:31:40,106 - root - INFO - step: 4 loss: 13.1033 grad_norm: 9.1167 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,183 tflops: 358.10 mfu: 36.21%
[rank0]:[titan] 2025-08-11 11:31:41,443 - root - INFO - step: 5 loss: 16.3530 grad_norm: 51.8118 active_memory: 35.08GiB(36.92%) reserved_memory: 41.62GiB(43.80%) tps: 6,130 tflops: 355.03 mfu: 35.90%
```
Differential Revision: [D79886535](https://our.internmc.facebook.com/intern/diff/D79886535 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160113
Approved by: https://github.com/wconstab , https://github.com/eellison
Co-authored-by: eellison <elias.ellison@gmail.com>
2025-08-21 05:19:38 +00:00
IvanKobzarev
3ced1079a4
[inductor] Fix collectives_reordering overwrite real_dep with fake_dep with the same name ( #158960 )
...
Differential Revision: [D78839734](https://our.internmc.facebook.com/intern/diff/D78839734 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158960
Approved by: https://github.com/wconstab
2025-07-24 11:08:58 +00:00
IvanKobzarev
eeb0783fe6
[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative ( #158062 )
...
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-17 20:04:42 +00:00
PyTorch MergeBot
4f36743f5e
Revert "[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative ( #158062 )"
...
This reverts commit 5a54db14e3 .
Reverted https://github.com/pytorch/pytorch/pull/158062 on behalf of https://github.com/clee2000 due to sorry, I want to revert something else and this is causing a merge conflict; all you should need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/158062#issuecomment-3074342140 ))
2025-07-15 16:31:13 +00:00
IvanKobzarev
5a54db14e3
[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative ( #158062 )
...
Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158062
Approved by: https://github.com/wconstab
2025-07-15 14:27:57 +00:00
Xuehai Pan
7f14b42adf
[BE][2/16] fix typos in torch/ (torch/_*/) ( #156312 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 05:47:06 +00:00
PyTorch MergeBot
e15f4248ad
Revert "[BE][2/16] fix typos in torch/ (torch/_*/) ( #156312 )"
...
This reverts commit 7a92b51196 .
Reverted https://github.com/pytorch/pytorch/pull/156312 on behalf of https://github.com/XuehaiPan due to landrace ([comment](https://github.com/pytorch/pytorch/pull/156312#issuecomment-3064672250 ))
2025-07-12 04:40:52 +00:00
Xuehai Pan
7a92b51196
[BE][2/16] fix typos in torch/ (torch/_*/) ( #156312 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 01:47:22 +00:00
IvanKobzarev
8134684d44
[inductor collectives] sink waits iterative ( #157708 )
...
Differential Revision: [D77861763](https://our.internmc.facebook.com/intern/diff/D77861763 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157708
Approved by: https://github.com/wconstab
ghstack dependencies: #157706
2025-07-08 07:17:10 +00:00
IvanKobzarev
2fde2090d0
[inductor_collectives] Make reorder_collectives_preserve_peak pass grouping nodes ( #157706 )
...
Differential Revision: [D77861765](https://our.internmc.facebook.com/intern/diff/D77861765 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157706
Approved by: https://github.com/wconstab
2025-07-07 23:13:58 +00:00
Will Constable
382598ef87
Fix unsafe collective reorder past wait ( #157489 )
...
Covers the case where the output of one collective feeds the input of another collective.
e.g. TP + FSDP: all_gather(tp+dp-sharded param on TP dim) -> all_gather(dp-sharded buffer on DP dim)
Fixes a bug where the reordering pass specifically exempted wait nodes from dependencies.
Note: this exemption was incorrect, so it should be removed. But it was also put there for a reason: to help move collectives past wait nodes that are not related to that collective. After this fix, reordering performance may be worse, and we need to find a smarter way to decide whether a particular wait node is a blocker for a given collective.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157489
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #156879
2025-07-03 05:04:19 +00:00
Will Constable
dc524efb4d
Move logging into inner method for reorder pass ( #156879 )
...
The reason for the inner/outer split is to keep the outer method conforming
to the typedef for a comms graph pass, which returns one object, while
allowing unit tests to call the inner method, which returns extra metadata
useful for testing the pass. The logging belongs in the inner part so that
it also runs during unit testing.
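A sketch of the shape being described, with assumed names (the real pass and its metadata differ):
```python
def _reorder_inner(snodes: list) -> tuple[list, dict]:
    stats = {"moves": 0}  # metadata that unit tests can assert on
    # ... perform the reordering, updating stats and logging as we go ...
    return snodes, stats

def reorder_pass(snodes: list) -> list:
    # Outer wrapper conforms to the comms-pass typedef: one object in, one out.
    reordered, _stats = _reorder_inner(snodes)
    return reordered
```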
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156879
Approved by: https://github.com/IvanKobzarev
2025-07-03 05:04:19 +00:00
Shawn Xu
0364db7cd1
[PT] support custom all_gather and reduce_scatter comms ( #155189 )
...
Summary:
This change introduces two comm override APIs, `set_custom_all_gather` and `set_custom_reduce_scatter`, to allow custom behavior for each collective.
This allows users to control how the comm buffers are allocated and the exact comm implementation, for flexibility.
For details, see docstring in `Comm` in `_fsdp_api.py`
Related PR:
https://github.com/pytorch/pytorch/pull/150564
Test Plan: CI
Differential Revision: D75714362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155189
Approved by: https://github.com/weifengpy
2025-07-02 06:58:45 +00:00
Xuehai Pan
6ff6630375
[BE][3/16] fix typos in torch/ (torch/_inductor/) ( #156313 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-23 02:57:12 +00:00
PyTorch MergeBot
f1331f3f1b
Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) ( #156313 )"
...
This reverts commit 3627270bdf .
Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912 ) [HUD commit link](c95f7fa874 ) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213 ))
2025-06-22 12:31:57 +00:00
Xuehai Pan
3627270bdf
[BE][3/16] fix typos in torch/ (torch/_inductor/) ( #156313 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-22 08:43:09 +00:00
Luca Wehrstedt
0a0023d984
Enable NCCL zero-copy (user buffer registration) for FSDP2 ( #150564 )
...
In recent versions, NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce the resource footprint of NCCL's kernels (which reduces contention).
This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.
FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.
This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501 , https://github.com/weifengpy , https://github.com/syed-ahmed
2025-06-17 12:54:58 +00:00
Will Constable
0a6b66c881
Inductor comms reorder logs to tlparse ( #155737 )
...
Hacked test_inductor_collectives test to demonstrate this works:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/whc/de50ff33-f460-406b-bfa9-457e6e17395b/custom/-_0_0_0/reorder_communication_preserving_peak_memory_9.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Follow-up: it would be nice to move the logging out of this pass and
into the broader comms pass loop, where the before/after visualization
for each pass could be logged into the same tlparse file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155737
Approved by: https://github.com/bdhirsh
2025-06-13 02:59:42 +00:00
Xuan Zhang
72453a6676
[PT2][comms] put visualize_overlap in a try-except block ( #155222 )
...
Summary:
For simple FSDP, the `visualize_overlap` function is throwing errors.
This looks like an oversight: `visualize_overlap` is called twice here, once inside a try-except and once not, so this change wraps both call sites the same way.
Test Plan:
:)
Reviewed By: Microve
Differential Revision: D75985733
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155222
Approved by: https://github.com/yf225
2025-06-05 23:39:48 +00:00
Will Constable
136ee4c81b
Make assertion about pass callable print the bad pass ( #152654 )
...
If you pass an invalid string, now you can easily see what it is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152654
Approved by: https://github.com/eellison
2025-05-05 18:07:43 +00:00
Will Constable
1cd68c59dd
Remove incorrect assertion ( #152653 )
...
It's only aspirational that the 'improvement' value is positive. In fact,
the pass could make a collective more exposed, and we shouldn't assert
here in that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152653
Approved by: https://github.com/eellison
ghstack dependencies: #152565
2025-05-03 02:33:58 +00:00
Will Constable
53bf174626
Fix assertion in reorder_communication_preserving_peak_memory ( #152565 )
...
>=0 is practically correct because we do model the runtime of some ops as 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152565
Approved by: https://github.com/eellison
2025-05-01 06:40:04 +00:00
Will Constable
a4a771648a
[pt2d] Add reorder_comms_preserving_peak_memory pass ( #146562 )
...
This is a new pass to replace the pre-existing passes. It has the same
basic goal, to achieve communication overlap (latency hiding), but also
constrains the solution to not increase peak memory.
The principles of operation are detailed in code comments, but
summarized here:
- never reorder collectives relative to each other (TBD if we should
relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does
not increase curr memory beyond peak memory (see the sketch after this list)
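Below is a toy sketch of that forward-move rule under a simplified memory model where each node is reduced to its net memory delta in bytes; the names and the model are assumptions, not the actual pass:
```python
def memory_curve(deltas: list[int]) -> list[int]:
    curr, curve = 0, []
    for d in deltas:
        curr += d
        curve.append(curr)  # current memory after each node
    return curve

def move_earlier(deltas: list[int], i: int, peak_limit: int) -> int:
    """Move node i earlier one slot at a time while no point on the
    memory curve exceeds peak_limit; returns the number of moves."""
    moves = 0
    while i > 0:
        trial = deltas[:]
        trial[i - 1], trial[i] = trial[i], trial[i - 1]
        if max(memory_curve(trial)) > peak_limit:
            break  # limiting factor: peak memory
        deltas[:] = trial
        i -= 1
        moves += 1
    return moves

deltas = [50, -50, 10, 20, -30]        # toy schedule of net deltas
peak = max(memory_curve(deltas))       # 50
print(move_earlier(deltas, 3, peak))   # 1: one safe move, then peak blocks
```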
The pass logs a summary table for each graph to TORCH_LOGS=overlap.
e.g. (exact format may have been tweaked but this shows the idea).
```
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node initial exposed final exposed improvement limiting factor moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ----------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------- --------------- ------------- ------------------- -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns) 12141.6 6514.53 5627.08 prefetch limit 75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns) 32265.8 28429.2 3836.61 data dependency 78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns) 10800.6 10732.3 68.254 peak memory 1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns) 10809.5 10809.5 0 data dependency 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060 , #146561
2025-04-29 22:51:31 +00:00
Will Constable
4b61564252
Include CollectiveKernel in inductor debug visualization ( #146561 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146561
Approved by: https://github.com/eellison
ghstack dependencies: #152060
2025-04-29 00:53:38 +00:00
Will Constable
119f64d0eb
Add 'step' counter to visualize_overlap log ( #152060 )
...
Example of log after the change:
```
[rank0]:V0227 15:07:20.704000 1594243 torch/_inductor/comms.py:621] [0/0] [__overlap] ==== Visualize overlap after reordering pass <function group_copy_collective at 0x7f41c1922050> (ran in 0.026380538940429688 sec)====
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 0: GroupedSchedulerNode(name='op6_op7') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 1: GroupedSchedulerNode(name='op55_op56') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.705000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 2: GroupedSchedulerNode(name='op75_op76') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 3: GroupedSchedulerNode(name='op121_op122') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 4: GroupedSchedulerNode(name='op141_op142') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 5: GroupedSchedulerNode(name='op187_op188') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.706000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 6: GroupedSchedulerNode(name='op207_op208') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 7: GroupedSchedulerNode(name='op253_op254') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 8: GroupedSchedulerNode(name='op273_op274') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
[rank0]:V0227 15:07:20.707000 1594243 torch/_inductor/comms.py:569] [0/0] [__overlap] 9: GroupedSchedulerNode(name='op319_op320') (size=[512], stride=[1]), (size=[4096], stride=[1]) () (0 ns)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152060
Approved by: https://github.com/eellison
2025-04-28 23:23:21 +00:00
Simon Fan
159d8a14a6
[inductor][comms] fix node_summary for composite scheduler nodes ( #150258 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150258
Approved by: https://github.com/yf225
2025-04-16 10:18:33 +00:00
Xuehai Pan
1cb4e2df65
[BE][PYFMT] migrate PYFMT for torch._inductor to ruff format ( #144550 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550
Approved by: https://github.com/jansel
2025-02-28 13:33:19 +00:00
Aaron Orenstein
893ca1dfe1
PEP585 update - torch/_inductor/[_-i]* ( #145137 )
...
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137
Approved by: https://github.com/bobrenjc93
2025-01-19 01:22:47 +00:00
Will Feng
be5b342332
[Inductor] Move peak memory pass and overlap pass to be run at the right place ( #142822 )
...
This PR moves `decide_global_ordering_of_comms` to run before all other Inductor scheduler passes, so that downstream passes have correct dependency-tracking info. It also moves the peak memory pass and the overlap pass to the end of all passes, because they need to be the final decision makers on node order to achieve the desired peak memory and overlap.
This PR fixes hard-to-debug peak memory pass errors caused by incorrect tracking in `.unmet_dependencies` during the enablement of SimpleFSDP on internal models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142822
Approved by: https://github.com/eellison
2024-12-14 06:53:02 +00:00
Tom Ritchford
da67a6a7bb
[inductor] Replace set by OrderedSet ( #138466 )
...
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing
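The idea, sketched minimally (the real `OrderedSet` in the PyTorch tree is more complete): an insertion-ordered set backed by `dict`, which preserves insertion order, so iteration, and anything derived from it, is deterministic:
```python
class OrderedSet:
    """Set with deterministic (insertion) iteration order, backed by dict."""

    def __init__(self, items=()):
        self._d = dict.fromkeys(items)

    def add(self, item):
        self._d[item] = None

    def __contains__(self, item):
        return item in self._d

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

s = OrderedSet(["b", "a", "b"])
s.add("c")
print(list(s))  # ['b', 'a', 'c'], unlike builtin set's unspecified order
```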
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
Xuan Zhang
ed388394d1
add torchrec collectives to enforce global ordering ( #141970 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141970
Approved by: https://github.com/yf225
2024-12-11 02:45:24 +00:00
Andrew Gu
78425bff30
[FSDP2] Move to public torch.distributed.fsdp ( #141868 )
...
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
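A typical usage sketch with the new import path (a real run additionally needs an initialized process group and appropriate devices):
```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

model = nn.Transformer()
for block in model.encoder.layers:
    fully_shard(block)   # shard each encoder block's parameters
fully_shard(model)       # shard the remaining top-level parameters
```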
**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`
Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501 , https://github.com/wconstab , https://github.com/weifengpy , https://github.com/fegin , https://github.com/XilunWu
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-07 01:24:28 +00:00
PyTorch MergeBot
bab15df40a
Revert "[FSDP2] Move to public torch.distributed.fsdp ( #141868 )"
...
This reverts commit 45583a5df9 .
Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180 ))
2024-12-06 18:38:12 +00:00
Andrew Gu
45583a5df9
[FSDP2] Move to public torch.distributed.fsdp ( #141868 )
...
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.
**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501 , https://github.com/wconstab , https://github.com/weifengpy
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-05 03:04:01 +00:00
Jason Ansel
6eca0aee76
[inductor] Refactor ir.Layout into ir.OutputSpec ( #140910 )
...
This separates the concepts of a Layout (size/stride/etc.) and an OutputSpec (which can include multiple outputs), which should make typing easier.
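Illustratively, with assumed field names rather than the actual inductor classes, the split looks like this:
```python
from dataclasses import dataclass

@dataclass
class Layout:
    # Physical layout of a single tensor
    size: tuple[int, ...]
    stride: tuple[int, ...]

@dataclass
class OutputSpec:
    # What a node produces: possibly several tensors, each with its own layout
    layouts: list[Layout]

spec = OutputSpec([Layout((4, 4), (4, 1)), Layout((4,), (1,))])
print(len(spec.layouts))  # 2
```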
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140910
Approved by: https://github.com/ezyang
ghstack dependencies: #140895
2024-11-21 20:01:57 +00:00
Jason Ansel
808f0f656d
[inductor] Refactor MutableBox to make IRNode typing easier ( #140895 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang , https://github.com/Skylion007
2024-11-20 19:50:46 +00:00
PyTorch MergeBot
d472a5f680
Revert "[inductor] Refactor MutableBox to make IRNode typing easier ( #140895 )"
...
This reverts commit c79e78b503 .
Reverted https://github.com/pytorch/pytorch/pull/140895 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think test_torchbind_inductor is failing in trunk after this lands ([comment](https://github.com/pytorch/pytorch/pull/140895#issuecomment-2484679319 ))
2024-11-19 04:25:41 +00:00
Jason Ansel
c79e78b503
[inductor] Refactor MutableBox to make IRNode typing easier ( #140895 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang , https://github.com/Skylion007
2024-11-19 00:24:35 +00:00
Will Feng
94d2471d1f
[Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA ( #133730 )
...
Using `fsdp.set_` for the unsharded_param in-place update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics: if the user explicitly stores an alias of the unsharded_param during execution of their module code, that alias is updated correctly when the unsharded_param is written via copy_, whereas if we just swap out the unsharded_param storage via set_, that user-saved alias is not updated, which is not good.
This PR also implements the graph pass to remove the resizes and the copy when a resize_(full) -> copy_ -> resize_(0) pattern is present.
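The alias-semantics point can be reproduced in a few lines of eager PyTorch (illustrative; a plain tensor stands in for the unsharded param):
```python
import torch

p = torch.zeros(4)
alias_p = p.view_as(p)   # a distinct tensor object sharing p's storage
p.copy_(torch.ones(4))   # writes into the existing storage
print(alias_p)           # tensor([1., 1., 1., 1.]): the alias sees the update

q = torch.zeros(4)
alias_q = q.view_as(q)   # a distinct tensor object sharing q's storage
q.set_(torch.ones(4))    # rebinds q's storage instead of writing into it
print(alias_q)           # tensor([0., 0., 0., 0.]): the alias is now stale
```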
------
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
2024-09-11 23:01:05 +00:00
Oguz Ulgen
09f9c256ad
Add basic mypy annotations to inductor ( #132416 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan , https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-04 18:43:37 +00:00
PyTorch MergeBot
f2ddd5e9e0
Revert "Add basic mypy annotations to inductor ( #132416 )"
...
This reverts commit 78927d37f6 .
Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785 ))
2024-08-04 18:39:29 +00:00