PyTorch MergeBot
8f40a0c634
Revert "address DDE in matmul decomp ( #166541 )"
...
This reverts commit 90519402c2 .
Reverted https://github.com/pytorch/pytorch/pull/166541 on behalf of https://github.com/atalman due to breaks internal test ([comment](https://github.com/pytorch/pytorch/pull/166541#issuecomment-3469382334 ))
2025-10-30 18:11:33 +00:00
Laith Sakka
90519402c2
address DDE in matmul decomp ( #166541 )
...
Address https://github.com/pytorch/pytorch/issues/165081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166541
Approved by: https://github.com/mlazos
2025-10-30 03:19:29 +00:00
Maggie Moss
31e42eb732
Fix pyrefly ignore syntax ( #166438 )
...
Reformats pyrefly ignore suppressions so they only ignore one error code.
pyrefly check
lintrunner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166438
Approved by: https://github.com/Skylion007
2025-10-29 00:02:21 +00:00
Gufan Yin
e6ba4d0725
Back out "Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed ( #164939 )" ( #165910 )
...
Summary:
Original commit changeset: d6d62d0c96dd
Original Phabricator Diff: D84468451 and D84613184
D84468451 caused CUDA OutOfMemoryError in model.
Test Plan:
D84468451 was found through bisect. Also double checked on recent trunk 9866939225248c2adc307be7a804b26db0b9b555: f815887517
With this diff that backs out D84468451 and D84613184 : f816114560
Differential Revision: D85025378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165910
Approved by: https://github.com/clee2000
2025-10-21 16:36:38 +00:00
Yuanyuan Chen
fdab48a7c1
Enable all PIE rules on ruff ( #165814 )
...
This PR enables all PIE rules in ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
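For illustration, a minimal sketch of the patterns these two rules flag. The names `Color`, `squares_flagged`, and `squares_fixed` are hypothetical and not taken from the PyTorch codebase.

```python
from enum import Enum


class Color(Enum):
    RED = 1
    CRIMSON = 1  # PIE796 flags this: the duplicate value makes CRIMSON an alias of RED
    BLUE = 2


# Iterating an Enum skips aliases, so only the distinct members appear.
member_names = [member.name for member in Color]


def squares_flagged(n):
    # PIE808 flags the explicit 0: it is already the default start of range().
    return [i * i for i in range(0, n)]


def squares_fixed(n):
    # Same behavior without the redundant start argument.
    return [i * i for i in range(n)]
```

Both `squares_*` functions return the same list; the rule only removes noise. The enum case is more likely to hide a real mistake, since the second name silently becomes an alias.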
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
PyTorch MergeBot
24520b8386
Revert "Enable all PIE rules on ruff ( #165814 )"
...
This reverts commit c79dfdc655 .
Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863 ))
2025-10-18 07:21:08 +00:00
Yuanyuan Chen
c79dfdc655
Enable all PIE rules on ruff ( #165814 )
...
This PR enables all PIE rules in ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
Edward Yang
08f09d9543
Ensure rms_norm decomp generates add.Scalar for pattern match BC ( #165437 )
...
Summary: Apparently if I just do `tensor + eps` this turns into add.Tensor, which is bad because the constant Tensor ends up getting hoisted into an input, which is a bozo thing to do. Just make sure it's exactly compatible.
Test Plan:
```
buck run 'fbcode//mode/opt' fbcode//bolt/nn/executorch/backends/tests:qnn_test_ar1g1 bolt.nn.executorch.backends.tests.qnn_test_ar1g1.QnnTestAR1G1.test_RMSNorm
```
Reviewed By: tugsbayasgalan
Differential Revision: D84613184
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165437
Approved by: https://github.com/tugsbayasgalan
2025-10-14 19:56:37 +00:00
Yuanyuan Chen
8de85896e0
Enable ruff rule E721 ( #165162 )
...
`E721` checks for object type comparisons using == and other comparison operators. This is useful because it is recommended to use `is` for type comparisons.
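A small sketch of what the rule targets, with hypothetical helper names. The first function shows the flagged form; the other two show the usual replacements.

```python
def is_exact_int_flagged(value):
    # E721 flags this: comparing type objects with == instead of `is`.
    return type(value) == int


def is_exact_int(value):
    # Identity comparison: True only when the type is exactly int.
    return type(value) is int


def is_int_like(value):
    # isinstance also accepts subclasses of int, such as bool.
    return isinstance(value, int)
```

Note that `is` and `isinstance` are not interchangeable: `is_exact_int(True)` is False while `is_int_like(True)` is True, so the right replacement depends on whether subclasses should match.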
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165162
Approved by: https://github.com/Skylion007
2025-10-13 01:48:55 +00:00
PyTorch MergeBot
816fb7f48d
Revert "Enable ruff rule E721 ( #165162 )"
...
This reverts commit 9e7c19f72b .
Reverted https://github.com/pytorch/pytorch/pull/165162 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165162#issuecomment-3393328271 ))
2025-10-11 13:25:40 +00:00
Yuanyuan Chen
9e7c19f72b
Enable ruff rule E721 ( #165162 )
...
`E721` checks for object type comparisons using == and other comparison operators. This is useful because it is recommended to use `is` for type comparisons.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165162
Approved by: https://github.com/Skylion007
2025-10-11 06:43:53 +00:00
Edward Z. Yang
de8d81275a
Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed ( #164939 )
...
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition. You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now preserve the fused call by default.
This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.
Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
2025-10-11 01:03:55 +00:00
PyTorch MergeBot
5c3fe9fb30
Revert "Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed ( #164939 )"
...
This reverts commit a6fa4f9c28 .
Reverted https://github.com/pytorch/pytorch/pull/164939 on behalf of https://github.com/izaitsevfb due to introduces numeric issues internally, see [D84326613](https://www.internalfb.com/diff/D84326613 ) ([comment](https://github.com/pytorch/pytorch/pull/164939#issuecomment-3392203314 ))
2025-10-10 20:21:12 +00:00
Edward Z. Yang
a6fa4f9c28
Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed ( #164939 )
...
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition. You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now preserve the fused call by default.
This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.
Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
2025-10-10 00:15:00 +00:00
PyTorch MergeBot
06d86e58d0
Revert "Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed ( #164939 )"
...
This reverts commit d40a9bfb8d .
Reverted https://github.com/pytorch/pytorch/pull/164939 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164939#issuecomment-3385056722 ))
2025-10-09 09:50:59 +00:00
Edward Z. Yang
d40a9bfb8d
Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed ( #164939 )
...
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition. You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now preserve the fused call by default.
This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.
Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164573
2025-10-09 04:49:44 +00:00
Laith Sakka
7158aa22e8
remove more ( #164753 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164753
Approved by: https://github.com/aorenste , https://github.com/mlazos
ghstack dependencies: #164664 , #164665 , #164667 , #164668
2025-10-08 14:23:38 +00:00
Maggie Moss
086dec3235
Pyrefly suppressions 6/n ( #164877 )
...
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283
Almost there!
Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check
step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199
after:
INFO 0 errors (5,064 ignored)
Only four directories left to enable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164877
Approved by: https://github.com/oulgen
2025-10-08 02:30:57 +00:00
Maggie Moss
1051c1de5c
Add pyrefly suppressions 2/n ( #164513 )
...
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283
Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check
---
step 1: uncomment lines in the `pyrefly.toml` file
before: https://gist.github.com/maggiemoss/911b4d0bc88bf8cf3ab91f67184e9d46
after:
```
INFO Checking project configured at `/Users/maggiemoss/python_projects/pytorch/pyrefly.toml`
INFO 0 errors (1,152 ignored)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164513
Approved by: https://github.com/oulgen
2025-10-03 02:46:13 +00:00
Yuanyuan Chen
a43c4c3972
[5/N] Apply ruff UP035 rule ( #164423 )
...
Continued code migration to enable ruff `UP035`. Most changes are about moving `Callable` from `typing` to `from collections.abc`.
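A minimal sketch of the migrated import style; `apply_twice` is a hypothetical function used only to show the annotation.

```python
# Flagged by UP035 (deprecated import location):
#   from typing import Callable
# Preferred form:
from collections.abc import Callable


def apply_twice(fn: Callable[[int], int], value: int) -> int:
    """Apply fn to value two times."""
    return fn(fn(value))
```

The annotation reads the same either way; only the import location changes.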
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164423
Approved by: https://github.com/ezyang
2025-10-02 07:31:11 +00:00
Pian Pawakapan
474d07554a
[dynamic shapes] unbacked-safe slicing ( #161414 )
...
Summary:
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.
Test Plan:
contbuild & OSS CI, see 56218d85e2
Rollback Plan:
Differential Revision: D80948073
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161414
Approved by: https://github.com/laithsakka
2025-09-30 01:15:19 +00:00
can-gaa-hou
eb4361a801
[Fix] Adding missing f prefixes to formatted strings [1/N] ( #164065 )
...
As stated in the title.
* #164068
* #164067
* #164066
* __->__ #164065
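For illustration, the class of bug being fixed looks roughly like this (hypothetical function names, not code from the PR):

```python
def describe_broken(name, size):
    # Missing f prefix: the braces are emitted literally, not interpolated.
    return "tensor {name} has size {size}"


def describe_fixed(name, size):
    # With the f prefix the placeholders are filled in.
    return f"tensor {name} has size {size}"
```

The broken version raises no error, which is why such strings tend to survive in error messages until someone reads the output.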
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164065
Approved by: https://github.com/Skylion007
2025-09-29 04:53:00 +00:00
Jason Ansel
d746b987d8
[inductor] Fix divmod error in decomp ( #163482 )
...
Fixes #163457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163482
Approved by: https://github.com/eellison
ghstack dependencies: #163386 , #163398 , #163387 , #163414 , #163415 , #163419 , #163434 , #163393 , #163412 , #163422 , #163481 , #163520
2025-09-24 02:52:36 +00:00
Colin Peppler
3c8b90542c
support unbacked softmax / logsoftmax ( #162216 )
...
### DDE
```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*u0, 0) (unhinted: Eq(3*u0, 0)). (Size-like symbols: u0)
Caused by: (_decomp/decompositions.py:1185 in _softmax)
```
```
torch._dynamo.exc.UserError: Could not guard on data-dependent expression Eq(u0, 0) (unhinted: Eq(u0, 0)). (Size-like symbols: u0)
Caused by: logsoft = torch.nn.functional.log_softmax(nz, dim=0) # test/inductor/test_unbacked_symints.py:573 in fn (_decomp/decompositions.py:1212 in _log_softmax)
```
```
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(u0, 0) (unhinted: Ne(u0, 0)). (Size-like symbols: u0)
Caused by: (_refs/__init__.py:2218 in _reduction)
```
### Cannot convert symbols to int
```
File "torch/_inductor/lowering.py", line 7160, in prepare_softmax_online
and V.graph.sizevars.size_hint(rnumel) >= config.unroll_reductions_threshold
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "orch/_inductor/sizevars.py", line 591, in size_hint
return int(out)
^^^^^^^^
File "sympy/core/expr.py", line 342, in __int__
raise TypeError("Cannot convert symbols to int")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162216
Approved by: https://github.com/laithsakka , https://github.com/eellison
2025-09-18 15:43:20 +00:00
Shaobin Ma
63276edb7c
[Inductor] support mixed dtype in the native_layer_norm_backward meta function ( #159830 )
...
Fixes #159829
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159830
Approved by: https://github.com/albanD
2025-09-17 20:29:12 +00:00
PyTorch MergeBot
00e9ba75cd
Revert "[indexing] Prevent integer overflow from large step values in C++ ( #161707 )"
...
This reverts commit c140bf217f .
Reverted https://github.com/pytorch/pytorch/pull/161707 on behalf of https://github.com/huydhn due to Look like there is a land race as lots of jobs are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/161707#issuecomment-3283980465 ))
2025-09-12 06:49:36 +00:00
thenumberouscode
c140bf217f
[indexing] Prevent integer overflow from large step values in C++ ( #161707 )
...
Fixes https://github.com/pytorch/pytorch/issues/160868
hmmm, I found an existing fix PR after I've finished this one. For reference, the old PR was
https://github.com/pytorch/pytorch/pull/147433/files .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161707
Approved by: https://github.com/leslie-fang-intel , https://github.com/CaoE , https://github.com/mlazos
2025-09-12 03:16:23 +00:00
PyTorch MergeBot
3f1a97a99c
Revert "[dynamic shapes] unbacked-safe slicing ( #157944 )"
...
This reverts commit 44549c7146 .
Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/pianpwk due to this PR & internal diff landed out of sync, just reverted internal with D80720654, will revert this & reland as codev ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3215610135 ))
2025-08-22 20:48:46 +00:00
Pian Pawakapan
44549c7146
[dynamic shapes] unbacked-safe slicing ( #157944 )
...
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-20 22:52:56 +00:00
PyTorch MergeBot
6ea4be1e2e
Revert "[dynamic shapes] unbacked-safe slicing ( #157944 )"
...
This reverts commit 2f0cba934d .
Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/seemethere due to This is blocking internal sync due to merge conflicts ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3206833193 ))
2025-08-20 15:16:45 +00:00
Pian Pawakapan
2f0cba934d
[dynamic shapes] unbacked-safe slicing ( #157944 )
...
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-19 17:32:47 +00:00
PyTorch MergeBot
5e98d9f9ba
Revert "[dynamic shapes] unbacked-safe slicing ( #157944 )"
...
This reverts commit 56218d85e2 .
Reverted https://github.com/pytorch/pytorch/pull/157944 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think this is failing test_draft_export in trunk 56218d85e2 ([comment](https://github.com/pytorch/pytorch/pull/157944#issuecomment-3198874677 ))
2025-08-19 01:16:17 +00:00
Pian Pawakapan
56218d85e2
[dynamic shapes] unbacked-safe slicing ( #157944 )
...
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157944
Approved by: https://github.com/laithsakka
2025-08-18 22:38:16 +00:00
Laith Sakka
f782c790df
migrate more simple gso checks ( #160253 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160253
Approved by: https://github.com/bobrenjc93
2025-08-16 00:15:24 +00:00
Colin Peppler
46d34d6766
(should_fold) gso to guard_or_false when checking folding whether to 3d bmm into 2d mm ( #159184 )
...
Switch from `guard_size_oblivious` to `guard_or_false`. If the check hits a DDE (data-dependent error), it now evaluates to False, which avoids folding this 3d bmm into a 2d mm.
806d9e3fe7/torch/_decomp/decompositions.py (L4506-L4512)
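The intended behavior can be sketched with a toy model in plain Python. Everything here is an assumption for illustration: `UNKNOWN`, `strict_guard`, `guard_or_false_model`, and `choose_matmul_path` are hypothetical stand-ins, not the real `torch.fx.experimental.symbolic_shapes` API or the real `should_fold` logic.

```python
class DataDependentError(Exception):
    """Toy stand-in for GuardOnDataDependentSymNode."""


# Hypothetical marker meaning "the answer depends on an unbacked symbol like u0".
UNKNOWN = object()


def strict_guard(answer):
    """Toy guard that must decide: raises when the answer is unknown."""
    if answer is UNKNOWN:
        raise DataDependentError("could not guard on data-dependent expression")
    return bool(answer)


def guard_or_false_model(answer):
    """Toy model of guard_or_false: an unknown answer counts as False."""
    return False if answer is UNKNOWN else bool(answer)


def choose_matmul_path(fold_condition, guard):
    """Fold into a 2d mm only when the guard reports that the condition holds."""
    return "mm" if guard(fold_condition) else "bmm"


print(choose_matmul_path(True, guard_or_false_model))
print(choose_matmul_path(UNKNOWN, guard_or_false_model))
```

Folding is an optimization, so declining to fold when the condition cannot be decided is still correct; it just takes the unfolded path.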
## DDE
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
elif should_fold(tensor1, tensor2, is_out):
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4472, in should_fold
if guard_size_oblivious(t1.numel() == 0):
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(12*((u0//2)), 0) (unhinted: Eq(12*((u0//2)), 0)). (Size-like symbols: none)
Caused by: (_decomp/decompositions.py:4472 in should_fold)
```
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
elif should_fold(tensor1, tensor2, is_out):
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4483, in should_fold
return all(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*((u0//2)), 3) (unhinted: Eq(3*((u0//2)), 3)). (Size-like symbols: none)
Caused by: (_decomp/decompositions.py:4483 in should_fold)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159184
Approved by: https://github.com/ezyang
ghstack dependencies: #158894
2025-07-30 03:12:14 +00:00
Pian Pawakapan
48fe4ff247
[export] set enable_gqa in export flash->math decomp ( #158604 )
...
Differential Revision: D78524147
For `scaled_dot_product_attention(..., enable_gqa=True)`:
- the Math backend passes the flag through, performing the extra [KV broadcast](6e07d6a0ff/aten/src/ATen/native/transformers/attention.cpp (L902) ) if set to True
- the Flash backend has no flag, and relies on correct indexing in the C++ kernel
- Export used to default to Math for `enable_gqa=True`, but https://github.com/pytorch/pytorch/pull/157893 landed and enabled Flash. At the same time, there's an export-only [decomp](6e07d6a0ff/torch/_decomp/decompositions.py (L4968) ) redirecting flash -> math, calling with `enable_gqa` unset, because that info isn't available. This led to https://fb.workplace.com/groups/1028545332188949/posts/1264609398582540 crashing, calling the Math non-GQA variant, with GQA inputs.
This assumes GQA for seqlen mismatches in the export decomp, setting `enable_gqa = <q seqlen> != <kv seqlen>`, relying on prior backend checks to raise on invalid input shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158604
Approved by: https://github.com/angelayi , https://github.com/drisspg
2025-07-24 14:46:13 +00:00
Colin Peppler
a6b7bea244
[inductor] support linear & layer_norm unbacked ( #155267 )
...
### What
- Use `statically_known_true` over `guard_size_oblivious` in cases where we're checking an optimization path. Otherwise, it will DDE and we can't take the safe/slower path.
- For broadcast checks, use `fallback=False` if we encounter a DDE. Unbacked sizes are typically ≥2, which falls in line with size-oblivious reasoning (i.e. when `size_oblivious=True`).
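As a rough sketch of the broadcast point, assuming a toy representation of shapes: `Unbacked`, `is_one`, and `broadcast_shape_model` below are hypothetical and do not mirror Inductor's `broadcast_symbolic_shapes`.

```python
class Unbacked:
    """Hypothetical stand-in for an unbacked symbol such as u0 (value unknown at trace time)."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def is_one(dim, fallback=False):
    """Return dim == 1 for concrete sizes, or the fallback when the size is unbacked."""
    if isinstance(dim, Unbacked):
        return fallback
    return dim == 1


def broadcast_shape_model(a, b):
    """Broadcast two equal-rank shapes, treating unbacked sizes as not equal to 1."""
    if len(a) != len(b):
        raise ValueError("this toy model only handles equal-rank shapes")
    out = []
    for x, y in zip(a, b):
        if is_one(x):
            out.append(y)
        elif is_one(y):
            out.append(x)
        else:
            # Neither side is known to be 1, so assume the sizes match.
            out.append(x)
    return out
```

With `fallback=False`, an unbacked size never takes the "this dimension broadcasts" branch, so no data-dependent guard is needed to pick a result shape.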
### Example DDE
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0)
Caused by: (_inductor/lowering.py:488 in broadcast_symbolic_shapes)
```
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0)
Caused by: (_inductor/ir.py:2797 in create)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155267
Approved by: https://github.com/eellison
2025-07-23 05:42:01 +00:00
AaronWang04
04a393507b
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.
```py
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed


def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)
    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    print("-" * 50)


if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]
    num_benchmark_iterations = 200
    num_warmup_iterations = 20
    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)
        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features
        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the triton compile tests run on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")
for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)
        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        end_event.record()
        torch.cuda.synchronize()
        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/albanD
2025-07-22 22:25:44 +00:00
PyTorch MergeBot
35f1b4ad9e
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit 15ef4f28df .
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking tests internally. @albanD can you please help land this change?You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts . See D78599667 for more info ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3097690935 ))
2025-07-21 17:31:42 +00:00
AaronWang04
15ef4f28df
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked versus unfused torch implementation and torch.compile implementation. Around 9x speedup vs unfused implementation on cuda and slightly faster vs inductor compile on 5090.
```py
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed


def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)
    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    print("-" * 50)


if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]
    num_benchmark_iterations = 200
    num_warmup_iterations = 20
    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)
        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features
        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the triton compile tests run on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")
for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)
        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        end_event.record()
        torch.cuda.synchronize()
        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main (columns: dim 0, dim 1, average ms per forward+backward iteration)
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch (columns: dim 0, dim 1, average ms per forward+backward iteration)
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/eqy , https://github.com/albanD
2025-07-18 23:24:21 +00:00
Xuehai Pan
7f14b42adf
[BE][2/16] fix typos in torch/ (torch/_*/) ( #156312 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 05:47:06 +00:00
PyTorch MergeBot
e15f4248ad
Revert "[BE][2/16] fix typos in torch/ (torch/_*/) ( #156312 )"
...
This reverts commit 7a92b51196 .
Reverted https://github.com/pytorch/pytorch/pull/156312 on behalf of https://github.com/XuehaiPan due to landrace ([comment](https://github.com/pytorch/pytorch/pull/156312#issuecomment-3064672250 ))
2025-07-12 04:40:52 +00:00
Xuehai Pan
7a92b51196
[BE][2/16] fix typos in torch/ (torch/_*/) ( #156312 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156312
Approved by: https://github.com/albanD
2025-07-12 01:47:22 +00:00
PyTorch MergeBot
c553c55be7
Revert "Fix full_like decomposition to preserve strides ( #144765 )"
...
This reverts commit 01b0f09931 .
Reverted https://github.com/pytorch/pytorch/pull/144765 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal tests see [D77652778](https://www.internalfb.com/diff/D77652778 ), @jansel may you help get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/144765#issuecomment-3027975098 ))
2025-07-02 13:56:03 +00:00
Isuru Fernando
01b0f09931
Fix full_like decomposition to preserve strides ( #144765 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144765
Approved by: https://github.com/amjames , https://github.com/jansel
2025-07-01 19:13:22 +00:00
PyTorch MergeBot
6401d1d53d
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit e1aee86646 .
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/davidberard98 due to causing build failures on main branch [GH job link](https://github.com/pytorch/pytorch/actions/runs/16007148842/job/45156382001 ) [HUD commit link](e1aee86646 ) ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3025146176 ))
2025-07-01 18:46:45 +00:00
AaronWang04
e1aee86646
Fused RMSNorm implementation ( #153666 )
...
Relevant: #72643
Benchmarked against the unfused torch implementation and the torch.compile implementation: around a 9x speedup over the unfused implementation on CUDA, and slightly faster than Inductor compile on a 5090.
```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)
    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]
    num_benchmark_iterations = 200
    num_warmup_iterations = 20
    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)
        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features
        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
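As a reading aid for the reference `RMSNorm` module in the benchmark above, here is a minimal pure-Python sketch of the same arithmetic on a single vector. The helper names `rms`, `rms_norm_eps_outside` and `rms_norm_eps_inside` are hypothetical and not part of PyTorch or this PR. The first variant mirrors the reference module, which adds eps to the RMS; the second follows the formula in the `torch.nn.RMSNorm` documentation, which adds eps under the square root. For small eps the two should differ only slightly.

```python
import math


def rms(values):
    # Root mean square of a flat list of floats.
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_norm_eps_outside(values, scale, eps=1e-5):
    # Mirrors the reference module in the benchmark: x / (rms(x) + eps) * scale.
    denom = rms(values) + eps
    return [s * v / denom for v, s in zip(values, scale)]


def rms_norm_eps_inside(values, scale, eps=1e-5):
    # Documented torch.nn.RMSNorm form: x / sqrt(mean(x^2) + eps) * scale.
    denom = math.sqrt(sum(v * v for v in values) / len(values) + eps)
    return [s * v / denom for v, s in zip(values, scale)]


x = [3.0, 4.0]
w = [1.0, 1.0]
outside = rms_norm_eps_outside(x, w)
inside = rms_norm_eps_inside(x, w)
print(outside)
print(inside)
```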
Here are the Triton compile tests, run on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code
torch.manual_seed(0)
device = torch.device("cuda")
for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)
        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        end_event.record()
        torch.cuda.synchronize()
        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main (columns: dim 0, dim 1, average ms per forward+backward iteration)
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch (columns: dim 0, dim 1, average ms per forward+backward iteration)
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
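One way to read the two result listings side by side is to compute, per shape, the ratio of the time on main to the time on the branch. The sketch below is a hypothetical helper (`parse_timings` and `speedups` are not from the PR) that assumes each row has the three whitespace-separated fields printed by the script. Only a few rows copied from the listings above are used as sample input.

```python
def parse_timings(text):
    # Map (dim0, dim1) -> average milliseconds, one row per line.
    timings = {}
    for line in text.strip().splitlines():
        dim0, dim1, avg_ms = line.split()
        timings[(int(dim0), int(dim1))] = float(avg_ms)
    return timings


def speedups(main, branch):
    # Ratio main / branch for shapes present in both runs.
    return {shape: main[shape] / branch[shape] for shape in main if shape in branch}


# Sample rows copied from the "main" and "current branch" listings.
MAIN_SAMPLE = """
32 512 0.1812
256 8192 2.48746
256 32768 11.97932
"""
BRANCH_SAMPLE = """
32 512 0.16328
256 8192 2.10571
256 32768 12.98581
"""

ratios = speedups(parse_timings(MAIN_SAMPLE), parse_timings(BRANCH_SAMPLE))
for shape, ratio in sorted(ratios.items()):
    print(shape, round(ratio, 3))
```

A ratio above 1 means the branch was faster for that shape, and a ratio below 1 means main was faster.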
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel
2025-07-01 18:22:24 +00:00
Tom Ritchford
e2c9d8d641
Fix non-bitwise type annotations for Tensor operators (see #145838 ) ( #146845 )
...
Fix https://github.com/pytorch/pytorch/issues/145838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845
Approved by: https://github.com/Skylion007
2025-06-24 15:41:34 +00:00
Chao Gu
338a8c7853
fix slice w/ dynamic shapes ( #153131 )
...
Summary: guard_size_oblivious has side effects that result in invalid strides when slice nodes take a negative index on dynamic input shapes.
This causes an overflow error with a huge number, “9223372036854776048”.
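For a sense of scale, a small arithmetic check shows that the quoted number lies just above the signed 64-bit maximum. This is an illustration only: it does not reproduce the bug and makes no claim about how the value is produced, and the names `INT64_MAX` and `reported` are introduced here for the example.

```python
INT64_MAX = 2**63 - 1  # largest value representable as a signed 64-bit integer
reported = 9223372036854776048  # value quoted in the commit message

print(reported > INT64_MAX)   # whether the value exceeds the int64 range
print(reported - INT64_MAX)   # how far past the maximum it is
```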
Test Plan: CIs should pass.
Differential Revision: D74354663
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153131
Approved by: https://github.com/laithsakka
2025-06-13 15:53:17 +00:00
Shangdi Yu
3e05a48927
Fix clamp type promotion in inductor decomposition ( #154471 )
...
Summary: As the title says, clamp type promotion should also take the min/max arguments into account.
Test Plan:
```
buck run fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_clamp_decomposition_cpu
python test/inductor/test_torchinductor.py -k test_clamp -v
```
Differential Revision: D75490124
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154471
Approved by: https://github.com/desertfire , https://github.com/chenyang78
2025-05-28 23:24:25 +00:00