Commit Graph

33948 Commits

Author SHA1 Message Date
sanchitintel
7482eb217c [Inductor-CPU] Faster int8 WoQ GEMM for small M with explicit prefetching and different outer loops (#149373)
### Summary

Fixes #148494

Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.

Some of this code (the outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but optionally prefetch a block of B. Templatization is used to avoid branching and the slowdown that unnecessary prefetching would cause.

\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance

### Performance data with BS 1

Machine: 32 cores of one socket of an Intel Xeon SP Gen 5 machine

| Model | Input tokens | Output tokens | Next-token latency before this PR | Next-token latency after this change | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
|GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
|LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms|  9.61% |
|LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms|  7.01% |

While the input shapes of the GEMMs corresponding to linear layers for next-token computation remain the same across different numbers of input & output tokens, the difference in next-token latency between these cases is due to attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen

Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
2025-05-15 11:55:58 +00:00
Pat Vignola
6e107899da [Torch] Fix crash when comparing fp8 tensors that have more than 1 dimension (#153508)
Summary: `torch.nonzero` returns as many items per index as the tensor has dimensions, so we shouldn't expect a single element for the indices.
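A quick illustration of the shape `torch.nonzero` returns for a multi-dimensional tensor (each row is a full multi-dimensional index):

```python
import torch

t = torch.zeros(2, 3)
t[1, 2] = 1.0
idx = torch.nonzero(t)
print(idx.shape)  # torch.Size([1, 2]): one row per nonzero element, ndim items per row
print(idx)        # tensor([[1, 2]]) -- two items for a 2-D tensor, not a single element
```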

Test Plan: CI

Differential Revision: D74539233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153508
Approved by: https://github.com/exclamaforte
2025-05-15 08:41:46 +00:00
Simon Fan
b297e01f4b [ca][dtensor] run real PG dtensor tests under CA (#152689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152689
Approved by: https://github.com/bdhirsh
ghstack dependencies: #153300
2025-05-15 08:10:35 +00:00
Simon Fan
4863e5c843 [ca][dynamo] always run eager checkpoint region's recomputation in eager (#153300)
I slap disable on the recomputation hook, otherwise the partitioner may save less/more activations and mismatch with the expected eager count in checkpoint. See code comment `Note: [compiled autograd and checkpoint unpack hook]`.

This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.

This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region, and when the backward executes the unpack hooks, dynamo tries to trace them. This messes up checkpointing's internal state tracking, with some tests raising _StopRecomputationError and others raising the same count-mismatch error as CA.

FIXES https://github.com/pytorch/pytorch/issues/127115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
2025-05-15 08:10:35 +00:00
PyTorch MergeBot
71027b13b2 Revert "[FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)"
This reverts commit 881a598a1e.

Reverted https://github.com/pytorch/pytorch/pull/153357 on behalf of https://github.com/jeanschmidt due to Might have introduced regressions in rocm testing for main: https://github.com/pytorch/pytorch/actions/runs/15035410497/job/42257000513 feel free to re-merge if this was a mistake ([comment](https://github.com/pytorch/pytorch/pull/153357#issuecomment-2882915691))
2025-05-15 07:58:27 +00:00
Xia, Weiwen
55784be01b [Quant][X86] add ops to compute uint8 pointwise add/add_relu (#152411)
**Summary**
This PR adds two new ops, `onednn.qadd.tensor` and `onednn.qadd_relu.tensor`, for int8 elementwise add, which accept inputs on the CPU device (instead of QuantizedCPU).
The new ops are implemented with AVX512 instructions and provide similar or better performance, depending on shape, than their QuantizedCPU counterparts `quantized.add` and `quantized.add_relu`.
The new ops support output dtypes other than uint8 (fp32, fp16 and bf16 are supported).
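A rough reference sketch of the semantics (the actual `onednn.qadd.tensor` signature is not shown in this PR and may differ): dequantize both uint8 inputs, add in float, optionally apply ReLU, then requantize.

```python
import torch

def qadd_reference(a_u8, a_scale, a_zp, b_u8, b_scale, b_zp,
                   out_scale, out_zp, relu=False):
    # Dequantize, add in float, optionally ReLU, then requantize to uint8.
    a = (a_u8.to(torch.float32) - a_zp) * a_scale
    b = (b_u8.to(torch.float32) - b_zp) * b_scale
    out = a + b
    if relu:
        out = out.relu()
    q = torch.round(out / out_scale) + out_zp
    return q.clamp(0, 255).to(torch.uint8)

x = torch.randint(0, 255, (16,), dtype=torch.uint8)
y = torch.randint(0, 255, (16,), dtype=torch.uint8)
print(qadd_reference(x, 0.1, 128, y, 0.1, 128, 0.2, 128, relu=True))
```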

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_add_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152411
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-05-15 06:23:01 +00:00
Nikita Shulga
d5ddc5ab20 [MPS] Fix float64 scalar tensor handling (#153582)
The current implementation causes a silent correctness problem with torch.compile when someone tries to `torch.compile` a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor.
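A small sketch of the failure mode (assumes an Apple-silicon machine with an MPS device):

```python
import numpy as np
import torch

@torch.compile
def f(x, s):
    return x * s

x = torch.rand(4, device="mps")
# np.exp(0.3) is a numpy float64 scalar, so it crosses into the compiled graph as a
# torch.float64 scalar tensor; before this fix the result could be silently wrong.
print(f(x, np.exp(0.3)))
print(x * np.exp(0.3))  # eager reference
```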

Add a regression test for this behavior
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582
Approved by: https://github.com/dcci
2025-05-15 05:15:14 +00:00
Mandar Deshpande
3e8bda4ad5 [pytorch][triton] flex attention fwd kernel with TMA loads (#151923) (#152460)
Summary:

Device side TMA for flex_attention fwd kernel, Q K V tensors

Test Plan:
Unit test:
```
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- test_tma_with_customer_kernel_options
```
https://www.internalfb.com/intern/testinfra/testrun/14355223891618726

Differential Revision: D71082691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152460
Approved by: https://github.com/drisspg
2025-05-15 04:49:32 +00:00
Daniel Vega-Myhre
881a598a1e [FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation (#153357)
Fixes #147336

## Context

NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.

Bringing this to the attention of triton developer @davidberard98 he identified the memory layout of the tensor in HBM to be causing non-pipelined loads into SRAM, causing the slowdown.

To summarize:

In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.

This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).

i.e., when loading 4 bytes of contiguous data from a tensor stored in row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new location in the col-major layout in SRAM. Thus the load is not a candidate for pipelining w/ cp.async and just moves data to registers then performs a series of single byte stores.

## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs
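A small illustration of what "column-major in the last two dims" means for the V operand (float16 is used here purely for brevity; the enforcement applies to the fp8 inputs):

```python
import torch

B, H, S, D = 2, 4, 128, 64
v = torch.randn(B, H, S, D, dtype=torch.float16)
print(v.stride())             # (32768, 8192, 64, 1): row-major, rows of length D are contiguous

# Make V column-major in its last two dims so columns are contiguous,
# which is what the softmax_scores @ V GEMM wants for its right operand.
v_col_major = v.transpose(-2, -1).contiguous().transpose(-2, -1)
print(v_col_major.stride())   # (32768, 8192, 1, 128): same shape, columns contiguous
assert v_col_major.shape == v.shape and v_col_major.stride(-2) == 1
```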

## Benchmarks
Rerunning the repro we see fp8 runtime is reduced from 120% of bf16 to 76% of bf16 runtime.

Before fix:

```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```

After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-05-15 02:41:38 +00:00
eellison
eaf2dee10e don't run triton mm for k<32 (#153550)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153550
Approved by: https://github.com/suo

Co-authored-by: Natalia Gimelshein <ngimel@meta.com>
2025-05-15 02:36:44 +00:00
karthickai
725bbb6b5f [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and includes it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

In [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py), the operator name is extracted from the FX graph and passed into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that the generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel
2025-05-15 02:33:57 +00:00
Ting Lu
c2bc7e2827 API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ (#150536)
Change the bool to an int to express split_k_mode. Before 0.7.0 there were only two cusparseLtSplitKMode_t enum values, ONE_KERNEL and TWO_KERNELS, so a boolean was enough, but since 0.7.0 there are more.

For Blackwell, there has to be a minor change to the parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since new values were introduced to the enum [cusparseLtSplitKMode_t](https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t) and a bool is no longer enough (it has to be replaced with an integer).

Error we see without the change
```
RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))`

To execute this test, run the following from the base repo dir:
    python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150536
Approved by: https://github.com/jcaip, https://github.com/atalman
2025-05-14 23:36:53 +00:00
PyTorch MergeBot
f363a3f51a Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 9386701b51.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951))
2025-05-14 20:53:49 +00:00
James Wu
dda2c7c8fc Pass inductor config for static cuda launcher to workers (#153382)
Async compile workers generally don't respect inductor configs that get changed in the middle of execution, because they warm up early. StaticCudaLauncher is especially susceptible to this because it affects Triton compilation without being part of the inductor meta. So we'll pass it in via extra configs on each worker run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153382
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-05-14 20:01:32 +00:00
Ben Zickel
a54bf43baa Fix support of MixtureSameFamily [bugfix]. (#151317)
Fixes https://github.com/pyro-ppl/pyro/issues/3419 which is actually a `torch` bug that can be replicated by the below code:

```
from torch import rand
from torch.distributions import MixtureSameFamily, Categorical, Binomial

max_count = 20
probs = rand(10, 5)
binom_probs = rand(10, 5)

d = MixtureSameFamily(Categorical(probs=probs), Binomial(max_count, binom_probs))
d.log_prob(d.sample())
```

which results in:

```
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    d.log_prob(d.sample())
  File "pytorch\torch\distributions\mixture_same_family.py", line 168, in log_prob
    self._validate_sample(x)
  File "pytorch\torch\distributions\distribution.py", line 315, in _validate_sample
    valid = support.check(value)
            ^^^^^^^^^^^^^^^^^^^^
  File "pytorch\torch\distributions\constraints.py", line 307, in check
    (value % 1 == 0) & (self.lower_bound <= value) & (value <= self.upper_bound)
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1
```

### Fix explanation (only for cases where the component distribution contains parameters with batch dimensions)

- The failure is due to sample validation taking place before padding in `MixtureSameFamily.log_prob`, and hence the fix is to pad before doing sample validation.
- The fix itself does not alter the calculations at all. It only affects the sample validation process.
- The failure does not occur with the component distribution set to the `Normal` distribution, as its support check does not involve the distribution's batched parameters (the check itself is still elementwise).
- I've split the `test_mixture_same_family_log_prob` test into two tests based on the `Normal` and `Binomial` distributions.
- Initially, the `Binomial` version of the test did not fail, but this was due to the component distribution having equal batch dimensions of (5, 5) so I changed it to (10, 5).

### Updated fix explanation (for all cases)

- The previous fix caused a bug in sample shape validation (which is done correctly) due to the padding taking place before the sample validation.
- The updated fix corrects the support to reflect the fact that the support of `MixtureSameFamily` is equal to the support of its components distribution with the first event dimension removed.
- This issue was already anticipated in the [code](331423e5c2/torch/distributions/mixture_same_family.py (L127)).
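A small shape sketch of the relationship the updated fix relies on:

```python
import torch
from torch.distributions import MixtureSameFamily, Categorical, Binomial

comp = Binomial(20, torch.rand(10, 5))
mix = MixtureSameFamily(Categorical(probs=torch.rand(10, 5)), comp)

print(comp.batch_shape)  # torch.Size([10, 5]) -- includes the 5 mixture components
print(mix.batch_shape)   # torch.Size([10])    -- the component dimension is mixed out
print(mix.event_shape)   # torch.Size([])      -- Binomial is univariate
# The mixture's support should be the component support with the first event
# dimension removed, so validating a sample of shape (10,) does not broadcast
# against the (10, 5)-shaped Binomial parameters.
print(mix.support)
```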

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151317
Approved by: https://github.com/albanD, https://github.com/fritzo
2025-05-14 19:24:36 +00:00
clr
534b66fe30 torch.compile: Remove reference to the unused dynamo_config.dynamic_shapes from tests (#153297)

This config option is not set anywhere, and does nothing, so this should cause no changes to tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153297
Approved by: https://github.com/Skylion007
2025-05-14 19:02:51 +00:00
PyTorch MergeBot
bf0fe4f828 Revert "[CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)"
This reverts commit ced90d23d3.

Reverted https://github.com/pytorch/pytorch/pull/153101 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages on main, tentative revert: https://github.com/pytorch/pytorch/actions/runs/15024667248/job/42224521705 ([comment](https://github.com/pytorch/pytorch/pull/153101#issuecomment-2881208171))
2025-05-14 18:52:07 +00:00
Nikita Shulga
8749fe8439 [CI][MPS] Speedup test_large_bmm (#153562)
By computing matmuls of only one random non-zero batch on CPU

This reduces test runtime from 11 minutes to 14 sec
```
 % python3 test/test_mps.py -v -k test_large_bmm_
test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok
test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok

----------------------------------------------------------------------
Ran 2 tests in 27.495s

```
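A sketch of the trick with assumed (much smaller) shapes; requires an MPS device:

```python
import torch

B, M, K, N = 64, 256, 256, 256
a = torch.zeros(B, M, K)
b = torch.zeros(B, K, N)
i = torch.randint(B, (1,)).item()       # only one random batch is non-zero
a[i] = torch.randn(M, K)
b[i] = torch.randn(K, N)

out = torch.bmm(a.to("mps"), b.to("mps")).cpu()
ref = a[i] @ b[i]                        # CPU reference for just that batch
torch.testing.assert_close(out[i], ref, rtol=1e-3, atol=1e-3)
```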

TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562
Approved by: https://github.com/Skylion007, https://github.com/clee2000
2025-05-14 18:49:42 +00:00
angelayi
47d6feff7c [export] Support no inputs in unflattened module (#153474)
Encountered in this diff D74589491
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153474
Approved by: https://github.com/avikchaudhuri
2025-05-14 18:45:47 +00:00
Ryan Guo
8bb67700a3 [dynamo] Support delattr on result of torch.compile(module) (#152741)
This is essentially a follow-up on #122098, where we added support of
`getattr` and `setattr` on result of `torch.compile(module)`, but didn't
add support for `delattr`.
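A minimal sketch of the attribute round-trip that now also covers `delattr`:

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def forward(self, x):
        return x + 1

opt = torch.compile(M())
opt.tag = "hello"            # setattr forwards to the wrapped module (#122098)
assert opt.tag == "hello"    # getattr forwards as well
del opt.tag                  # delattr support is what this PR adds
assert not hasattr(opt, "tag")
```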

Fixes #150711.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741
Approved by: https://github.com/anijain2305
ghstack dependencies: #152740
2025-05-14 17:03:59 +00:00
Ryan Guo
6765df052c [dynamo] Emit warning on global module hooks when calling using output of torch.compile(module) (#152740)
When we do `torch.compile(module)`, we eventually end up returning a new
`OptimizedModule` instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) for the compiled module.

`OptimizedModule` also inherits `nn.module.__call__`, and thus
has its own hook logic. This is useful for torchao, which injects module
forward hooks to run in eager for quantization purposes.

However, this might create unexpected behavior for global module hooks,
because `torch.compile(module)` causes the hook to fire one extra time
for `OptimizedModule`, when compared to eager.
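A rough sketch of the extra firing (backend="eager", so no GPU is needed):

```python
import torch
import torch.nn as nn

calls = []
handle = nn.modules.module.register_module_forward_hook(
    lambda mod, inp, out: calls.append(type(mod).__name__)
)

class M(nn.Module):
    def forward(self, x):
        return x + 1

torch.compile(M(), backend="eager")(torch.ones(2))
print(calls)   # expected something like ['M', 'OptimizedModule'] -- one extra firing vs. eager
handle.remove()
```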

To preserve BC, we simply emit a warning for this behavior, and let
users decide what to do. This is reasonable because the global module
hooks are documented to be used for debugging/profiling purposes only.

Fixes #149502

Differential Revision: [D74611716](https://our.internmc.facebook.com/intern/diff/D74611716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-05-14 17:03:59 +00:00
Shangdi Yu
b3dea0c0dd Change aoti cpp tests to run serially within file (#152960)
Fixes #152674
https://github.com/pytorch/pytorch/issues/152889
https://github.com/pytorch/pytorch/issues/152888
https://github.com/pytorch/pytorch/issues/152891

`--dist=loadfile` ensures all tests in the same source file run in the same worker.

Tests like `FreeInactiveConstantBufferRuntimeConstantFoldingCuda` expect exclusive access to memory during test time to compute diffs (e.g., initMemory - updateMemory2 == DATASIZE).

With `-n 3`, tests run in separate processes, but CUDA device memory is shared — and cudaMemGetInfo() reads device-wide global state.

```
 python test/run_test.py --cpp --verbose -i cpp/test_aoti_inference --dist=loadfile
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152960
Approved by: https://github.com/desertfire, https://github.com/cyyever
2025-05-14 17:02:39 +00:00
Anthony Shoumikhin
7d39e73c57 Fix more URLs (#153277)
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
fengqing.lu
de92296bbb [Intel GPU] undo broadcast on zero stride tensor for SDPA (#151976)
Fix https://github.com/pytorch/pytorch/issues/152290.

The model **hubert** uses aten::expand to build the attention mask by broadcasting. PyTorch uses strides[d]=0 to represent the broadcast, which is not supported by oneDNN. This PR handles that scenario.
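A quick illustration of the zero-stride representation produced by `expand`:

```python
import torch

mask = torch.zeros(1, 1, 128)          # e.g. an attention mask
expanded = mask.expand(8, 16, 128)     # broadcast over batch and heads, no copy
print(expanded.stride())               # (0, 0, 1): zero strides mark the broadcast dims
print(expanded.contiguous().stride())  # (2048, 128, 1): materialized copy oneDNN can consume
```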

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151976
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-05-14 16:09:03 +00:00
Shangdi Yu
2e440e39a6 [nativert] Move Placement to pytorch core (#152953)
Summary:
Move Placement to pytorch core.

Using `torch::nativert::isSameDevice` explicitly in code to avoid confusion with the `isSameDevice` in torch namespace.

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:placement_test

./bin/test_nativert
```

OSS and internal CI

Differential Revision: D74190745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152953
Approved by: https://github.com/Skylion007, https://github.com/swolchok, https://github.com/zhxchen17, https://github.com/cyyever
2025-05-14 15:26:54 +00:00
eqy
ced90d23d3 [CUDA][CUDNN] Dispatch to cuDNN for non-batch-splittable 64-bit NCHW convolutions (#153101)
For #152816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153101
Approved by: https://github.com/Skylion007
2025-05-14 15:22:47 +00:00
PyTorch MergeBot
2344eca5eb Revert "Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)"
This reverts commit ee096b89f6.

Reverted https://github.com/pytorch/pytorch/pull/151315 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal regressions, see [D74668899](https://www.internalfb.com/diff/D74668899). @malfet may you help the author get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/151315#issuecomment-2880203323))
2025-05-14 13:15:03 +00:00
PyTorch MergeBot
2c1912452d Revert "Rewrite autograd producer consumer stream sync logic (#151079)"
This reverts commit f78e4529a9.

Reverted https://github.com/pytorch/pytorch/pull/151079 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in internal signals, see [D74648937](https://www.internalfb.com/diff/D74648937) ([comment](https://github.com/pytorch/pytorch/pull/151079#issuecomment-2880176879))
2025-05-14 13:07:12 +00:00
Animesh Jain
864a5f4434 [dynamo][compile-time] Cache the cleaned instructions while inlining (#153333)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153333
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/williamwen42
2025-05-14 09:26:26 +00:00
Wanchao Liang
4c5cf18ee0 [device_mesh] improve device selection logic (#150897)
As titled, this PR improves the device selection logic when the user did not
set the device before calling the DeviceMesh constructor. As a device
manager, DeviceMesh should try to set the device for users in a sensible
way.

The behavior of set_device before:

* If the user calls init_process_group to init a world process group, we assume the user has already called set_device and we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device.
This is OK, but sometimes the set_device heuristic doesn't work well (e.g. if the user uses TORCH_CUDA_VISIBLE_DEVICES).

So this PR improves the device selection logic to:

* If the default CUDA context is already initialized by the time we init DeviceMesh, we assume the user must have run some CUDA operation before and therefore must have selected the device themselves.
* If not, we check whether the env vars "LOCAL_RANK" and "WORLD_SIZE" from the launcher (i.e. torchrun) are set; if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISIBLE_DEVICES issue.)
* If neither of the above applies, we warn the user about the situation and fall back to the old heuristic. A simplified sketch of this order follows.
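A simplified sketch of the new selection order (not the actual DeviceMesh code):

```python
import os
import torch

def pick_cuda_device():
    if torch.cuda.is_initialized():
        # The user already ran a CUDA op, so they must have picked a device.
        return torch.cuda.current_device()
    if "LOCAL_RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Standard torchrun practice: one process per local rank.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return local_rank
    # Otherwise warn and fall back to the old heuristic (omitted here).
    return 0
```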
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
Bin Bao
33a5179269 [AOTI][reland2] Remove typedef for half and bfloat16 (#153467)
Summary:
Reland https://github.com/pytorch/pytorch/pull/151109 after fixing cutlass AOTI build issues.

typedef is prone to name collision. Explicitly spell out the actual aten types, needed for the standalone AOTI codegen.

Differential Revision: D74398762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153467
Approved by: https://github.com/jingsh, https://github.com/henrylhtsang, https://github.com/cyyever
2025-05-14 02:37:18 +00:00
Meet Patel
9ad9a04ca7 Add TensorLR variant for fused Adagrad on CPU (#153078)
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).

I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where `lr.item()` is cast to a double and passed into the default function.
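A sketch of the now-supported call (assumes a build that includes this change):

```python
import torch

p = torch.randn(4, requires_grad=True)
p.grad = torch.randn(4)
# Tensor lr together with fused=True on CPU is what this PR enables for Adagrad.
opt = torch.optim.Adagrad([p], lr=torch.tensor(0.01), fused=True)
opt.step()
```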
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
2025-05-14 02:23:33 +00:00
angelayi
d51bc27378 [export] Make draft_export public (#153219)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153219
Approved by: https://github.com/pianpwk
2025-05-14 02:18:36 +00:00
clr
85f97b5a8c compile_fx: make a compile event that corresponds to the fx_compile waitcounter (#152983)
This is a pretty minor change, but by having exact correspondence, we can
easily confirm data differences between perfetto and wait counters

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152983
Approved by: https://github.com/jansel, https://github.com/masnesral
2025-05-14 01:54:42 +00:00
eqy
9386701b51 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00
Georg Narodoslawsky
8739a8c288 elastic: do not shutdown rendezvous on leaving workers (#152525)
In #117066, shutdown of the rendezvous was added when a worker shuts down. This is incorrect, because the rendezvous is actually shut down in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shut down if a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).

#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But that fix is only triggered if the agent restarts the training, not if the rendezvous was already shut down before.

Removing both of these changes restores the original behavior. The rendezvous should only be shut down when a run completes or fails, not when a single worker leaves.

Fixes #150916
Fixes #147064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
2025-05-14 00:44:10 +00:00
Pian Pawakapan
8ac82c3e72 [export] support functools.partial forward (non-strict) (#153408)
Fixes #153086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153408
Approved by: https://github.com/tugsbayasgalan
2025-05-13 23:30:13 +00:00
Ryan Guo
8ac82a1d20 [dynamo] Add test to ensure we don't print fx graph upon data dependent graph break (#153416)
This adds a regression test for #149831, also as part of getting it
cherry-picked into 2.7.1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153416
Approved by: https://github.com/atalman
2025-05-13 18:28:02 +00:00
Wanchao Liang
9df9d9ded0 [device_mesh] replace dim_group_info with group_name (#150898)
As titled, there's no need to maintain a dim_group_info anymore; we can
simply maintain a list of group_names instead. This simplifies the logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150898
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-05-13 17:16:45 +00:00
Sam Larsen
dde705864a Fix test broken by D73809989 (#153413)
Summary: I forgot to remove this unused field in D73809989.

Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
2025-05-13 16:44:30 +00:00
Simon Fan
216e28f7e9 [ca] run xfails up until their last passing backend (#153279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153279
Approved by: https://github.com/jansel
ghstack dependencies: #153193, #153222
2025-05-13 16:42:10 +00:00
Simon Fan
a80eb84a5f [ca] support higher order gradients (create_graph=True) (#153222)
Adds create_graph support if you don't compile or compile only with torch.compile(backend="eager").

Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, where its double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
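A rough sketch of the now-supported combination (compiled autograd toggled on, compiling only with backend="eager"):

```python
import torch

torch._dynamo.config.compiled_autograd = True

@torch.compile(backend="eager")
def f(x):
    return (x.sin() ** 2).sum()

x = torch.randn(4, requires_grad=True)
(g,) = torch.autograd.grad(f(x), x, create_graph=True)   # gradient that itself has a graph
(gg,) = torch.autograd.grad(g.sum(), x)                   # double backward
print(gg)
```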

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
2025-05-13 16:42:09 +00:00
Simon Fan
37efaf4af9 [ca][api] config api shouldn't error with optimize_assert (#153193)
Toggling on `torch._dynamo.config.compiled_autograd = True` was causing export to error (optimize_assert didn't have `rebuild_ctx` defined). Separately, add a way to `rebuild_ctx` for `optimize_assert` since it is a public API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153193
Approved by: https://github.com/jansel
2025-05-13 16:42:02 +00:00
Guilherme Leobas
a4459cd4e3 Remove property from python_type function (#152900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152900
Approved by: https://github.com/amjames, https://github.com/anijain2305
ghstack dependencies: #153070
2025-05-13 16:26:25 +00:00
Prachi Gupta
c5ebc12f7f [ROCm] unskip test_non_standard_bool except for failing ops (#152956)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152956
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-05-13 15:55:42 +00:00
Animesh Jain
7fdd754136 [compile-time traces] Profile large missing gaps in compile time (#151256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151256
Approved by: https://github.com/bdhirsh, https://github.com/masnesral, https://github.com/zou3519, https://github.com/jansel
2025-05-13 14:44:51 +00:00
Wang, Eikan
ee096b89f6 Fix skipIfXpu and skipIfHpu disables tests when used on class (#151315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151315
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-05-13 14:44:17 +00:00
Howard Huang
d9ef1012db [PP] Optimize memory usage by releasing output memory earlier (#153383)
Considering `output_chunks` is only used for the last stage, we should not keep the outputs of each stage in memory; this allows memory to be freed earlier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153383
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-05-13 14:42:38 +00:00
Zhang, Jianyi
9f98e37eb4 [Intel GPU] add tf32 support for matmul on XPU (#144240)
Support XPU tf32 matmul using torch.backends.mkldnn.allow_tf32; we will discuss in the future whether we need a new API to control matmul only.
~~Support xpu tf32 matmul using torch.set_float32_matmul_precision. For conv, check https://github.com/pytorch/pytorch/pull/137570
We decided not to follow torch.backends.cuda.matmul.allow_tf32 because that API actually calls setAllowTF32CuBLAS to set matmul_precision to high. We also avoid other related tf32 changes (i.e. in inductor) by not introducing a new API.~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144240
Approved by: https://github.com/EikanWang
2025-05-13 14:03:01 +00:00
Michael Lazos
ff039d39ec [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-13 12:17:59 +00:00
Michael Lazos
a415c9831f [Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152570
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506
2025-05-13 12:17:59 +00:00
Michael Lazos
57dafb90ef [Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152506
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410
2025-05-13 12:17:59 +00:00
Michael Lazos
118192011e [Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152410
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505
2025-05-13 12:17:59 +00:00
Michael Lazos
3592cb52d9 [Hierarchical Compilation] Use universal flatten APIs (#152505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152505
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389
2025-05-13 12:17:59 +00:00
Michael Lazos
023a3dc69f [Hierarchical Compilation] Track node mutations (#152389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152389
Approved by: https://github.com/anijain2305
2025-05-13 12:17:59 +00:00
nikitaved
edc2d539d1 torch.tensordot: performance improvements when contracting to a scalar. (#145936)
As per title.
Fixes https://github.com/pytorch/pytorch/issues/145731

Touches only compute. The CPU overhead can potentially be further reduced.
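For reference, "contracting to a scalar" means all dimensions of both operands are contracted, so the result is a 0-d tensor:

```python
import torch

n = 512
A, B = torch.rand(n, n), torch.rand(n, n)

out = torch.tensordot(A, B, [[0, 1], [0, 1]])
# Loose tolerances: the summation order differs between paths in float32.
torch.testing.assert_close(out, (A * B).sum(), rtol=1e-4, atol=1e-2)

out_t = torch.tensordot(A, B, [[0, 1], [1, 0]])
torch.testing.assert_close(out_t, (A * B.T).sum(), rtol=1e-4, atol=1e-2)
```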

Before:
```python
In [3]: n = 512

In [4]: A = torch.rand(n, n)

In [5]: B = torch.rand(n, n)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
2.04 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
2.85 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
2.9 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
4.07 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After
```python
In [2]: n = 512

In [3]: A = torch.rand(n, n)

In [4]: B = torch.rand(n, n)

In [5]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
30.7 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
141 µs ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
142 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
62.8 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145936
Approved by: https://github.com/albanD, https://github.com/ngimel
2025-05-13 10:57:30 +00:00
PyTorch MergeBot
8d7dec6e92 Revert "[DSD] Don't pop tensors if they are on Meta device (#153185)"
This reverts commit 7243c69421.

Reverted https://github.com/pytorch/pytorch/pull/153185 on behalf of https://github.com/jeanschmidt due to Seems to break internal signals, see [D74577069](https://www.internalfb.com/diff/D74577069) ([comment](https://github.com/pytorch/pytorch/pull/153185#issuecomment-2875662357))
2025-05-13 09:13:27 +00:00
FFFrog
29c8ae825f [OpenReg] Move SDPA to OpenReg from open_registration_extension.cpp (#153309)
As the title stated.

**Next Changes**:
- Migrate remaining functionality to OpenReg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153309
Approved by: https://github.com/albanD
2025-05-13 03:49:19 +00:00
Nikita Shulga
a6c5b59067 [MPSInductor] Fix multistage reduction suffixes (#153362)
By invalidating all variables created during the loop except for the contents of iterator_cache, since stores can happen inside the reduction loop, and by clearing the `IteratorRangeEntry` codegen cache.

This results in the following kernel for `x / x.sum()` when x's size is 2048 and the max thread group size is 1024:
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device half* out_ptr1,
    constant half* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp0 = static_cast<float>(in_ptr0[r0_0]);
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, 1024);
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp2 = static_cast<float>(in_ptr0[r0_0]);
        auto tmp3 = tmp2 / tmp1;
        out_ptr1[r0_0] = static_cast<half>(tmp3);
    }
}
```

Fixes the compilation errors reported while running `GPUTests.test_pattern_matcher_multi_user_mps` and `GPUTests.test_weight_norm_bwd_mps`

Fixes https://github.com/pytorch/pytorch/issues/152155

Though inductor tests are still failing, we need to keep refining the variable invalidation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153362
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/jansel
2025-05-13 03:07:53 +00:00
Laith Sakka
8b507a9809 convert guard_size_oblivious to runtime check in infer_size_impl (#148872)
It's OK to check the requirement numel == newsize at runtime in the unbacked case, instead of assuming at compile time that it's true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148872
Approved by: https://github.com/bobrenjc93
2025-05-13 00:32:28 +00:00
Natalia Gimelshein
0cf61ca7e4 make use_mem_pool threadlocal (#153356)
Partial fix for #152861: makes allocation to a pool thread-local, but doesn't touch the second bug where multiple threads allocating to multiple pools error out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153356
Approved by: https://github.com/Skylion007, https://github.com/eellison
2025-05-13 00:16:07 +00:00
Gabriel Ferns
71c8231742 fix bug with TORCHINDUCTOR_DUMP_LAUNCH_PARAMS (#153066)
Summary:
https://fb.workplace.com/groups/1028545332188949/posts/9503194033132340/?comment_id=9504669536318123&reply_comment_id=9506405459477864&notif_id=1746154132646897&notif_t=work_group_comment_mention

Aligns the arguments for the triton inputs

Differential Revision: D74085173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153066
Approved by: https://github.com/jansel
2025-05-12 23:56:49 +00:00
PyTorch MergeBot
641e4bee67 Revert "[export][cond] support merging constant ints as unbacked symint (#152742)"
This reverts commit a805911d15.

Reverted https://github.com/pytorch/pytorch/pull/152742 on behalf of https://github.com/ydwu4 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/152742#issuecomment-2874410372))
2025-05-12 23:06:33 +00:00
Benjamin Glass
b0f2891e43 [AOTInductor] Fix clang-tidy warnings in wrapper (#153197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153197
Approved by: https://github.com/desertfire
2025-05-12 22:35:59 +00:00
Shivam Raikundalia
dbb4444ce3 [Memento] Add PT2 to Memory Snapshot (#152707)
Summary:
To add PT2 information to the memory snapshot, we piggyback off of the Kineto implementation using record_function, similar to how the user annotations are added. To do this we add the following:

1. Stack implementation that we instantiate to keep track of which compile context stack we are currently in (top element of the stack). The stack will be per device and thread-local since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl since no two threads will touch a given stack
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in the fact that we just register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag set to off on default so existing jobs will be unaffected
3. Piping for compile context to pickle output

Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}

Differential Revision: D74028214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
2025-05-12 21:12:51 +00:00
soulitzer
f78e4529a9 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-12 21:07:16 +00:00
Yidi Wu
a805911d15 [export][cond] support merging constant ints as unbacked symint (#152742)
@pianpwk points out that this will be helpful to address several data dependent issues in huggingface [models](e23705e557/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py (L332)) with the following pattern:
```python
idx = 0 if u0 else 1
return x[idx]
```
We could preserve the conditional with a cond.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152742
Approved by: https://github.com/zou3519
2025-05-12 20:26:31 +00:00
Menglu Yu
88a068f33b [2/n][Optimus][Auto-AC] Support activation quantization with scaling (#151770)
Summary:
Previously, we only supported non-scaling quantization, which may lead to overflow. Here we support scaling quantization and set it as the default version.

We quantize activation nodes based on size_in_mb; the default value is 100, i.e., as long as the node is at least 100 MB in size, we will quantize it.

Test Plan:
### how to enable

```
    torch._inductor.config.post_grad_fusion_options = {
        "activation_quantization_aten_pass": {
            "quant_type": "torch.float8_e5m2",  # default dtype to quantize to; you can change the type
            "use_scaling": False,  # default is False; set it to True to use the scaling version
            "size_in_mb": 0.0,  # default is 100; you can tune the value
            "exclude_primals": False,  # whether to exclude quantizing parameters; default is False
            "allowed_dtypes": "torch.float16;torch.bfloat16;torch.float32",  # dtypes considered for quantization, ";"-separated; default is torch.bfloat16
        },
    }
```

### toy model

```
buck2 run mode/opt //scripts/qyz/autoac:quantization
```

```
Epoch [80/200], Loss: 19227.2109
Epoch [100/200], Loss: 1353.5272
Epoch [120/200], Loss: 38630.6758
Epoch [140/200], Loss: 6239.9155
Epoch [160/200], Loss: 6039.1567
Epoch [180/200], Loss: 3994.3569
Epoch [200/200], Loss: 146.3966
```

Differential Revision: D73015996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151770
Approved by: https://github.com/Mingming-Ding
2025-05-12 19:43:18 +00:00
Aaron Gokaslan
3555ebb63d [BE]: Update ruff to 0.11.8 (#153249)
Fixes a ton of false negatives throughout the codebase. RUFF also properly validates NOQA comments now and most of the changes are fixing typos there or removing filewide flake8 suppressions that were also silencing ruff issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153249
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/seemethere
2025-05-12 18:30:52 +00:00
PyTorch MergeBot
5c3fddb9cc Revert "[Hierarchical Compilation] Track node mutations (#152389)"
This reverts commit c2936ebfd5.

Reverted https://github.com/pytorch/pytorch/pull/152389 on behalf of https://github.com/jeanschmidt due to Humm, interesting, there seems to be a bug in stack PRs, as it should be part of the stack and be reverted with the other ones ([comment](https://github.com/pytorch/pytorch/pull/152389#issuecomment-2873540451))
2025-05-12 18:18:44 +00:00
Chien-Chin Huang
498f364518 Fix test_fused_scaled_matmul_reduce_scatter when scatter_dim is 0 (#153286)
The function signature of fused_scaled_matmul_reduce_scatter was changed. This PR fixes the function signature. However when scatter_dim is 1, the two outputs are not close. We need a followup on this.

Another followup is to change fused_scaled_matmul_reduce_scatter to make those newly added arguments optional. Users shouldn't need to pass these arguments if they don't flatten the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153286
Approved by: https://github.com/kwen2501
2025-05-12 17:38:49 +00:00
xinan.lin
dc47295dc5 [Inductor UT][Break XPU] Generalize newly added device-bias code in Inductor UT. (#153355)
Fixes #153123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153355
Approved by: https://github.com/desertfire, https://github.com/Skylion007
2025-05-12 15:53:05 +00:00
rzou
89aa6eb19b Stop codegen-ing post_grad_custom_pass in repros (#153243)
When codegen'ed, it looks like:
```py
post_grad_custom_pass = <object at 0x12345678>
```
Which is not runnable at all. Some logic is also trying to deepcopy the
object, and not all of these objects are deepcopy-able.

This PR skips codegenning of these passes.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153243
Approved by: https://github.com/houseroad
2025-05-12 15:21:11 +00:00
Colin Peppler
7657d80a58 [aoti] when generating example input shapes, use unbacked replacements (#153220)
## Context
Suppose we have this graph like this :
```
a: "[s1 + u2, 200]"
b: "[u0, 32]"
cat: "[s1 + u2, 232]" = torch.cat([a, b], dim=1)
```

NOTE: torch.cat assumes "all tensors must either have the same shape (except in the concatenating dimension) or be a 1-D empty tensor with size (0,)."

So, we would expect u0 = s1 + u2, which is guarded on today, except it's a deferred runtime assertion since unbacked symints aren't replaced today, as Pian noted.

Notice how `a` has a different symbolic shape than both `b` and `cat`. Today, this will create an unexpected shape mismatch when AOTI autotunes. Here's a rough illustration where 8192 is the unbacked symint fallback value.

```
# s1 is an arbitrary integer
a = generate_example_value(size=(s1 + 8192, 200))
b = generate_example_value(size=(8192, 32))
out = generate_example_value(size=(s1 + 8192, 232))
triton_cat.run(a, b, out ...)
```

## Error
```
wrapper.py:1484: <module>: block: [443,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
wrapper.py:1484: <module>: block: [443,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.

RuntimeError: CUDA error: device-side assert triggered
```

Differential Revision: [D74485962](https://our.internmc.facebook.com/intern/diff/D74485962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153220
Approved by: https://github.com/desertfire
2025-05-12 15:20:57 +00:00
PyTorch MergeBot
78d752e96a Revert "[Hierarchical Compilation] Use universal flatten APIs (#152505)"
This reverts commit f9e3a9058e.

Reverted https://github.com/pytorch/pytorch/pull/152505 on behalf of https://github.com/jeanschmidt due to [TENTATIVE] reverting to check if reverting this stack partially caused the introduction of https://github.com/pytorch/pytorch/actions/runs/14966121510/job/42049638969#step:22:875 ([comment](https://github.com/pytorch/pytorch/pull/152505#issuecomment-2872869990))
2025-05-12 14:48:08 +00:00
soulitzer
cb35a2b15d Add missing in-place on view check to custom autograd.Function (#153094)
Fixes https://github.com/pytorch/pytorch/issues/152773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153094
Approved by: https://github.com/albanD
ghstack dependencies: #153005
2025-05-12 14:42:46 +00:00
zhxchen17
a67dd2083c [dynamo] Guard serialization for SHAPE_ENV (#153258)
Differential Revision: [D74483150](https://our.internmc.facebook.com/intern/diff/D74483150/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153258
Approved by: https://github.com/jansel
ghstack dependencies: #153255, #153256, #153257
2025-05-12 14:42:01 +00:00
zhxchen17
e2f6870c98 [dynamo] Guard serialization for DEFAULT_DEVICE (#153257)
Differential Revision: [D74483147](https://our.internmc.facebook.com/intern/diff/D74483147/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153257
Approved by: https://github.com/jansel
ghstack dependencies: #153255, #153256
2025-05-12 14:42:00 +00:00
zhxchen17
ef1dcc21ee [dynamo] Guard serialization for global state guards (GRAD_MODE, DETERMINISTIC_ALGORITHMS, TORCH_FUNCTION_STATE, FSDP_TRAINING_STATE) (#153256)
serialization for global state guards.

Differential Revision: [D74483149](https://our.internmc.facebook.com/intern/diff/D74483149/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153256
Approved by: https://github.com/jansel
ghstack dependencies: #153255
2025-05-12 14:41:53 +00:00
zhxchen17
0210986cc4 [dynamo] Guard serialization for EMPTY_NN_MODULE_HOOKS_DICT (#153255)
EMPTY_NN_MODULE_HOOKS_DICT

Differential Revision: [D74483148](https://our.internmc.facebook.com/intern/diff/D74483148/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153255
Approved by: https://github.com/jansel
2025-05-12 14:41:44 +00:00
PyTorch UpdateBot
23ecd35a96 Update slow tests (#151207)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151207
Approved by: https://github.com/pytorchbot
2025-05-12 12:05:58 +00:00
PyTorch MergeBot
47df195065 Revert "[Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)"
This reverts commit bc8b305eb8.

Reverted https://github.com/pytorch/pytorch/pull/152410 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
PyTorch MergeBot
0e36887209 Revert "[Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)"
This reverts commit 779e647999.

Reverted https://github.com/pytorch/pytorch/pull/152506 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
PyTorch MergeBot
53ebcabb52 Revert "[Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)"
This reverts commit 50df08eb5e.

Reverted https://github.com/pytorch/pytorch/pull/152570 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
PyTorch MergeBot
aa7fe6af41 Revert "[Dynamo] Optimize dedupe region ancestor tracking (#152589)"
This reverts commit b5f1345f72.

Reverted https://github.com/pytorch/pytorch/pull/152589 on behalf of https://github.com/jeanschmidt due to Breaking internal signal citadel-fbcode-test-mode-opt-for-pt2_stack_for_internal-linux-0 please see diff [D74531503](https://www.internalfb.com/diff/D74531503) for more details ([comment](https://github.com/pytorch/pytorch/pull/152410#issuecomment-2871168679))
2025-05-12 07:15:09 +00:00
Chien-Chin Huang
7243c69421 [DSD] Don't pop tensors if they are on Meta device (#153185)
DSD currently will pop tensors if they are on the Meta device. This forbids the use case where users would like to let DCP directly initialize the tensors when loading.

This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above behavior, is not realistic, and is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
2025-05-12 07:04:59 +00:00
Aaron Gokaslan
032ef48725 [BE]: Add PEP621 project section to pyproject.toml (#153055)
Follow-up to @ezyang's PR #153020, but makes better use of PEP 621 to reduce redundant fields and pass metadata through to uv, setuptools, poetry, and other tooling.

* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported Python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml and they will be auto-populated in the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels

This also clearly shows us which fields will need to be migrated to pyproject.toml from setup.py over time, per #152276. Static fields will be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.

Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
2025-05-12 02:16:07 +00:00
Yidi Wu
ceb009baee [map] always turn on dynamo for map (#152041)
Summary:
X-link: https://github.com/pytorch/executorch/pull/10409

Reland D72896450

Make map consistent with other control flow ops. After the change, map is able to support accessing closures in the map fn.

Test Plan: See existing tests.

Reviewed By: zou3519

Differential Revision: D73138427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152041
Approved by: https://github.com/zou3519
2025-05-12 02:10:08 +00:00
Zhengxu Chen
c51bdf5acf [export] Exporter API prototype. (#153205)
Summary: see inline code comments for documentation

Test Plan:
CI

buck2 test --flagfile fbcode//mode/opt fbcode//caffe2/test:test_export -- -r TestPackage

Differential Revision: D74426900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153205
Approved by: https://github.com/tugsbayasgalan
2025-05-11 14:20:09 +00:00
henrylhtsang
1f5cf19f56 [cutlass backend] Use src code to generate cutlass gemm name (#153006)
This shaves off 40s for at least small cases, since we don't have to recompile the kernel again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153006
Approved by: https://github.com/mlazos
2025-05-11 00:57:03 +00:00
PyTorch MergeBot
fdc387ec7c Revert "refine fp32 precision api (#125888)"
This reverts commit 4c11b26158.

Reverted https://github.com/pytorch/pytorch/pull/125888 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause some failures on ROCm ([comment](https://github.com/pytorch/pytorch/pull/125888#issuecomment-2869274791))
2025-05-11 00:35:46 +00:00
haozhe.zhu
4c11b26158 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will name the algorithm directly.

### Design Choice: Directly use algorithm names like "TF32", "BF16".
#### Pros
 - The names are more informative: 'tf32' says more than a generic "high".
 - Easier to extend with a new algorithm like `tf32x3`.
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision of the different algorithms; however, we can add documentation to explain that ordering.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### We provide the following fp32 compute precisions that can be set:
 - **"ieee"**: not allowed to use any other internal computation data type.
 - **"tf32"**: allowed to use tf32 as the internal computation data type.
 - **"bf16"**: allowed to use bf16 as the internal computation data type.
 - **"none"**: precision is not set and can be overridden by its parent node.

### Overriding Precision Settings
A child node is overridden by its parent node if the child is left at the default ("none").
For the current default settings:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = matmul, op = all, precision setting = none
        backend = matmul, op = conv, precision setting = none
        backend = matmul, op = rnn, precision setting = none
        backend = matmul, op = matmul, precision setting = none
```
 - If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
 - If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16".

### Backward Compatibility
Since the new API allows more fine-grained control, some states cannot be represented by the old API. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent `torch.backends.cudnn.rnn.fp32_precision="ieee"` combined with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
 - If the user only uses the previous APIs, they will work as they did before.
 - If the user uses the **new** API to move to a state that is **un-representable** by the old API, and then tries to access that state through the **old** API, we raise a RuntimeError and point the user to the documentation.

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
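
For illustration, here is a minimal usage sketch of the layered API described above. The attribute names come from the hierarchy listed in this commit; the inheritance behavior noted in the comments is a reading of the description above, not a verified run.
```python
import torch

# Set a backend-wide fp32 precision; children left at "none" should inherit it
# per the override rules described in this commit (illustrative, not verified).
torch.backends.mkldnn.fp32_precision = "bf16"

print(torch.backends.mkldnn.matmul.fp32_precision)
print(torch.backends.mkldnn.conv.fp32_precision)
```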

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-05-10 11:13:04 +00:00
Michael Lazos
b5f1345f72 [Dynamo] Optimize dedupe region ancestor tracking (#152589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152589
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506, #152570, #152572
2025-05-10 08:27:56 +00:00
Michael Lazos
50df08eb5e [Hierarchical Compile] Replace tracing alias and mutation check with dynamo impl (#152570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152570
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410, #152506
2025-05-10 08:27:45 +00:00
Michael Lazos
779e647999 [Hierarchical Compile] Take into account mutation deps in cycle detection (#152506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152506
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505, #152410
2025-05-10 08:27:31 +00:00
Michael Lazos
bc8b305eb8 [Hierarchical Compile] Add mutation dependencies to topological sorting (#152410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152410
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389, #152505
2025-05-10 08:27:19 +00:00
Michael Lazos
f9e3a9058e [Hierarchical Compilation] Use universal flatten APIs (#152505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152505
Approved by: https://github.com/anijain2305
ghstack dependencies: #152389
2025-05-10 08:27:07 +00:00
Michael Lazos
c2936ebfd5 [Hierarchical Compilation] Track node mutations (#152389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152389
Approved by: https://github.com/anijain2305
2025-05-10 08:27:01 +00:00
Xilun Wu
bc4cf1c13a [BE] fix failing test_dp_state_dict_save_load on ROCm CI where world_size=7 (#153283)
**Summary**
I saw an unrelated CI failure `distributed/_composable/fsdp/test_fully_shard_state_dict.py::TestFullyShardStateDictMultiProcess::test_dp_state_dict_save_load` in one of my PR: https://hud.pytorch.org/pr/pytorch/pytorch/153225#41930032096

This is caused by triggering uneven sharding in FSDP2 at cbb03e6971/torch/distributed/fsdp/_fully_shard/_fsdp_param.py (L353-L361)

This didn't show up before because the CUDA CI machines have an even number of GPUs (e.g. 2/4/8), but that's not the case on the ROCm CI. For the failing CI case, the device count is 7.

**Solution**
Skip the test if `self.world_size` does not evenly divide `mlp_dim` (i.e. 16).

**Test**
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153283
Approved by: https://github.com/fegin, https://github.com/weifengpy
2025-05-10 04:46:32 +00:00
Jake Stevens
b86d46ff21 [torch][ao] Properly strip tracking stats in _fold_conv_bn_qat for 1D (#152982)
Summary: _fold_conv_bn_qat has logic to remove the tracking stats. Currently, its check covers only torch.nn.modules.batchnorm.BatchNorm2d, so the tracking stats are not properly removed when a 1D batch norm is used. This diff fixes that.

Test Plan:
Run N7113483 without this fix.

{F1977726982}

```
bento kernel build sensorml
```

Re-run with local version of kernel, containing this diff:

{F1977727151}

Notice that now, num_batches is removed.
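
The gist of the fix, sketched below with a hypothetical helper (not the actual code in `_fold_conv_bn_qat`, which operates on the exported graph rather than nn.Modules), is to widen the batch-norm check so the 1D variant is also covered when stripping `num_batches_tracked`:
```python
import torch

# Hypothetical helper illustrating the widened check.
_BN_TYPES = (
    torch.nn.modules.batchnorm.BatchNorm1d,  # newly covered by this diff
    torch.nn.modules.batchnorm.BatchNorm2d,  # previously the only type checked
)

def strip_tracking_stats(model: torch.nn.Module) -> None:
    for module in model.modules():
        if isinstance(module, _BN_TYPES):
            # num_batches_tracked is the tracking stat that should not survive folding.
            module._buffers.pop("num_batches_tracked", None)
```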

Differential Revision: D74269649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152982
Approved by: https://github.com/andrewor14, https://github.com/yushangdi
2025-05-10 01:20:18 +00:00
Natalia Gimelshein
9c99ea2991 error out on negative offs or on K=0 in group gemm (#153226)
Error out if K=0 in one of the grouped gemms to avoid hangs in #152668
Also, adds meta function for _scaled_grouped_mm (TODO: do the same for _grouped_mm, unless it's done already)

One weird thing I'm seeing: when running all grouped_gemm tests, I'm erroring out with
```
  File "/data/users/ngimel/pytorch/torch/_inductor/graph.py", line 1246, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/data/users/ngimel/pytorch/torch/_inductor/lowering.py", line 445, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 444, in tuned_scaled_grouped_mm
    if is_nonzero and can_use_triton_kernel(mat_a, mat_b, offs, bias):
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 375, in can_use_triton_kernel
    offs is not None
  File "/home/ngimel/.conda/envs/pytorch_monarch/lib/python3.10/site-packages/sympy/core/relational.py", line 516, in __bool__
    raise TypeError("cannot determine truth value of Relational")
torch._inductor.exc.InductorError: LoweringException: TypeError: cannot determine truth value of Relational
```
which is weird: there's no relational that sympy has to evaluate in `offs is not None`, and when running this test separately (`test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_use_torch_compile_True_cuda`) it passes. I suspect some autotuning cache has to be reset between runs, but I don't know what to look for.
Edit: that error is "fixed" by setting `dynamic=False`; now, with the correct meta function, something's wrong with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153226
Approved by: https://github.com/kwen2501
2025-05-10 01:13:18 +00:00
PyTorch MergeBot
e6dccb036e Revert "Fix fake tensor caching when output has unbacked (#153034)"
This reverts commit 4f425a0397.

Reverted https://github.com/pytorch/pytorch/pull/153034 on behalf of https://github.com/malfet due to Broke pr_time_benchmarks, see d07fbd41e3/1 ([comment](https://github.com/pytorch/pytorch/pull/153034#issuecomment-2868100487))
2025-05-09 23:43:56 +00:00
Shangdi Yu
ee326137a9 [reland] Add graph module runtime asserts to AOTI (#153182)
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925

A reland of https://github.com/pytorch/pytorch/pull/152125.

added a try-except around the justknob internally. Also added more documentation

Currently, AOTI only generate runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.

Also factored out the run time assertion logic to a separate function.

We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from input graphs because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into no-ops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we already know "oh, of course this expression is True".

One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, a, b, c):
                nz = torch.nonzero(a)
                ones = a.new_ones([nz.size(0), b.size(0)])
                torch._check(ones.size(0) >= 1)
                equals = torch.add(ones, c)
                return equals
        torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that verifies `a` and `nonzero` have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See `test_unbacked_equals_input_size_runtime_assertion` in test_aot_inductor.

In addition to the Inductor generated runtime asserts, we also need the runtime asserts from the input graph, because some derived runtime asserts are not generated in Inductor. One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, x):
                y = x.reshape(100, -1).clone()
                y = y + 1
                return y

        dynamic_shapes = {
            "x": {0: torch.export.Dim.DYNAMIC},
        }
        x.shape[0] needs to be a multiple of 100.
```
See `test_aoti_runtime_asserts_backed_symint` in test_aot_inductor.

Example:

```
    def forward(self):
        arg0_1: "f32[s35]";

        arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

         #
        mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
        eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
        clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
        add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
        return (add_6,)
```

Generated cpp code:

```
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
    auto arg0_1 = std::move(inputs[0]);
    auto arg0_1_size = arg0_1.sizes();
    int64_t s35 = arg0_1_size[0];
    inputs.clear();
    auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchinductor_dynamic_shapes -- -r test_unbacked_floordiv_simplify
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1 buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_i64_input_codegen_cuda
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1  buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r  test_unbacked_equals_input_size
```

Differential Revision: D74361799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153182
Approved by: https://github.com/henrylhtsang
2025-05-09 22:56:19 +00:00
Aaron Orenstein
4f425a0397 Fix fake tensor caching when output has unbacked (#153034)
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.

This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.

However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output.  In this case we shouldn't cache at all because what would that really mean?

So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.

Added a test which checks for this case.

While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.
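
A rough toy-model sketch of the resulting caching policy (symbols are plain strings here and the helper is illustrative, not the real FakeTensorMode internals):
```python
# Toy model of the policy described above; "u*" names stand in for unbacked symbols.
def cache_decision(input_syms: set[str], output_syms: set[str]) -> str:
    # An unbacked symbol in the output that doesn't come from the inputs makes a
    # cache entry meaningless, so bypass the cache for this op.
    if any(s.startswith("u") and s not in input_syms for s in output_syms):
        return "bypass"
    # Otherwise cache on the FakeTensorMode (no symbolic inputs) or on the
    # ShapeEnv (symbolic inputs), so entries stay tied to the guards in place.
    return "cache_on_fake_tensor_mode" if not input_syms else "cache_on_shape_env"

print(cache_decision(set(), {"u0"}))   # bypass (e.g. an op producing an unbacked output)
print(cache_decision({"s0"}, {"s0"}))  # cache_on_shape_env
print(cache_decision(set(), set()))    # cache_on_fake_tensor_mode
```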

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
2025-05-09 21:17:54 +00:00
Xilun Wu
cbb03e6971 [BE][DTensor] move torch.distributed._tensor import to torch.distributed.tensor in test files (#153225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153225
Approved by: https://github.com/kwen2501, https://github.com/fegin
2025-05-09 20:40:54 +00:00
Ryan Guo
3976e52264 Fix torch.isin decomposition for scalar inputs (#153216)
This patch fixes a corner case of the `torch.isin` decomposition when both inputs are scalars. This pattern showed up in #141196.

Fixes #141196.

Error stack before this patch:
```
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12503, in test_scalar_isin_decomposition
    res = opt_f()
          ^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 691, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1618, in _call_user_compiler
    raise BackendCompilerFailed(
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1593, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/__init__.py", line 2365, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_inductor/compile_fx.py", line 2317, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/backends/common.py", line 106, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1179, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 923, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1164, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 576, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 826, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 180, in aot_dispatch_base
    fw_module, updated_flat_args, maybe_subclass_meta = aot_dispatch_base_graph(  # type: ignore[misc]

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2199, in _trace_inner
    t = dispatch_trace(
        ^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_compile.py", line 51, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1223, in dispatch_trace
    graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/_symbolic_trace.py", line 850, in trace
    (self.create_arg(fn(*args)),),
                     ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1278, in wrapped
    out = f(*tensors)  # type:ignore[call-arg]
          ^^^^^^^^^^^
  File "<string>", line 1, in <lambda>
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 720, in inner_fn
    outs = fn(*args)
           ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 419, in _functionalized_f_helper
    f_outs = fn(*f_args)
             ^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 81, in inner_fn
    outs = fn(*args)
           ^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 902, in functional_call
    out = PropagateUnbackedSymInts(mod).run(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 171, in run
    self.env[node] = self.run_node(node)
                     ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7387, in run_node
    result = super().run_node(n)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 240, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 320, in call_function
    return target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1326, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_subclasses/functional_tensor.py", line 511, in __torch_dispatch__
    outs_unwrapped = func._op_dk(
                     ^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1428, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 797, in proxy_call
    r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2358, in maybe_handle_decomp
    out = CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5108, in isin
    return isin_default(elements, test_elements, invert=invert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5137, in isin_default
    x = elements.view(*elements.shape, *((1,) * test_elements.ndim))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: view() received an invalid combination of arguments - got (), but expected one of:
 * (torch.dtype dtype)
 * (tuple of ints size)

While executing %isin : [num_users=1] = call_function[target=torch.isin](args = (%x, %x), kwargs = {})
GraphModule: class GraphModule(torch.nn.Module):
    def forward(self):
         # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12498 in f, code: x = torch.tensor(0)
        x: "i64[][]" = torch.tensor(0)

         # File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12499 in f, code: return torch.isin(x, x)
        isin: "b8[][]" = torch.isin(x, x);  x = None
        return (isin,)

Original traceback:
  File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12499, in f
    return torch.isin(x, x)
```
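
For reference, a minimal repro of the scalar/scalar pattern exercised above (mirroring the test shown in the captured graph):
```python
import torch

@torch.compile(backend="inductor")
def f():
    x = torch.tensor(0)        # 0-dim tensor, so test_elements.ndim == 0
    return torch.isin(x, x)    # previously tripped the view() call in isin_default

print(f())  # expected: tensor(True)
```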

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153216
Approved by: https://github.com/williamwen42, https://github.com/peterbell10
2025-05-09 20:26:25 +00:00
Pian Pawakapan
d808a3e203 [dynamic shapes] guard_or_false for computeStorageNbytes (#150483)
removes fast path for computing storage, fixes some adjacent tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150483
Approved by: https://github.com/laithsakka
2025-05-09 19:31:19 +00:00
Zhengxu Chen
fe11d300ac [nativert] Improve MPMCQueue tests. (#153154)
Summary:
- Use std::this_thread::yield and stop busy waiting.
- Sort test file order.

Following up @swolchok's comment from https://github.com/pytorch/pytorch/pull/152837
Test Plan: CI

Differential Revision: D74402536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153154
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-05-09 19:25:42 +00:00
Ti-Tai Wang
90fde0dc09 [ONNX] Support sym_float (#153200)
Fixes #153115

Note: torch.sym_int is not supported in this PR because it's not appeared in exported program, instead, it's `torch.ops.aten.sym_size.int()`.

```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[s35, s16]"):
             #
            sym_size_int_1: "Sym(s35)" = torch.ops.aten.sym_size.int(x, 0);  x = None
            return (sym_size_int_1,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153200
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-05-09 19:10:17 +00:00
Gabriel Ferns
da0b89bcbf Scheduler Flops refactor (#152708)
This refactors `estimate_flops` and `get_estimated_runtime` on scheduler nodes:
1. New function on BaseSchedulerNode: `estimate_flops`. Works with all types of ir nodes now, not just `ExternalKernels`.
1. Extends `get_estimated_runtime` to work with non-`ExternalKernels`.

Prelude to: https://github.com/pytorch/pytorch/pull/149697

Testing:
New unit tests cover functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152708
Approved by: https://github.com/xmfan, https://github.com/eellison
2025-05-09 19:01:43 +00:00
Boyuan Feng
073b0257ba [Graph Partition] Maintain relative order within partition during reordering (#153111)
PR #151968 adds `reorder_for_minimizing_partition` for the minimal number of partitions. If reordering two nodes cannot reduce the number of partitions, `reorder_for_minimizing_partition` should maintain the relative order of these two nodes and rely on other reorder passes for some nice features, such as shorter liveness duration or less peak memory. In an extreme case, when all nodes are on gpu and can be cudagraphed, `reorder_for_minimizing_partition` should not reorder any nodes.

This PR improves `reorder_for_minimizing_partition` to enforce the invariant that the relative order of nodes within the same graph partition is maintained. To do so, we record the index of each node in the input `nodes: list[BaseSchedulerNode]` and use a heap to pop the node with the smallest index, so we always schedule the lowest-index ready node within a graph partition and respect the invariant. The previous implementation tried to use a queue to achieve this but failed, because node_N at the end may rely on node_1 at the start, so node_N gets added to the queue as soon as node_1 is scheduled.
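
The ordering trick can be illustrated with a small standalone sketch (not the actual scheduler code): each ready node is pushed with its original index and the smallest index is always popped, so nodes keep their original relative order.
```python
import heapq

def order_preserving_toposort(nodes: list[str], deps: dict[str, set[str]]) -> list[str]:
    # Illustrative sketch only: schedule ready nodes by their original index.
    index = {n: i for i, n in enumerate(nodes)}
    remaining = {n: set(deps.get(n, ())) for n in nodes}
    heap = [index[n] for n in nodes if not remaining[n]]
    heapq.heapify(heap)
    out = []
    while heap:
        n = nodes[heapq.heappop(heap)]  # smallest original index among ready nodes
        out.append(n)
        for m in nodes:
            if n in remaining[m]:
                remaining[m].discard(n)
                if not remaining[m]:
                    heapq.heappush(heap, index[m])
    return out

print(order_preserving_toposort(["a", "b", "c", "d"], {"d": {"a"}}))  # ['a', 'b', 'c', 'd']
```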

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153111
Approved by: https://github.com/eellison
2025-05-09 18:49:53 +00:00
henrylhtsang
595e21a9dd [cutlass-3] Add cutlass key for fbcode and OSS (#153081)
Differential Revision: [D74337959](https://our.internmc.facebook.com/intern/diff/D74337959/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153081
Approved by: https://github.com/drisspg
2025-05-09 17:38:31 +00:00
Boyuan Feng
ffda46e3be [Graph Partition] remove weak dep from partition_input_names (#152863)
Graph partition analyzes read_writes to get the partition input names. However, a weak dep is a fake dependency that is not actually read or written, so we should not include weak deps in the graph partition input names.

The following test failure is fixed by removing weak dependency from partition_input_names:
`PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py TestTorchDeviceTypeCUDA.test_params_invalidated_with_grads_invalidated_between_unscale_and_step_Adam_cuda_float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152863
Approved by: https://github.com/eellison
2025-05-09 17:20:04 +00:00
Pian Pawakapan
8ea95d2e73 [inductor] dtype promotion error in cat decomp (#152995)
Cloning a single tensor wasn't following the dtype promotion rules; this showed up in the SAM model: https://github.com/pytorch/pytorch/issues/152606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi, https://github.com/eellison
2025-05-09 16:58:58 +00:00
Ryan Guo
18e13a67ce [dynamo] Harden torch function dispatchability check for attributes and methods access (#153082)
See more details in
https://github.com/pytorch/pytorch/issues/151771#issuecomment-2836372110.

Fixes #151771.

Differential Revision: [D74342291](https://our.internmc.facebook.com/intern/diff/D74342291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153082
Approved by: https://github.com/mlazos
2025-05-09 16:14:23 +00:00
Ankita George
916f6bafe7 Fix HF loading when there's no metadata file to work with fsspec (#152856)
Summary: HF loading when there is no metadata file is an edge case for some users. We were previously calling safe_open(filename) to get the keys in the safetensors file, but this doesn't work with fsspec when models are on a backend other than the local fs (i.e. hf, s3, etc.). This diff updates the code to open the file with fsspec.open() and then use safetensors.deserialize() to get the keys.
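
A hedged sketch of the new approach (the exact call sites in DCP differ; the `deserialize` usage follows the description above):
```python
import fsspec
from safetensors import deserialize

def safetensors_keys(path: str) -> list[str]:
    # Works for local paths as well as fsspec backends such as hf:// or s3://.
    with fsspec.open(path, "rb") as f:
        data = f.read()
    # deserialize() yields (tensor_name, tensor_info) pairs from the raw bytes.
    return [name for name, _ in deserialize(data)]
```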

Test Plan: unit test and e2e test reading from hf

Differential Revision: D74181513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152856
Approved by: https://github.com/joecummings
2025-05-09 13:32:01 +00:00
Yu, Guangye
e06a08059a Add device guard for xpu conv on multi device (#153067)
# Motivation
fixes https://github.com/pytorch/pytorch/issues/153022
The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.

# Additional Context
run the following script
```python
import torch
import torchvision.models as models

torch.manual_seed(0)

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

device = torch.device('xpu:1')  # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)

with torch.no_grad():
    ret = model(data)
    print(ret)

print("Execution finished")
```
The output is
```bash
         -9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01,  6.4551e-01,
         -6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01,  3.2715e-02,
         -4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
         -8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01,  2.0312e+00]],
       device='xpu:1', dtype=torch.float16)
Execution finished

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153067
Approved by: https://github.com/albanD, https://github.com/EikanWang
2025-05-09 09:41:51 +00:00
Dmitry Rogozhkin
aca2c99a65 xpu: get xpu arch flags at runtime in cpp_extensions (#152192)
This commit moves the query for XPU arch flags to runtime when building SYCL extensions, which allows adjusting `TORCH_XPU_ARCH_LIST` at the Python script level. That's handy, for example, in a CI test that tries a few variants of the list.

CC: @malfet, @jingxu10, @EikanWang, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152192
Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/albanD
2025-05-09 05:43:50 +00:00
Michael Lazos
c54aa0da01 [Cutlass] Fix tests (#153196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153196
Approved by: https://github.com/BoyuanFeng
2025-05-09 05:39:05 +00:00
eqy
b30d276abc [CUDA][cuBLASLt] Fix scale setting for allowFP16AccumulationCuBLAS true case (#153083)
Also add some missing `@onlyCUDA` / support check decorators in `test_matmul_cuda.py`
Should help resolve #151890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153083
Approved by: https://github.com/janeyx99
2025-05-09 02:27:17 +00:00
Ke Wen
4064062e18 [c10d] Test multiple CUDA Graph captures (#150040)
1. Do multiple captures
2. Perform multiple collectives in one capture
3. Multiple replays (existing)
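
A minimal sketch of the capture/replay structure being tested (collectives inside the capture are omitted here, and warmup/stream handling is simplified):
```python
import torch

x = torch.zeros(4, device="cuda")

graphs = []
for _ in range(2):                 # multiple captures
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):      # capture records work without executing it
        x += 1
    graphs.append(g)

for g in graphs:                   # multiple replays of each captured graph
    g.replay()
    g.replay()

torch.cuda.synchronize()
print(x)                           # x reflects one increment per replay
```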

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150040
Approved by: https://github.com/fduwjj
2025-05-08 22:14:03 +00:00
Will Feng
d9dc6b56ec Support using SymInt shapes for torch.baddbmm no-broadcast case (#153112)
A typical `bmm` kernel in Helion needs to pass in symint shapes to `torch.baddbmm`. Currently `self.expand((dim1, dim2, dim3))` in baddbmm runs unconditionally and it doesn't work with symint shapes (it raises the following error):
```
Traceback (most recent call last):
  File "/home/willfeng/local/helion_yf225/helion/_compiler/type_propagation.py", line 699, in propagate_call
    CheckForIndexCalls.retry_call(self.value, proxy_args, proxy_kwargs),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/helion_yf225/helion/_compiler/tile_index_proxy.py", line 104, in retry_call
    return fn(*proxy_args, **proxy_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1338, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1986, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1450, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 2645, in _dispatch_impl
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_ops.py", line 806, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_meta_registrations.py", line 2172, in meta_baddbmm
    self = self.expand((dim1, dim2, dim3))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /home/willfeng/local/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd_0.cpp:5025: SymIntArrayRef expected to contain only concrete integers
```
This PR changes it so that we don't run `expand()` when not necessary, which makes the Helion use case (i.e. no broadcasting) work.
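
A hedged sketch of the idea (not the exact diff to `meta_baddbmm`): only call `expand()` when `self` doesn't already have the target shape, so the no-broadcast SymInt path never hits the expand.
```python
# Hedged sketch only; the real change lives in meta_baddbmm and may differ in detail.
def maybe_expand_self(self, dim1, dim2, dim3):
    if self.shape != (dim1, dim2, dim3):      # no-op for the no-broadcast case
        self = self.expand((dim1, dim2, dim3))
    return self
```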

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153112
Approved by: https://github.com/jansel
2025-05-08 21:34:24 +00:00
Pian Pawakapan
4166373908 [dynamic shapes] guard_or_false for infer_size (#152146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152146
Approved by: https://github.com/laithsakka
2025-05-08 21:27:22 +00:00
Boyuan Feng
590965f92f [Graph Partition][Flex Attention] analyze symints from subgraph inputs and outputs (#152878)
Flex Attention may have symints in its subgraph inputs and outputs. Existing code implicitly captures these symints but does not explicitly store them in TritonTemplateBuffer. This leads to errors when analyzing the symints used in Flex Attention as a TritonTemplateBuffer. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152878
Approved by: https://github.com/drisspg
2025-05-08 20:25:35 +00:00
Aidyn-A
086e2c2399 [TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on sm_120+ (#152814)
The float8 row-wise scaled matmuls are not supported on Blackwell yet. This PR adds skips to those tests to decrease the noise on `sm_120+` machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152814
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-05-08 19:34:20 +00:00
Mu-Chu Lee
b3524080dc [AOTInductor] Generate kernels separately for const graph and main graph (#153040)
Summary:
We should generate the kernels for the const graph and the main graph separately. The reason is that when we run autotuning, we create separate kernel calls, and we need to make sure that the main graph also contains the runner.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_autotune_with_constant_folding

Differential Revision: [D74347765](https://our.internmc.facebook.com/intern/diff/D74347765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153040
Approved by: https://github.com/angelayi
2025-05-08 18:45:45 +00:00
Isuru Fernando
e5f869999c [inductor] Fix ModularIndexing assumptions (#152993)
Fixes https://github.com/pytorch/pytorch/issues/151198.

Since the result of ModularIndexing can be zero due to the modulo operation, we should not make any assumption about ModularIndexing being positive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152993
Approved by: https://github.com/yf225
2025-05-08 18:26:45 +00:00
Alvaro-Kothe
e86b6b2a19 Add tests to check pretty print when padding is a string in C++ API (#153126)
Currently there are no tests verifying the behaviour of pretty print when padding is `torch::kSame` or `torch::kValid`. This PR adds these tests to guard against future regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153126
Approved by: https://github.com/Skylion007
2025-05-08 17:55:25 +00:00
PyTorch MergeBot
d36261d2e6 Revert "[dynamo] Avoid running torch.nn.Module.__call__ twice under torch.compile(mod) (#152740)"
This reverts commit 0886d402f1.

Reverted https://github.com/pytorch/pytorch/pull/152740 on behalf of https://github.com/huydhn due to Discuss with the author to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152740#issuecomment-2863779028))
2025-05-08 17:31:21 +00:00
PyTorch MergeBot
34d424d813 Revert "[dynamo] Support delattr on result of torch.compile(module) (#152741)"
This reverts commit 6c025b5a82.

Reverted https://github.com/pytorch/pytorch/pull/152741 on behalf of https://github.com/huydhn due to Discuss with the author to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/152740#issuecomment-2863779028))
2025-05-08 17:31:21 +00:00
angelayi
3cd69350ed [export] Unflatten None (#153000)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153000
Approved by: https://github.com/pianpwk
2025-05-08 16:40:13 +00:00
PyTorch MergeBot
7b806a8cb1 Revert "[inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)"
This reverts commit 9357635127.

Reverted https://github.com/pytorch/pytorch/pull/152353 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail an inductor test in trunk ([comment](https://github.com/pytorch/pytorch/pull/152353#issuecomment-2863657185))
2025-05-08 16:39:28 +00:00
Jithun Nair
fe8ebacee4 [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-08 16:12:16 +00:00
PyTorch MergeBot
05326b7e49 Revert "Add runtime asserts to AOTI (#152125)"
This reverts commit 834bc5e414.

Reverted https://github.com/pytorch/pytorch/pull/152125 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/152125#issuecomment-2863554139))
2025-05-08 15:58:18 +00:00
Ke Wen
efa07df257 [c10d] Remove unordered PG destroy test (#153110)
torch.distributed does not support unordered ProcessGroup destroy. Removing the test.

Resolves #137507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153110
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-05-08 15:29:44 +00:00
Simon Fan
500cbeee4e [dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)
Together with https://github.com/pytorch/pytorch/pull/151962, FIXES https://github.com/pytorch/pytorch/issues/133575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152119
Approved by: https://github.com/jansel
ghstack dependencies: #151731, #151962
2025-05-08 15:12:16 +00:00
Simon Fan
6dea8ef555 [ca] hide unused scalar int sizes from dynamo (#151962)
together with https://github.com/pytorch/pytorch/pull/151731, FIXES https://github.com/pytorch/pytorch/issues/113129 https://github.com/pytorch/pytorch/issues/146168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151962
Approved by: https://github.com/jansel
ghstack dependencies: #151731
2025-05-08 15:12:16 +00:00
Simon Fan
8f380b239f [ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)
This is the only way to support dynamic shapes on scalars right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151731
Approved by: https://github.com/jansel
2025-05-08 15:12:08 +00:00
James Wu
4976b1a3a8 Keep raw cubin file around in case it gets deleted underneath us (#153064)
This diff hardens StaticCudaLauncher in the event a cubin file gets deleted under us. We store the raw cubin on the static cuda launcher, and reload it as needed. On cold start, this can happen if the cubin file is created by triton, and gets deleted before we can load the kernel on the parent process.

We don't want to store the entire cubin both on disk and in memory for caching purposes, so we delete it before caching the data. In the unfortunate (and unlikely) event that we can't load or find the necessary file on warm start, we skip the stored triton launcher and fall back to regular triton.

This comes at a cost to worker memory, but it's not more memory than regular triton workers already take, so it should be okay.
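
In rough pseudo-Python, the fallback described above looks like this (class, attribute, and method names are illustrative, not the real StaticCudaLauncher API):
```python
import os

class CachedCubin:
    """Illustrative sketch: keep the raw cubin bytes so the kernel can be
    re-materialized even if triton's cubin file is deleted underneath us."""

    def __init__(self, cubin_path: str):
        self.cubin_path = cubin_path
        with open(cubin_path, "rb") as f:
            self.cubin_bytes = f.read()   # survives deletion of the file on disk

    def ensure_on_disk(self, scratch_dir: str) -> str:
        if os.path.exists(self.cubin_path):
            return self.cubin_path
        # The file vanished: rewrite it from the cached bytes and point at the copy.
        new_path = os.path.join(scratch_dir, os.path.basename(self.cubin_path))
        with open(new_path, "wb") as f:
            f.write(self.cubin_bytes)
        self.cubin_path = new_path
        return new_path
```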

Tests:
- Make test_static_cuda_launcher always delete the cubin path and reload it

Fixes #153030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153064
Approved by: https://github.com/oulgen, https://github.com/jansel
2025-05-08 14:29:19 +00:00
rzou
2926dd4d8e Stop proxy-ing autograd.Function.ctx into the graph (#152621)
The reason we did this before is that this is how our older autograd.Function x Dynamo interaction worked, but we've since adopted newer designs that don't actually need the autograd.Function.ctx proxied into the graph.

We still need an fx.Proxy for the autograd.Function.ctx object, so whenever we do, I create one via discard_graph_changes.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152621
Approved by: https://github.com/oulgen
2025-05-08 13:32:54 +00:00
ignasa007
22c31046d4 Fixed rerr computation in lobpcg (#152789)
Fixes #101075

This PR fixes an issue with the computation of residuals in the LOBPCG algorithm.

**Bug**: [Line 788](8f54e56e62/torch/_lobpcg.py (L788)) is supposed to compute the denominator in Equation 9 of [Duersch et al., 2018](https://arxiv.org/abs/1704.07458), as also suggested in [line 776](8f54e56e62/torch/_lobpcg.py (L776)), but it uses the raw eigenvalue-estimates instead of their absolute values.

**Consequence**: This made the algorithm's success sensitive to initialization of eigenvectors.

**Tests**:
- I have tested @jtorde's [script](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1545349559), and I did NOT run into any assertion errors for a few minutes (as opposed to the original implementation, which fails after a few seconds).
- I have also tried @pearu's specific [test case](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1548483685), which also executes successfully - the residuals remain positive, and the final output is the same as one returned by SciPy (with and without enforcing the use of LOBPCG).
- I extracted the relevant test cases from [test/test_autograd.py](https://github.com/pytorch/pytorch/blob/main/test/test_autograd.py) and [test/test_linalg.py](https://github.com/pytorch/pytorch/blob/main/test/test_linalg.py), and they ran successfully.

Let me know if further test cases or benchmarks are needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152789
Approved by: https://github.com/pearu, https://github.com/lezcano
2025-05-08 12:22:31 +00:00
Animesh Jain
34d4363e6d [dynamo] Fix super and classmethod binding of cls object (#153105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153105
Approved by: https://github.com/jansel
ghstack dependencies: #152883
2025-05-08 12:07:08 +00:00
karthickai
9357635127 [inductor][dynamo] Include operator name in size/stride/alignment assertion (#152353)
Fixes #151930

This PR updates the `assert_size_stride` and `assert_alignment` functions in [guards.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp) to accept an optional `op_name` argument and include it in the error messages.

The corresponding type stubs in [guards.pyi](https://github.com/pytorch/pytorch/blob/main/torch/_C/_dynamo/guards.pyi) are updated to match the new function arg.

The change in [inductor/ir.py](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py) extracts the operator name from the FX graph and passes it into the `codegen_size_asserts` and `codegen_alignment_asserts` functions, so that the generated assertions in Triton code include the op name for better debugging.

Added unit tests inside [test_torchinductor.py](https://github.com/pytorch/pytorch/blob/main/test/inductor/test_torchinductor.py).
- Verified both successful and failing assertion cases include the operator name.
- Verified that generated Triton code contains the op name inside the asserts.
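
Roughly, the generated wrapper assertions look like this after the change (the buffer name and op name below are made up for illustration, and the trailing op-name argument is the new optional parameter described above):
```python
import torch
from torch._C._dynamo.guards import assert_size_stride

buf0 = torch.empty_strided((8, 16), (16, 1))
# The op name is echoed in the error message if the size/stride check fails.
assert_size_stride(buf0, (8, 16), (16, 1), "aten.mm.default")
```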

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152353
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-05-08 08:28:05 +00:00
zeshengzong
3c87529d23 Make device check error message more descriptive (#150750)
Fixes #122757

## Test Result

```python
import torch

model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
    loss = loss_fn(input=model_output, target=labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
    return F.cross_entropy(
           ^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
    return torch._C._nn.cross_entropy_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/malfet
2025-05-08 06:19:44 +00:00
rzou
94ca3a4666 Add torch._C.Tag.needs_contiguous_strides (#152859)
this forces inductor to force the inputs to be contiguous.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152859
Approved by: https://github.com/eellison
2025-05-08 04:49:59 +00:00
Menglu Yu
2d25e4d478 [1/n][Optimus][Auto-AC] Support activation quantization without scaling (#148380)
Summary: We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB  Down: 42MiB  (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining     0/4                                                                                 6.7s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### How to enable (you can override the dtype; if nothing is given, the default is fp8)

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
        },
```

Differential Revision: D70522237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380
Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803
2025-05-08 04:44:15 +00:00
Animesh Jain
6f6fac6a41 [dynamo] Fix bug in hasattr(tensor, "size") (#152883)
Fixes https://github.com/pytorch/pytorch/issues/135696
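
A minimal sketch of the kind of pattern involved (the exact failing case is in the linked issue):
```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    # hasattr on a tensor method such as .size should fold to True at trace time.
    if hasattr(x, "size"):
        return x + 1
    return x - 1

print(f(torch.randn(3)))
```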

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152883
Approved by: https://github.com/StrongerXi
2025-05-08 01:16:01 +00:00
Shangdi Yu
834bc5e414 Add runtime asserts to AOTI (#152125)
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925

Currently, AOTI only generate runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.

Also factored out the run time assertion logic to a separate function.

We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from input graphs because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into no-ops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we already know "oh, of course this expression is True".
One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, a, b, c):
                nz = torch.nonzero(a)
                ones = a.new_ones([nz.size(0), b.size(0)])
                torch._check(ones.size(0) >= 1)
                equals = torch.add(ones, c)
                return equals
        torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that verifies `a` and `nonzero` have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See test_unbacked_equals_input_size_runtime_assertion in test_aot_inductor.

In addition to the Inductor-generated runtime asserts, we also need the runtime asserts from the input graph, because some derived runtime asserts are not generated in Inductor. One example is below:
```
        class Model(torch.nn.Module):
            def forward(self, x):
                y = x.reshape(100, -1).clone()
                y = y + 1
                return y

        dynamic_shapes = {
            "x": {0: torch.export.Dim.DYNAMIC},
        }
        x.shape[0] needs to be a multiple of 100.
```
See test_aoti_runtime_asserts_backed_symint in test_aot_inductor.

Example:

```
    def forward(self):
        arg0_1: "f32[s35]";

        arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)

         #
        mod: "Sym(Mod(s35, 100))" = sym_size_int % 100;  sym_size_int = None
        eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0;  mod = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'");  eq_2 = _assert_scalar = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
        view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]);  arg0_1 = None
        clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view);  view = None

         # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
        add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1);  clone = None
        return (add_6,)
```

Generated cpp code:

```
    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
    auto arg0_1 = std::move(inputs[0]);
    auto arg0_1_size = arg0_1.sizes();
    int64_t s35 = arg0_1_size[0];
    inputs.clear();
    auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
    if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
```

Differential Revision: D73596786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152125
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
2025-05-08 00:27:24 +00:00
Michael Lazos
20e2ca3e29 [Dynamo] Allow inlining into AO quantization modules (#152934)
This adds dynamo inlining into `torch.ao.quantization.fake_quantize`.

This is needed for QAT compatibility with an RL training model.
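
A small sketch of what this enables (a default-constructed FakeQuantize is used here just for illustration):
```python
import torch
from torch.ao.quantization.fake_quantize import FakeQuantize

fq = FakeQuantize()   # default observer and quantization range

@torch.compile
def step(x):
    # With this change, dynamo can inline the FakeQuantize forward instead of
    # skipping the torch.ao.quantization module.
    return fq(x) * 2.0

print(step(torch.randn(8)))
```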

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152934
Approved by: https://github.com/williamwen42
2025-05-07 23:58:11 +00:00
Michael Lazos
6b9d741e1c [Cutlass] Handle broadcasting in EVT python codegen (#152733)
Previously merged:
* #151713
* #151405
* #150905
* #152306
* #152305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152733
Approved by: https://github.com/eellison
2025-05-07 23:09:02 +00:00
Meet Patel
4270517cbf Fix test/test_optim.py error message. (#153076)
Fixes an error message in test/test_optim.py

Current behavior: If running the test with Adagrad, the error message reads: "SGD does not currently support capturable".

Fix: The error message now says correctly: "Adagrad does not currently support capturable".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153076
Approved by: https://github.com/janeyx99
2025-05-07 22:46:05 +00:00
Eddie Yan
cecfc7dc53 [CUDA][cuDNN] Fix handling of CPU side input and target length tensors in CTCLoss (#152745)
https://github.com/pytorch/pytorch/pull/128271 migrated to the cuDNN V8 CTCLoss, which expects input and target length tensors to be on `CUDA` rather than `CPU`, without adding logic to account for the edge case of them being on `CPU`.

see also #152421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152745
Approved by: https://github.com/Skylion007
2025-05-07 22:01:18 +00:00
Ti-Tai Wang
773a91c775 [ONNX] dynamic_shapes uses DYNAMIC (#153065)
Although Dim.AUTO covers the case where a user marks more axes as dynamic than the model actually needs, it silently falls back to STATIC when DYNAMIC fails. This makes debugging harder.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153065
Approved by: https://github.com/justinchuby
2025-05-07 21:48:41 +00:00
Zhengxu Chen
5bb154e6fd [nativert] Move MPMCQueue to torch/nativert. (#152837)
Summary:
Torch Native Runtime RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md

To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.

This diff adds a small library implementing a multi-producer, multi-consumer queue which will be used to synchronize tasks for the Torch Native Runtime.

Differential Revision: D74184245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152837
Approved by: https://github.com/albanD, https://github.com/dolpm, https://github.com/swolchok
2025-05-07 21:17:42 +00:00
Paul Zhang
d2ee606e9b [Inductor] Set correct baseline for decomposek test (#152897)
Differential Revision: D74218923

Running on A100 seems to result in precision loss from decompose_k. This was root caused to the fp16/bf16 reduction setting, which establishes a less precise baseline than decompose_k, as decompose_k uses the bmm.dtype overload for fp32 output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152897
Approved by: https://github.com/eellison
2025-05-07 21:02:47 +00:00
henrylhtsang
dd7d231ed3 [cutlass backend][test] re-enable test_cuda_compile_command for fbcode (#153001)
Differential Revision: [D74284047](https://our.internmc.facebook.com/intern/diff/D74284047/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153001
Approved by: https://github.com/ColinPeppler
2025-05-07 19:06:24 +00:00
Joel Schlosser
d9b8473b59 [Dynamo] Guard serialization for RANGE_ITERATOR_MATCH (#152872)
Tests serialization for RANGE_ITERATOR_MATCH; includes no non-test changes.

This PR handles iterator exhaustion issues by utilizing the janky solution from #152865: it passes a function to generate kwargs, and `frame_state.f_locals` is updated with fresh iterators through a second kwarg-generation pass after initial tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152872
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865
2025-05-07 18:58:18 +00:00
Joel Schlosser
52f7106c00 [Dynamo] Guard serialization for TUPLE_ITERATOR_LEN (#152865)
Tests serialization for TUPLE_ITERATOR_LEN; includes no non-test changes.

Passing a tuple iterator as input results in the iterator being exhausted during testing. I threw together a super janky workaround: accept a func for kwarg generation and replace `frame_state.f_locals` with newly generated kwargs to get fresh iterators (see the sketch below). Insights into a better approach are welcome!
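
A minimal, hypothetical sketch of that kwarg-regeneration workaround (the helper names are illustrative, not the actual test code):

```
def make_kwargs():
    # build kwargs from scratch so every pass gets a fresh, unexhausted iterator
    return {"it": iter((10, 20, 30))}

def fn(it):
    return sum(it)

kwargs = make_kwargs()
first = fn(**kwargs)        # e.g. the initial (traced) run consumes the iterator

kwargs = make_kwargs()      # regenerate instead of reusing the exhausted iterator
second = fn(**kwargs)       # e.g. the comparison run
assert first == second
```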

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152865
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730
2025-05-07 18:58:18 +00:00
Joel Schlosser
fb500d0b1c [Dynamo] Guard serialization for SEQUENCE_LENGTH (#152730)
Tests only; no other changes needed. Test logic uses a tuple function input to trigger installation of a SEQUENCE_LENGTH guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152730
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728
2025-05-07 18:58:18 +00:00
Joel Schlosser
42954ab28e [Dynamo] Guard serialization for CLOSURE_MATCH (#152728)
Unsupported because it uses unsupported FUNCTION_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152728
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727
2025-05-07 18:58:18 +00:00
Joel Schlosser
a9186ec723 [Dynamo] Guard serialization for FUNCTION_MATCH (#152727)
Unsupported because it uses unsupported ID_MATCH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152727
Approved by: https://github.com/jansel
ghstack dependencies: #152725
2025-05-07 18:58:18 +00:00
Joel Schlosser
a6f51be2fd [Dynamo] Guard serialization for NN_MODULE (#152725)
Throws an error when attempting to serialize an NN_MODULE guard. It is not supported because it uses the unsupported ID_MATCH guard (#152330):

a6dd1c2208/torch/_dynamo/guards.py (L1738-L1739)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152725
Approved by: https://github.com/jansel
2025-05-07 18:58:17 +00:00
eqy
172e641529 [CUDA] Reset peak memory stats before running test_set_per_process_memory_fraction (#152540)
Otherwise previous tests can cause `application = int(total_memory * 0.499) - torch.cuda.max_memory_reserved()` to go negative

Hopefully abates current flakiness (see also https://github.com/pytorch/pytorch/issues/135115#:~:text=TestCuda.test_set_per_process_memory_fraction)
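
For context, a hedged sketch of the kind of reset the test now performs before computing the fraction (not the actual test code; the `0.499` factor is taken from the expression above):

```
import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()  # clear peaks left over from earlier tests

total_memory = torch.cuda.get_device_properties(0).total_memory
# without the reset, max_memory_reserved() can exceed total_memory * 0.499
application = int(total_memory * 0.499) - torch.cuda.max_memory_reserved()
assert application >= 0
```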

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152540
Approved by: https://github.com/Skylion007
2025-05-07 17:02:39 +00:00
henrylhtsang
8b9c9a327f [cutlass backend] cache filtered ops based on layouts (#152580)
Differential Revision: [D73972687](https://our.internmc.facebook.com/intern/diff/D73972687/)

Add cache to store the list of filtered ops for a specific shape + layout + dtype (aka hash on input_nodes).
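
A rough, hypothetical sketch of what such a cache looks like (names and key shape are illustrative; the real cache lives in the cutlass backend and keys on the input nodes):

```
_filtered_ops_cache: dict[tuple, list] = {}

def filtered_ops(input_nodes, all_ops, keep):
    # key on shape + stride (layout) + dtype of every input node
    key = tuple(
        (tuple(n.get_size()), tuple(n.get_stride()), n.get_dtype())
        for n in input_nodes
    )
    if key not in _filtered_ops_cache:
        _filtered_ops_cache[key] = [op for op in all_ops if keep(op, input_nodes)]
    return _filtered_ops_cache[key]
```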

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152580
Approved by: https://github.com/eellison
2025-05-07 16:38:22 +00:00
xinan.lin
56879f64a8 [Break XPU] Fix XPU UT failures introduced by community. (#152945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152945
Approved by: https://github.com/Skylion007, https://github.com/EikanWang
2025-05-07 08:01:31 +00:00
Guilherme Leobas
ae1e51b6ad Add infra to run CPython tests under Dynamo (#150787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150787
Approved by: https://github.com/zou3519
2025-05-07 04:03:14 +00:00
Animesh Jain
ecd74c953f [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-07 02:36:44 +00:00
Colin Peppler
81b6920c68 [aoti] skip input symbol codegen for sympy expr w/ many symbols (#152579)
Issue was that
- symbol-ids appeared out of order w.r.t. the order of the forward inputs
```
def forward(arg0 # [(s3 - 1) + s4, 32], arg1 #[(s3 - 1)] ..)
```
- this causes codegen to fail because it expects all the base symbols `s3`, `s4` to have been codegen-ed already.
- however, we can skip codegen-ing a sympy expression with multiple symbols, e.g. `(s3 - 1) + s4`, because `s3` and `s4` will be codegen-ed from other inputs.

```
# for example
s3 = arg1.size(0) + 1
s4 = argN.size(0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152579
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-05-07 01:18:09 +00:00
PyTorch MergeBot
a28dcdba2c Revert "[aot][ca] save bw_module in AOTAutogradCache (#151860)"
This reverts commit 613bd46272.

Reverted https://github.com/pytorch/pytorch/pull/151860 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
PyTorch MergeBot
f6db749e60 Revert "[ca] mark scalar int sizes as dynamic via tensor wrapping (#151731)"
This reverts commit 18229a5300.

Reverted https://github.com/pytorch/pytorch/pull/151731 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:54 +00:00
PyTorch MergeBot
8f208dc75a Revert "[ca] hide unused scalar int sizes from dynamo (#151962)"
This reverts commit 4555ed8c83.

Reverted https://github.com/pytorch/pytorch/pull/151962 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:53 +00:00
PyTorch MergeBot
64bbf58fb4 Revert "[dynamo][ca] support dynamic annotations on tensors in ListVariables/TupleVariables (#152119)"
This reverts commit 7aebb127bf.

Reverted https://github.com/pytorch/pytorch/pull/152119 on behalf of https://github.com/huydhn due to Chatting with @xmfan and decide to revert and reland this instead ([comment](https://github.com/pytorch/pytorch/pull/151860#issuecomment-2856709646))
2025-05-07 00:56:53 +00:00
Isalia20
56492bfcb9 [MPS] SDPA specialized kernels (#152781)
Partially fixes #139668 and #152550

Still a work in progress. The following needs to be addressed:
- [x] Some tests are failing; need to check why and fix them
- [x] Benchmark the new kernels for varying sequence lengths and head dimensions (the ones that get dispatched to the kernels) and add the results to this PR
- [x] Add tests to cover the specialized paths (if applicable)
- [x] Code cleanup

**Tested on Macbook M1 Pro**
### Vector Fast Path (q_len=1, k_len=256)
- Old: 0.378 ms
- New: 0.260 ms
- **31.2% speed improvement**

### Vector 2-pass (q_len=1, k_len=4096)
- Old: 0.627 ms
- New: 0.370 ms
- **41.0% speed improvement**

### Vector Fast Path (q_len=8, k_len=256)
- Old: 0.545 ms
- New: 0.322 ms
- **40.9% speed improvement**

### Vector 2-pass (q_len=8, k_len=4096)
- Old: 1.318 ms
- New: 1.057 ms
- **19.8% speed improvement**

Script to get perf:
```
import torch
import time

def benchmark_sdpa(config, iterations=100):
    device = config.get("device", "cpu")
    batch = config["batch"]
    heads = config["heads"]
    q_len = config["q_len"]
    k_len = config["k_len"]
    head_dim = config["head_dim"]

    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=torch.float32)
    k = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)
    v = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)

    for _ in range(5):
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()

    total_time = 0.0
    for i in range(iterations):
        start = time.perf_counter()
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()
        end = time.perf_counter()
        total_time += end - start

    avg_time = total_time / iterations
    print(f"[{config['name']}] Avg time per run: {avg_time * 1000:.3f} ms over {iterations} iterations")
    return avg_time

def main():
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Running benchmarks on device: {device}")

    benchmarks = [
        {
            "name": "Vector Fast - Small q_len & moderate k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length triggers vector fast path
            "k_len": 256,    # moderate key length
            "head_dim": 64,
            "device": device,
        },
        {
            "name": "Vector 2-pass - Small q_len & long k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length
            "k_len": 4096,   # long key length triggers the 2-pass variant
            "head_dim": 64,
            "device": device,
        },
        # {
        #     "name": "Full Attention - Moderate q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # longer query sequence length
        #     "k_len": 8192,    # matching key length for full attention paths
        #     "head_dim": 64,
        #     "device": device,
        # },
        # {
        #     "name": "Full Attention - Longer q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # very long sequence length
        #     "k_len": 8192,
        #     "head_dim": 64,
        #     "device": device,
        # },
    ]

    iterations = 100
    for config in benchmarks:
        benchmark_sdpa(config, iterations=iterations)

if __name__ == "__main__":
    main()

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152781
Approved by: https://github.com/malfet
2025-05-07 00:40:11 +00:00
Joel Schlosser
2b2b790908 [Dynamo] Guard serialization for CONSTANT_MATCH (#152724)
This PR adds testing only; no non-test changes were needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152724
Approved by: https://github.com/jansel
ghstack dependencies: #152704
2025-05-07 00:36:39 +00:00
Joel Schlosser
6d1e8994d3 [Dynamo] Guard serialization for EQUALS_MATCH (#152704)
This PR:
* Makes no changes to non-test code to support serialization for EQUALS_MATCH
* Adds test logic involving a custom-defined constant type to trigger the guard installation here:

72337bdcf2/torch/_dynamo/variables/user_defined.py (L792)

Q: Is there a better way to trigger installation of this guard or is this sufficient?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152704
Approved by: https://github.com/jansel
2025-05-07 00:28:31 +00:00
henrylhtsang
61aa77e216 [cutlass backend][BE][clean-up] refactor to remove use of autotune_fallback_to_aten=True in cutlass backend tests (#152850)
Differential Revision: [D74192001](https://our.internmc.facebook.com/intern/diff/D74192001/)

Motivation: clean up post https://github.com/pytorch/pytorch/issues/147479. I plan to leave the rest of the clean-up as a first-time issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152850
Approved by: https://github.com/chenyang78
2025-05-06 23:48:57 +00:00
Ti-Tai Wang
5fa5017479 [ONNX] Suggest users setting dynamo=True when exporting (#152478)
Fixes #152025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152478
Approved by: https://github.com/justinchuby
2025-05-06 23:18:11 +00:00
Ryan Guo
6c025b5a82 [dynamo] Support delattr on result of torch.compile(module) (#152741)
This is essentially a follow-up on #122098, where we added support of
`getattr` and `setattr` on result of `torch.compile(module)`, but didn't
add support for `delattr`.

Fixes #150711.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152741
Approved by: https://github.com/anijain2305
ghstack dependencies: #152740
2025-05-06 22:30:37 +00:00
Ryan Guo
0886d402f1 [dynamo] Avoid running torch.nn.Module.__call__ twice under torch.compile(mod) (#152740)
When we do `torch.compile(mod)`, we eventually end up returning a new
module instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) from the default `torch.nn.Module.__call__`.
As a result we can't reuse the inherited default `__call__` as is,
because we'd end up running the logic twice.

This patch makes the returned `OptimizedModule` override the default
`__call__`, and directly calls into its compiled `forward` method.
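
A simplified sketch of the idea (assumed names; not the actual `OptimizedModule` implementation):

```
import torch

class OptimizedModuleSketch(torch.nn.Module):
    def __init__(self, mod: torch.nn.Module):
        super().__init__()
        self._orig_mod = mod
        # compiled_call already wraps nn.Module.__call__ (hooks included)
        self.compiled_call = torch.compile(mod.__call__)

    def __call__(self, *args, **kwargs):
        # bypass the inherited nn.Module.__call__ so the hook logic
        # is not run a second time on top of the compiled __call__
        return self.compiled_call(*args, **kwargs)

    def forward(self, *args, **kwargs):
        return self.compiled_call(*args, **kwargs)
```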

Fixes #149502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305
2025-05-06 22:30:37 +00:00
Krishna Bindumadhavan
08f5371571 [float16]: Fix the accumulation type for dot and gemv (#152676)
Fixes #147860

Also, partially address: https://github.com/pytorch/pytorch/issues/125438

Use float32 for accumulation with float16 and bfloat16 types
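
A toy illustration of why the accumulation dtype matters (the numbers and manual upcast are illustrative; the fixed kernels handle this internally):

```
import torch

x = torch.full((10000,), 0.1, dtype=torch.float16)

# a running sum kept in fp16 accumulates rounding error...
acc16 = torch.zeros((), dtype=torch.float16)
for chunk in x.split(100):
    acc16 = acc16 + chunk.sum()

# ...while accumulating in fp32 stays close to the true value
acc32 = x.to(torch.float32).sum()
print(acc16.item(), acc32.item())
```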

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152676
Approved by: https://github.com/malfet
2025-05-06 18:10:08 +00:00
Boyuan Feng
7dd9d514d2 [Graph Partition] remove PRECOMPUTED_SIZE from partition symbol inputs (#152864)
PRECOMPUTED_SIZE is computed at runtime and should not be included in graph_partition_inputs. See the following example with a PRECOMPUTED_SIZE `ps0`.

![image](https://github.com/user-attachments/assets/5aa949a9-b8e0-4b77-8702-95b96b58694e)

full output code: [P1803820480](https://www.internalfb.com/phabricator/paste/view/P1803820480)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152864
Approved by: https://github.com/eellison
2025-05-06 17:35:29 +00:00
Ke Wen
a8f727c439 [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses the `device_id` given by the user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.
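
For illustration, a hedged sketch of the user-side call that supplies the device index (`LOCAL_RANK` is assumed to be set by the launcher, e.g. torchrun):

```
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# passing device_id lets collectives like barrier() build their internal
# dummy tensor on the right device instead of defaulting to device 0
dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))
dist.barrier()
dist.destroy_process_group()
```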

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-06 15:27:30 +00:00
PyTorch MergeBot
fcd5e49138 Revert "[dynamo] Recursively realize the stack_values (#152853)"
This reverts commit 460888f908.

Reverted https://github.com/pytorch/pytorch/pull/152853 on behalf of https://github.com/malfet due to Looks like it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/152853#issuecomment-2854897485))
2025-05-06 15:02:57 +00:00
Joel Schlosser
b06cbd49f1 [Dynamo] Guard serialization for TENSOR_SUBCLASS_METADATA_MATCH (#152626)
This PR updates `GuardsStatePickler.reducer_override()` in `torch/_dynamo/guards.py` to handle reconstruction of traceable wrapper subclasses. It's intended to work recursively and handle any level of subclass instance nesting (e.g. subclass instances that contain subclass instances, etc.)

This PR tests the guard on several traceable wrapper tensor subclasses:
* `LocalSubclass`: used to ensure the correct error message is thrown when the subclass is not defined globally
* `torch.testing._internal.two_tensor.TwoTensor`: defines None for its extra metadata
* `SubclassWithMeta`: stores non-trivial extra metadata
* `SubclassWithCustomMetadataGuard`: stores non-trivial extra metadata and defines a custom `__metadata_guard__` classmethod
* `SubclassWithSubclassInnerTensors`: used to test recursiveness; this subclass contains subclass inner tensor components

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152626
Approved by: https://github.com/jansel
2025-05-06 14:06:36 +00:00
Blaine Burton Rister
bc11afd41f [Inductor] FX backend via Wrapper IR (#146942)
# Sub-PRs

These PRs contain refactors from the main one. They should be reviewed and merged first.

- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587

# Feature

The goals of this PR are twofold.

## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.

In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.

This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.

## Goal 2: Convert Wrapper IR into FX IR.

One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.

It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.

The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.

Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.

# Current status

Things that seem to work:

- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
     - Handled the following cases, in addition to what was already in the Memory Planning IR:
         - Comments
         - Triton kernels
         - Extern/fallback kernels
         - Freeing tensors (`del buf0`)
         - MultiOutput
         - Graph outputs
         - ReinterpretView / StorageBox, for both call args and outputs.
     - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
   - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
   - Calling wrapped Triton kernels.
   - Calling extern kernels and certain types of fallback kernels.
       - Support both `extern_kernels.*` and `aten.*`.
       - Support multi-output kernels like `torch.topk`.
   - Graphs with multiple inputs/outputs.
   - Training i.e. calling `Tensor.backward()` in a compiled function.
   - Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.

Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
         - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
         - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
         - CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.

# Out-of-tree compilers

With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.

```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen

class MyCustomBackend(WrapperFxCodegen):
     def compile_graph(self, gm):
         # Add 1 to the graph's outputs
         def compiled_fn(*args):
             return [x + 1 for x in gm(*args)]
         return compiled_fn
```

# Example FX graphs

This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.

Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    return (buf0,)
```

Here's a more complicated graph that calls a `torch.addmm` extern kernel.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
    %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    return (buf2,)
```

Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
    %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
    %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
    return (buf1, buf2)
```

Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
    return (buf0_view_buf0_0,)
```

Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
    %s6 : [num_users=0] = placeholder[target=s6]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
    return buf0
```

Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
    %s10 : [num_users=0] = placeholder[target=s10]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
    return buf0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
2025-05-06 10:06:39 +00:00
Yu, Guangye
e32a16a9da Correct torch.xpu.is_bf16_supported return False if no XPU detected (#152317)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/152301
When XPU is not available, calling `torch.xpu.is_bf16_supported()` still returns `True`, which is inconsistent with the expected behavior (should be False).

# Solution
Align with other backends by adding `including_emulation` to `torch.xpu.is_bf16_supported`, so that it will (see the sketch after this list):
- return `False` if XPU is not available
- return `True` if `including_emulation` is True
- return `torch.xpu.get_device_properties().has_bfloat16_conversions` if `including_emulation` is False, i.e. whether the device can generate SPIR-V code for bf16.
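
A minimal sketch of the described logic (assuming the `has_bfloat16_conversions` property named above):

```
import torch

def xpu_is_bf16_supported(including_emulation: bool = True) -> bool:
    if not torch.xpu.is_available():
        return False  # no device -> report unsupported instead of True
    if including_emulation:
        return True
    # whether the device can generate SPIR-V code for bf16
    return torch.xpu.get_device_properties().has_bfloat16_conversions
```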

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152317
Approved by: https://github.com/EikanWang
2025-05-06 10:03:17 +00:00
Animesh Jain
460888f908 [dynamo] Recursively realize the stack_values (#152853)
Might also fix - https://github.com/pytorch/pytorch/issues/135696

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152853
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos, https://github.com/jansel
2025-05-06 06:30:31 +00:00
Animesh Jain
97dfd8dd53 [invoke_subgraph] Run missing graph passes recursively (#152675)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152675
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #152772, #152770
2025-05-06 02:55:34 +00:00
Animesh Jain
cc254eaa7c [inductor][refactor] Refactor the fetching of subgraph names (#152770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152770
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #152772
2025-05-06 02:55:34 +00:00
Animesh Jain
b1d34acac5 [fx] Recursive DCE on subgraphs (#152772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152772
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2025-05-06 02:55:34 +00:00
Yu, Guangye
35c727e7ff Fix typo on test_multi_device_context_manager for XPU (#152812)
# Motivation
Align https://github.com/pytorch/pytorch/pull/152474, fix the typo on UT for XPU introduced by https://github.com/pytorch/pytorch/issues/148864
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152812
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
2025-05-06 02:51:19 +00:00
Yiming Zhou
1d7728056b [nativert] Move TensorMeta to pytorch core (#152475)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/`

Existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore only contains the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into an in-memory class with c10 types so that it can be consumed by the runtime later.

Test Plan:

Added test under `test/cpp/nativert/test_tensor_meta.cpp`

Differential Revision: D73820548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475
Approved by: https://github.com/albanD
2025-05-06 01:50:46 +00:00
David Berard
f097e83369 [inductor][retry] Realize bucketize/searchsorted output (#152858)
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:

https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554

shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.

**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.

After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
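
A rough illustration of the broadcast pattern that made fusion unprofitable (shapes are illustrative, not the repro from the gist above):

```
import torch

boundaries = torch.linspace(0, 1, 1024, device="cuda")
x = torch.rand(2**15, 1, device="cuda")

idx = torch.bucketize(x, boundaries)      # 2**15 binary searches, now realized
weights = torch.rand(2**15, 384, device="cuda")
out = weights * idx                       # broadcasted consumer; if bucketize were
                                          # fused in here it would be recomputed per column
```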

**Retry**
Original PR (https://github.com/pytorch/pytorch/pull/152644) was reverted due to internal bisect blaming this change, but the bisect was a false positive (and is marked as such)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152858
Approved by: https://github.com/aakhundov
2025-05-06 01:32:26 +00:00
drisspg
14f8066910 Ensure mxfp8 scaled_mm works w/ max-autotune (#152744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152744
Approved by: https://github.com/Skylion007
2025-05-06 01:16:57 +00:00
Jane Xu
4979ca5ffa Synchronize in foreach tests after profiling (#152857)
After the CI change from 12.4 -> 12.6 around mid-March, the foreach tests have been flaky and hard to repro due to nondeterminism. Per @davidberard98's suggestion, let's try to add a synchronize before checking profiler results to see whether this fixes the flake! The hope is that the 48 currently open foreach flaky issues will close from this change.
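
For illustration, a hedged sketch of the pattern (synchronize between running the profiled foreach op and inspecting the events; not the actual test code):

```
import torch
from torch.profiler import profile, ProfilerActivity

tensors = [torch.ones(1024, device="cuda") for _ in range(8)]
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch._foreach_add_(tensors, 1.0)

torch.cuda.synchronize()  # make sure all kernel events have been collected
print(prof.key_averages().table(row_limit=10))
```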

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152857
Approved by: https://github.com/davidberard98
2025-05-06 00:56:48 +00:00
Pian Pawakapan
13dcf80a53 [dynamic shapes] use try-catch instead of guard_or_true for reshape_view_helper (#152638)
Test Plan: test_export

Differential Revision: D74033649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152638
Approved by: https://github.com/laithsakka
2025-05-06 00:54:24 +00:00
PyTorch MergeBot
103fe856e1 Revert "Add infra to run CPython tests under Dynamo (#150787)"
This reverts commit 7c96dd8f0c.

Reverted https://github.com/pytorch/pytorch/pull/150787 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a failed test is showing up in trunk ([comment](https://github.com/pytorch/pytorch/pull/150787#issuecomment-2852818113))
2025-05-06 00:20:02 +00:00
PyTorch MergeBot
cc954848d4 Revert "[c10d] Fix extra CUDA context created by barrier (#149144)"
This reverts commit 457fa820ad.

Reverted https://github.com/pytorch/pytorch/pull/149144 on behalf of https://github.com/huydhn due to Internal failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/149144#issuecomment-2852564660))
2025-05-05 22:56:50 +00:00
Felix Su
2ce6d169fc [IR] Input Adapter refactor prototype (#152459) (#152575)
Summary:

1. Add an `input` field to the `_adapt_flat_args` function
2. In `process_forward_inputs`, `reorder_kwargs` will now do nothing if no kwargs are provided (previously it would error)
3. Pass `args` as input to `_adapt_flat_args`

These changes are made to update the InputAdapter.

see more context in D73811508

Test Plan: see D73811508

Differential Revision: D73945419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152575
Approved by: https://github.com/angelayi
2025-05-05 22:51:58 +00:00
albanD
22d1359bc6 Move warning from item to specific number conversions (#152709)
Follow-up to https://github.com/pytorch/pytorch/pull/143261 so that a plain `.item()` call does not warn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152709
Approved by: https://github.com/malfet, https://github.com/ngimel
2025-05-05 20:46:05 +00:00
PyTorch MergeBot
99dac7005f Revert "[Inductor] FX backend via Wrapper IR (#146942)"
This reverts commit a7691140a0.

Reverted https://github.com/pytorch/pytorch/pull/146942 on behalf of https://github.com/malfet due to Looks like it indeed breaks lint, see a7691140a0/1 ([comment](https://github.com/pytorch/pytorch/pull/146942#issuecomment-2852192778))
2025-05-05 20:01:29 +00:00
Blaine Burton Rister
a7691140a0 [Inductor] FX backend via Wrapper IR (#146942)
# Sub-PRs

These PRs contain refactors from the main one. They should be reviewed and merged first.

- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587

# Feature

The goals of this PR are twofold.

## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.

In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.

This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.

## Goal 2: Convert Wrapper IR into FX IR.

One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.

It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.

The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.

Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.

# Current status

Things that seem to work:

- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
     - Handled the following cases, in addition to what was already in the Memory Planning IR:
         - Comments
         - Triton kernels
         - Extern/fallback kernels
         - Freeing tensors (`del buf0`)
         - MultiOutput
         - Graph outputs
         - ReinterpretView / StorageBox, for both call args and outputs.
     - FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
   - Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
   - Calling wrapped Triton kernels.
   - Calling extern kernels and certain types of fallback kernels.
       - Support both `extern_kernels.*` and `aten.*`.
       - Support multi-output kernels like `torch.topk`.
   - Graphs with multiple inputs/outputs.
   - Training i.e. calling `Tensor.backward()` in a compiled function.
   - Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.

Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
         - Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
         - Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
         - CommBuffer's and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.

# Out-of-tree compilers

With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.

```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen

class MyCustomBackend(WrapperFxCodegen):
     def compile_graph(self, gm):
         # Add 1 to the graph's outputs
         def compiled_fn(*args):
             return [x + 1 for x in gm(*args)]
         return compiled_fn
```

# Example FX graphs

This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.

Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    return (buf0,)
```

Here's a more complicated graph that calls a `torch.addmm` extern kernel.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
    %buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    return (buf2,)
```

Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
    %buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
    %buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
    %delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
    %triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
    return (buf1, buf2)
```

Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.

```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
    %buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
    return (buf0_view_buf0_0,)
```

Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
    %s6 : [num_users=0] = placeholder[target=s6]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
    return buf0
```

Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
    %s10 : [num_users=0] = placeholder[target=s10]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
    %triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
    return buf0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
2025-05-05 19:34:49 +00:00
Ryan Guo
51e77f3b30 [dynamo] replace unimplemented with unimplemented_v2 in variables/torch_functions.py (#151278)
This addresses part of #147913.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151278
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
ghstack dependencies: #151277
2025-05-05 18:45:40 +00:00
Ryan Guo
9e24f9b523 [dynamo] replace unimplemented with unimplemented_v2 in variables/functions.py (#151277)
This addresses part of #147913.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151277
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-05-05 18:45:40 +00:00
zhxchen17
fd6d4a6a24 [dynamo] Guard serialization for DICT_KEYS_MATCH (#152723)
DICT_KEYS_MATCH

Differential Revision: [D74091886](https://our.internmc.facebook.com/intern/diff/D74091886/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152723
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687, #152716, #152721
2025-05-05 18:05:56 +00:00
zhxchen17
2da9ab4b1c [dynamo] Guard serialization for MAPPING_KEYS_CHECK (#152721)
MappingProxyType

Differential Revision: [D74091363](https://our.internmc.facebook.com/intern/diff/D74091363/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152721
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687, #152716
2025-05-05 18:05:56 +00:00
zhxchen17
24e1666b3a [dynamo] Guard serialization for WEAKREF_ALIVE (#152716)
Punt on WEAKREF_ALIVE, as weakrefs won't live across processes and users might need to drop them upfront.

Differential Revision: [D74088735](https://our.internmc.facebook.com/intern/diff/D74088735/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152716
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616, #152687
2025-05-05 18:05:56 +00:00
zhxchen17
2cb16df6e2 [dynamo] Guard serialization for DUPLICATE_INPUT. (#152687)
This guard does not seem to be exercised very often; add a test to at least cover its error handling.

Differential Revision: [D74074837](https://our.internmc.facebook.com/intern/diff/D74074837/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152687
Approved by: https://github.com/jansel
ghstack dependencies: #152615, #152616
2025-05-05 18:05:56 +00:00
zhxchen17
ffd58293f7 [dynamo] Guard serialization for FUNCTORCH_STACK_MATCH (#152616)
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.

## Test Cases:

0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested functionalize
5. torch.compile() nested in vmap + grad

Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
2025-05-05 18:05:56 +00:00
zhxchen17
1d1cbcd8a3 [dynamo] Guard serialization for DUAL LEVEL. (#152615)
It seems the dual level counter should be stored in OutputGraph so that the value can be preserved through round-tripping.

Differential Revision: [D74008786](https://our.internmc.facebook.com/intern/diff/D74008786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152615
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-05-05 18:05:56 +00:00
Alexander Grund
99287b170b Generate test reports for pytest when option is given (#152170)
The argument needs to be appended when test reports should be generated. IS_CI is not necessarily set, so check TEST_SAVE_XML instead, as is done in other places where test reports are conditionally enabled.

See also https://github.com/pytorch/pytorch/issues/126523
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152170
Approved by: https://github.com/Skylion007
2025-05-05 17:46:40 +00:00
Brian Hirsh
131da0a982 Add a test for AsyncCollectiveTensor handling for maybe-view ops (#152688)
We never added a proper test for the fix from https://github.com/pytorch/pytorch/pull/134661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152688
Approved by: https://github.com/kwen2501
ghstack dependencies: #152195
2025-05-05 17:21:00 +00:00
Brian Hirsh
5abe74857a SAC: fix recompute tag propagation for ops with list[tensor] inputs (#152195)
There's an "are we compiling" check in SAC, which we rely on to know when to propagate recompute tags during tracing.

This check was a bit brittle, and missed cases where input ops accept lists of tensors - I updated it to check if a `FunctionalTensorMode` is active, which should be a 100% reliable way to know if AOTDispatcher is in the middle of running.
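
A hedged sketch of such a check (the module paths below are assumptions about where these utilities currently live, not a quote of the patch):

```
from torch._subclasses.functional_tensor import FunctionalTensorMode
from torch.utils._python_dispatch import _get_current_dispatch_mode_stack

def aot_dispatcher_is_tracing() -> bool:
    # AOTDispatcher runs with a FunctionalTensorMode pushed onto the mode stack,
    # so its presence is a reliable "are we compiling" signal for SAC
    return any(
        isinstance(mode, FunctionalTensorMode)
        for mode in _get_current_dispatch_mode_stack()
    )
```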

There is a long-standing followup here around unifying `torch.compiler.is_compiling()` to work in all cases. We should probably just update it to always check if FakeMode/FunctionalMode are active and use it there. This has a bit of BC risk though so I opted for the more local fix to SAC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152195
Approved by: https://github.com/soulitzer
2025-05-05 17:21:00 +00:00
Guilherme Leobas
7c96dd8f0c Add infra to run CPython tests under Dynamo (#150787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150787
Approved by: https://github.com/zou3519
2025-05-05 17:20:14 +00:00
Ke Wen
7a2df6a00b [PGNCCL] Add FP8 support (#152706)
NCCL added support for `Float8e4m3` and `Float8e5m2` in 2.24.

NVIDIA GPUs do not seem to support the following "no negative zero" variants: `Float8_e4m3fnuz` and `Float8_e5m2fnuz` (see https://onnx.ai/onnx/technical/float8.html), so we continue to error out for these two when a reduction op is requested.
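
A hedged sketch of what is now allowed (assumes an already-initialized NCCL process group, a GPU bound to this rank, and NCCL >= 2.24):

```
import torch
import torch.distributed as dist

def fp8_allreduce_demo():
    # float8_e4m3fnuz / float8_e5m2fnuz would still raise for reduction ops.
    t = torch.randn(8, device="cuda").to(torch.float8_e4m3fn)
    dist.all_reduce(t)  # newly dispatches to NCCL's Float8e4m3 support
    return t
```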

Test plan:
- test_allreduce_float8
- test_reduce_scatter_float8

Resolves #148344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152706
Approved by: https://github.com/d4l3k, https://github.com/eqy, https://github.com/fduwjj, https://github.com/cyyever
2025-05-05 16:02:27 +00:00
Nikita Shulga
fe36d7dc44 [MPSInductor] Fix truncdiv implementation (#152788)
For integral dtypes it should just be an alias for division.

Fixes `GPUTests.test_div7_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152788
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #152663, #152515, #152737, #152743, #152758
2025-05-05 13:31:51 +00:00
Isuru Fernando
0a470dc7c1 [inductor] fix lowering for cummin, cummax for one element tensors (#151931)
Fixes https://github.com/pytorch/pytorch/issues/151738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151931
Approved by: https://github.com/eellison
2025-05-05 13:05:59 +00:00
Nikita Shulga
0ffd31dc8a [MPS] Migrate div rounding modes (#152758)
By implementing `div_floor` and `div_trunc`. `div_trunc` is not marked as OPMATH, to align the following output with CPU (if the division were performed in fp32, the result would be truncated to 25):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24.,  3.]])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
2025-05-05 03:02:29 +00:00
James Wu
93d8f6ee32 [reland] Detailed triton kernel logging (#152694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152694
Approved by: https://github.com/Skylion007
2025-05-05 02:46:57 +00:00
Dharak Kharod
a78eec88b8 Implement util function compute_global_tensor_shape for 1D device mesh (#152751)
### Summary

Recreating #151990 to mitigate easyCLA failure

The compute_global_tensor_shape util function takes in the local tensor shape, the device mesh,
and the placements. We all-gather the shapes from the shards and construct the global shape
according to the placement type.

Note: currently only implemented for the placement types Shard and Replicate; StridedShard is a TODO.
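
A rough sketch of the idea for a 1D mesh (my own illustration, not the PR's implementation):

```
import torch
import torch.distributed as dist
from torch.distributed.tensor import Replicate, Shard

def compute_global_tensor_shape_sketch(local_shape, device_mesh, placements):
    placement = placements[0]  # 1D mesh -> a single placement
    if isinstance(placement, Replicate):
        # Every rank already holds the full tensor.
        return torch.Size(local_shape)
    if isinstance(placement, Shard):
        # Gather every rank's local shape and sum along the sharded dim.
        gathered = [None] * device_mesh.size()
        dist.all_gather_object(
            gathered, list(local_shape), group=device_mesh.get_group()
        )
        global_shape = list(local_shape)
        global_shape[placement.dim] = sum(s[placement.dim] for s in gathered)
        return torch.Size(global_shape)
    raise NotImplementedError("only Shard/Replicate are sketched here")
```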

### Test

`pytest test/distributed/tensor/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152751
Approved by: https://github.com/XilunWu
2025-05-05 02:44:31 +00:00
Aaron Gokaslan
49b9efdf1f [BE]: Cleanup traceutils with fmtlib (#152265)
Simplify code and make it faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152265
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-05-04 22:27:19 +00:00
Julius Herb
8f54e56e62 Add optional device index to AOTIModelPackageLoader (#152093)
This is my suggestion for resolving #152087

This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to different devices like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` would lead to the current behavior, i.e., the model is loaded to device `cuda`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093
Approved by: https://github.com/desertfire
2025-05-04 11:40:12 +00:00
FFFrog
fd8fd01d25 [OpenReg] Add _lazy_init and rng_state support for OpenReg (#151914)
As the title stated.

**Changes**:
- Add get_rng_state & set_rng_state support for OpenReg
- Add _lazy_init support for OpenReg
- Remove redundant code for cuda/Module.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151914
Approved by: https://github.com/albanD
2025-05-04 09:42:08 +00:00
PyTorch MergeBot
8faa225695 Revert "[inductor] Realize bucketize/searchsorted output (#152644)"
This reverts commit 9ae4906b21.

Reverted https://github.com/pytorch/pytorch/pull/152644 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/152644#issuecomment-2848743442))
2025-05-03 18:16:39 +00:00
Joe Wang
6ae690f8f0 add support for 0 size shardedTensor and recalculate metadata from all_gather (#152583)
Summary:
Change set:
1. A ShardedTensor could have 0 size initially; the current check won't pass if the size is 0, so that case is handled here.
2. When we call ShardedTensor._init_from_local_shards, it assumes all the metadata is correct and only all_gathers to double check. In the new case the metadata could be all 0 size while the tensor has an actual size, so we need the capability to recalculate the local/global metadata from the local tensor by all_gathering that information.

Test Plan: I don't see an associated UT; I have tested this with the diff stack D73274786.

Differential Revision: D73903933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152583
Approved by: https://github.com/q10, https://github.com/fduwjj
2025-05-03 17:26:29 +00:00
rzou
762844355e Make DispatchKeySet serializable; add __eq__ (#152732)
These seem like reasonable things to add. Also fixes a bug in vLLM for
me.
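
A quick sketch of what this enables (the binding names below are my assumption about the existing `torch._C` bindings):

```
import pickle
import torch

ks = torch._C.DispatchKeySet(torch._C.DispatchKey.CPU)
ks2 = pickle.loads(pickle.dumps(ks))  # serialization added by this PR
assert ks2 == ks                      # __eq__ added by this PR
```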

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152732
Approved by: https://github.com/bdhirsh
2025-05-03 14:40:06 +00:00
PyTorch MergeBot
cc28b43950 Revert "[ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)"
This reverts commit 844842dfbf.

Reverted https://github.com/pytorch/pytorch/pull/151368 on behalf of https://github.com/malfet due to This broke inductor cpp wrapper ([comment](https://github.com/pytorch/pytorch/pull/151368#issuecomment-2848519706))
2025-05-03 08:31:31 +00:00
Ke Wen
457fa820ad [c10d] Fix extra CUDA context created by barrier (#149144)
Fixes #149119.

In ProcessGroup.hpp, we create a dummy tensor for dispatching. This
requires a correct device index. This PR uses the `device_id` given by the user
when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device
type.
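
On the user side this relies on passing `device_id` at init time, e.g. (assuming the usual env:// rendezvous variables are set):

```
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),  # tells c10d which GPU this rank owns
)
dist.barrier()  # the dummy tensor now lands on cuda:<local_rank>, no extra context
dist.destroy_process_group()
```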

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever
2025-05-03 03:13:34 +00:00
PaulZhang12
84aa0985fb [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton templates as well, without requiring a lot of extra compile time (async compilation). Anecdotal evidence shows that Triton BMM usually performs better than aten BMM
* Add support for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench on Split-K shapes, comparing aten performance against pt2_triton, which now autotunes over decompose_k. On average we see a >10% speedup over aten, and for some shapes over 3x the performance of the previously best Triton mm:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-03 02:23:54 +00:00
Anatoly Myachev
5e9682719f [Inductor UT] Generalize device-bias code in test_flex_attention.py (#151937)
@EikanWang @etaf @guangyey please take a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151937
Approved by: https://github.com/drisspg
2025-05-03 01:12:49 +00:00
rzou
3d777bae10 Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.
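
A small sketch of the default behavior described above (illustrative; `mylib::row_sum` is a made-up op):

```
import torch

@torch.library.custom_op("mylib::row_sum", mutates_args=())
def row_sum(x: torch.Tensor) -> torch.Tensor:
    return x.sum(dim=-1)

@row_sum.register_fake
def _(x):
    return x.new_empty(x.shape[:-1])

@torch.compile
def f(x):
    # With no layout tag on the op, inductor should now pass x.t() through
    # with the exact (transposed) strides it would have in eager.
    return row_sum(x.t())

f(torch.randn(8, 16))
```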

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #148104
2025-05-03 00:02:24 +00:00
rzou
2b37a726e0 Refactor layout constraint selection logic (#148104)
This PR:

- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (it seems to have broken)
- cleans up the "lazy registration path", which never seems to get hit anymore
- adds dislike_padding to nodes that require exact strides

Test Plan:
- tests + CI

disable padding

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-05-03 00:02:24 +00:00
iupaikov-amd
730a077d48 [ROCm] Unskipped test_rnn_dropout_state for ROCm (#152339)
Unskipping the test; it should work fine now.

Related PR: https://github.com/pytorch/pytorch/pull/144572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152339
Approved by: https://github.com/jeffdaily
2025-05-02 22:02:30 +00:00
Thomas Bohnstingl
ea12a38668 [associative_scan] Refactoring of input checking and dynamo invocation (#148657)
This PR is the counterpart of https://github.com/pytorch/pytorch/pull/142125 for the associative_scan operation. It changes how the input checks are performed: the combine_fn is no longer invoked in the frontend to check the output trees; instead, dynamo is used for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148657
Approved by: https://github.com/ydwu4
2025-05-02 21:39:28 +00:00
NikhilAPatel
8afe40bc5e [Inductor] Fix kernel argument ordering when using dynamic shapes with workspace (#152660)
Summary:
This PR fixes a bug in the Triton kernel invocation path where the `workspace_tensor` was inserted before the unpacked `extra_args` list in the final kernel argument list. This broke the expected ordering of arguments when dynamic shape size hints are emitted.

When dynamic shapes are used, `extra_args` contains both size hint arguments and grid arguments. The kernel expects the argument list to follow the order: **size hints → workspace tensor → grid args**. But previously, the `workspace_tensor` was inserted before unpacking `extra_args`, resulting in: **workspace tensor → size hints → grid args**, which is incorrect.

This fix constructs the workspace tensor earlier, allowing it to be slotted in after the size hints and before the grid arguments, restoring the expected argument layout.

Test Plan:
contbuild and OSS CI

Reviewers: paulzhan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152660
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg
2025-05-02 21:32:07 +00:00
David Berard
9ae4906b21 [inductor] Realize bucketize/searchsorted output (#152644)
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:

https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554

shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 * 384 - one for each of the broadcasted outputs.
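
An illustrative shape pattern (not the linked repro): the searchsorted output broadcasts against a much larger tensor, so fusing it would repeat the binary search for every broadcasted element.

```
import torch

boundaries = torch.linspace(0, 1, 1024, device="cuda")

@torch.compile
def f(values, other):
    idx = torch.searchsorted(boundaries, values)  # (2**15, 1); realized after this change
    return other + idx                            # broadcasts to (2**15, 384)

values = torch.rand(2**15, 1, device="cuda")
other = torch.rand(2**15, 384, device="cuda")
f(values, other)
```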

**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.

After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).

Differential Revision: [D74036850](https://our.internmc.facebook.com/intern/diff/D74036850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152644
Approved by: https://github.com/benjaminglass1, https://github.com/eellison
2025-05-02 20:31:17 +00:00
Alexander Grund
74b496e54c Cleanup DeviceInterface in triton test (#152409)
- Remove inherited functions
- Return valid device_count (1 device: idx=0)
- Remove unused function `triton_supported`

Followup to #144399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152409
Approved by: https://github.com/jansel
2025-05-02 20:25:32 +00:00
Eddie Yan
ec68d082a1 [CUDA][TF32] Account for TF32 in test_conv2d_same_padding (#152618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152618
Approved by: https://github.com/msaroufim, https://github.com/Skylion007
2025-05-02 20:19:00 +00:00
Catherine Lee
39c0b01970 [ez] Disable failing test in periodic no gpu no avx (#152698)
Failing on periodic after it was added in #152542.
Example:
inductor/test_cpu_repro.py::CPUReproTests::test_tanh_atan2_use_decompose_tanh [GH job link](https://github.com/pytorch/pytorch/actions/runs/14775755628/job/41485185829) [HUD commit link](6f6acb4128)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152698
Approved by: https://github.com/huydhn, https://github.com/hl475
2025-05-02 20:02:48 +00:00
Akash Verma
5d860c1e54 [ROCm][CI] Enabled fp8 distributed tests in test_micro_pipeline_tp.py for MI300 (#151977)
This PR enables fp8 distributed tests on MI300.
To test the added feature, I ran the distributed.tensor.parallel.test_micro_pipeline_tp test; all the tests passed successfully, and none were skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151977
Approved by: https://github.com/jeffdaily
2025-05-02 19:22:18 +00:00
Eddie Yan
0488883d6e [cuDNN][SDPA] Fix head-dim 256 condition for SM 10.0 (#152076)
turns out the backward is not supported yet, whoops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152076
Approved by: https://github.com/drisspg
2025-05-02 18:43:33 +00:00
Prachi Gupta
1ea2731e26 [ROCm] Add support for SymmetricMemory (#150580)
This is an attempt to re-land the initial PR https://github.com/pytorch/pytorch/pull/134817 with recent design changes from upstream.

**NOTE:**
ROCm does NOT have multicast/multimem hardware support at the moment, so those features are disabled in symmetric memory for ROCm. This also means that we currently do not have a way of lowering add + all_reduce + wait_tensor into a one_shot_all_reduce op in inductor, as that depends on multicast buffer support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150580
Approved by: https://github.com/jeffdaily, https://github.com/kwen2501, https://github.com/yoyoyocmu

Co-authored-by: Xiaodong Wang <xdwang@fb.com>
2025-05-02 18:35:14 +00:00
Laith Sakka
376529c78b consolidate guard_or_x and definitely_x (#152463)
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful to a degree that justifies the existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.
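
A minimal sketch of the substitution (framework-internal style, not code from this PR):

```
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true

def is_empty(size):
    # Previously: definitely_true(size == 0) -- True only when provable.
    # Same "don't guard, assume False when unknown" intent:
    return guard_or_false(size == 0)

def is_definitely_nonempty(size):
    # Previously: definitely_false(size == 0); expressible via guard_or_true.
    return not guard_or_true(size == 0)
```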

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
2025-05-02 18:08:11 +00:00
Jithun Nair
844842dfbf [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-02 17:21:18 +00:00
Laith Sakka
f65fb0a23d Make PGO code state not sensitive to file path by hashing file content when the file is available. (#152628)
In some internal frameworks, on second attempts the actual code is copied to a different path than on previous attempts, but it is still the same code. PGO will not work in those cases because, before this PR, state entries were identified by (file path, function name, line number).

After this PR they are identified by (hash of the file content when the file is available, function name, line number). This way PGO will work for those jobs on future attempts, and re-compilations of static versions will be avoided.

Sometimes we do not have access to the source code (the file does not exist). This seems to happen mostly when we re-trace a compiled function, but in general it can happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152628
Approved by: https://github.com/oulgen
2025-05-02 17:11:21 +00:00
Laith Sakka
38a9a8b7f7 Fix: Consider input defined unbacked during inductor codegen for runtime asserts (#152231)
When we use mark_unbacked, the graph will have an unbacked input SymInt. Right now, deferred runtime assertions that use those symbols are never generated.

This PR changes that: in the forward graph we consider those inputs and generate the corresponding runtime assertions for them. We still ignore them for backward, which is not ideal.
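
A sketch of the scenario as I read it (illustrative, not the PR's test): an unbacked input dimension whose `torch._check` should now surface as a runtime assert in the forward graph.

```
import torch
from torch._dynamo.decorators import mark_unbacked

def f(x):
    torch._check(x.shape[0] >= 2)  # uses the input-defined unbacked symbol
    return x[0] + x[1]

x = torch.randn(8)
mark_unbacked(x, 0)  # shape[0] becomes an unbacked SymInt input
out = torch.compile(f, fullgraph=True)(x)
```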

The way we generate runtime assertions is by emitting them once all the unbacked symbols they use have been seen.

We previously skipped placeholders, because for backward we have a wacky approach where we ignore input-defined unbacked symbols, assume assertions that use them were already emitted in forward, and try to emit all other runtime assertions again; see Note [Backwards runtime asserts].

Doing that, we end up only emitting the runtime assertions that depend on things defined solely in backward, but we could miss checks that span inputs defined in both backward and forward (i.e. one symbol defined in forward and passed as an input to backward, and another defined in backward). This is not ideal; a better approach could be something like https://github.com/pytorch/pytorch/pull/151919, but it requires more work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152231
Approved by: https://github.com/aorenste
2025-05-02 07:01:48 +00:00
Ke Wen
829752ba37 [SymmMem] Add all_to_all_vdev (#151819)
Changes in this PR:
- Merge in/out splits into one tensor
- Multi-block
- Use sync instead of barrier
- Use nvshmemx_collective_launch
- Rotate blocks among peers
- Write back input splits
- Parallel scan works
- Use scan for output offsets
- Use at most 16 blocks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151819
Approved by: https://github.com/ngimel, https://github.com/fduwjj
ghstack dependencies: #151261, #151498
2025-05-02 06:59:21 +00:00
Animesh Jain
9e3fc41060 [invoke_subgraph] rename identifiers to prevent python mangling (#152581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152581
Approved by: https://github.com/BoyuanFeng, https://github.com/zou3519
ghstack dependencies: #152547
2025-05-02 06:46:05 +00:00
PyTorch MergeBot
4f9f1abd6d Revert "Use swap_tensors path in nn.Module.to for all subclasses that override __torch_dispatch__ (#152539)"
This reverts commit 037343657e.

Reverted https://github.com/pytorch/pytorch/pull/152539 on behalf of https://github.com/wdvr due to failing internal tests - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/152539#issuecomment-2846484924))
2025-05-02 06:43:35 +00:00
Ke Wen
d7961a1086 [SymmMem] Add all-to-all (#151498)
Add an all-to-all impl based on NVSHMEM's on-stream API `nvshmemx_alltoallmem_on_stream`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151498
Approved by: https://github.com/fegin, https://github.com/fduwjj
ghstack dependencies: #151261
2025-05-02 06:40:43 +00:00