### Summary
Fixes #148494
Explicitly prefetch the cache lines of the next `B` block to accelerate int8 WoQ (BF16 activation, int8 statically quantized weights) GEMM for small `M` dimension.
Some of this code (the outer loops of the GEMM) is being ported over from Intel Extension for PyTorch. The macro-kernel* and the micro-kernel* are essentially the same, but now optionally prefetch a block of B. Templatization is used so that the prefetch/no-prefetch decision is made at compile time, avoiding a runtime branch (and unnecessary prefetching) in the hot loop.
\* - in [BLIS](https://dl.acm.org/doi/10.1145/2764454) parlance
### Performance data with BS 1
Machine: 32 cores of one socket of an Intel Xeon SP Gen 5 machine
| Model | Input tokens | Output tokens | Next-token latency before this PR | Next-token latency after this PR | Speedup |
|-----------|-------------|-----------------|--------------------------------------|------------------------------------------|-----------|
|GPT-J | 128 | 128 | 42 ms | 38 ms | 9.52 % |
| GPT-J | 1024 | 1024 | 48 ms | 45 ms | 6.25 % |
|LLaMA 3.1 8B Instruct | 128 | 128 | 52 ms | 47 ms| 9.61% |
|LLaMA 3.1 8B Instruct | 1024 | 1024 | 57 ms | 53 ms| 7.01% |
While the GEMM input shapes for the linear layers in next-token computation are the same regardless of the number of input & output tokens, the difference in next-token latency between these cases comes from attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149373
Approved by: https://github.com/leslie-fang-intel, https://github.com/Xia-Weiwen
Co-authored-by: Xia Weiwen <xia.weiwen@hotmail.com>
I slap disable on the recomputation hook, otherwise the partitioner may save less/more activations and mismatch with the expected eager count in checkpoint. See code comment `Note: [compiled autograd and checkpoint unpack hook]`.
This fixes all non-nested checkpointing tests. I also wrap nested checkpointing tests, and a few of them still fail.
This also seems to fix all PYTORCH_TEST_WITH_DYNAMO checkpointing tests except for `TestAutograd.test_checkpointing_without_reentrant_custom_function_works`. For those tests, it looks like we fail to HOPify the checkpointed region, and when the backward executes the unpack hooks, dynamo tries to trace them. This messes up the internal state tracking of checkpointing, with some tests raising _StopRecomputationError and others raising the same count-mismatch error as CA.
FIXES https://github.com/pytorch/pytorch/issues/127115
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153300
Approved by: https://github.com/jansel
**Summary**
This PR adds two new ops, `onednn.qadd.tensor` and `onednn.qadd_relu.tensor`, for int8 elementwise add, which accepts inputs on CPU device (instead of QuantizedCPU).
The new ops are implemented with AVX512 instructions and provide similar or better performance, depending on shape, than their QuantizedCPU counterparts `quantized.add` and `quantized.add_relu`.
The new ops support output dtypes other than uint8 (fp32, fp16 and bf16 are supported).
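For intuition, a minimal reference sketch of the semantics such an int8 add implements, written with plain PyTorch ops (the real `onednn.qadd.tensor` signature is not shown here; the scale/zero-point argument names below are assumptions):
```python
import torch

def int8_add_reference(qa, a_scale, a_zp, qb, b_scale, b_zp,
                       out_scale, out_zp, out_dtype=torch.uint8, relu=False):
    # Dequantize the plain-CPU uint8 inputs, add in fp32, optionally apply ReLU.
    a = (qa.to(torch.float32) - a_zp) * a_scale
    b = (qb.to(torch.float32) - b_zp) * b_scale
    out = a + b
    if relu:
        out = out.relu()
    if out_dtype == torch.uint8:
        # Requantize when a uint8 output is requested.
        return (out / out_scale + out_zp).round().clamp(0, 255).to(torch.uint8)
    # Otherwise return the requested floating-point dtype (fp32/fp16/bf16).
    return out.to(out_dtype)

qa = torch.randint(0, 256, (4, 8), dtype=torch.uint8)
qb = torch.randint(0, 256, (4, 8), dtype=torch.uint8)
print(int8_add_reference(qa, 0.1, 128, qb, 0.2, 128, 0.3, 128, out_dtype=torch.bfloat16))
```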
**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k test_int8_add_onednn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152411
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
The current implementation causes a silent correctness problem with torch.compile when one of the arguments to a compiled function is, say, `np.exp(.3)`, which is represented as a torch.float64 scalar tensor.
Adds a regression test for this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582
Approved by: https://github.com/dcci
Fixes #147336
## Context
NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.
After bringing this to the attention of Triton developer @davidberard98, he identified the memory layout of the tensor in HBM as the cause of non-pipelined loads into SRAM, causing the slowdown.
To summarize:
In flex attention when performing the FP8 GEMM `softmax_scores @ V` the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't column-major in HBM already, leading to substantial performance degradation.
This is because triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see [here](81f93f2c8e/lib/Dialect/TritonGPU/Transforms/Pipeliner/PipeliningUtility.cpp (L403))).
I.e., when loading 4 bytes of contiguous data from a tensor stored row-major in HBM, we would have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new locations in the col-major layout. Thus the load is not a candidate for pipelining with cp.async; it just moves data to registers and then performs a series of single-byte stores.
## Fix summary
- To fix this, we should enforce memory layouts for Q, K, V in FlexAttention when fp8 is being used, to ensure they each exist in HBM in the necessary memory layout to facilitate pipelined loads into SRAM ahead of the FP8 GEMMs
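For intuition, a minimal sketch (not the actual FlexAttention change; the helper name is made up) of what "enforcing a column-major layout" for V means at the tensor level:
```python
import torch

def to_column_major_2d(t: torch.Tensor) -> torch.Tensor:
    # Same logical shape, but the last two dims become column-major:
    # elements along dim -2 end up contiguous in memory (stride(-2) == 1),
    # which is what the tl.load of V blocks needs for pipelined cp.async copies.
    return t.transpose(-1, -2).contiguous().transpose(-1, -2)

B, H, S, D = 2, 4, 128, 64
v = torch.randn(B, H, S, D, dtype=torch.bfloat16)  # cast to fp8 in the actual repro
v_cm = to_column_major_2d(v)
print(v_cm.shape, v_cm.stride())  # shape unchanged, stride of dim -2 is 1
```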
## Benchmarks
Rerunning the repro, we see the fp8 runtime is reduced from 120% of the bf16 runtime to 76%.
Before fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 19:07:33,402 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 19:07:35,885 - flex_bench - INFO - bf16: 424.87228804347734 us
2025-05-11 19:07:35,893 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 19:07:37,319 - flex_bench - INFO - fp8e4m3: 515.714000000001 us
```
After fix:
```
(flex) [danvm@devgpu007.eag6 ~/ml-perf-tools/flex_attention (main)]$ rm -rf /tmp/torchinductor_${USER}; python profile_flex.py --bf16 --fp8
2025-05-11 17:34:38,223 - flex_bench - INFO - Running benchmark: bf16
2025-05-11 17:34:41,157 - flex_bench - INFO - bf16: 423.4662032967036 us
2025-05-11 17:34:41,167 - flex_bench - INFO - Running benchmark: fp8e4m3
2025-05-11 17:34:42,917 - flex_bench - INFO - fp8e4m3: 326.3694803493453 us
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153357
Approved by: https://github.com/ngimel, https://github.com/davidberard98
Async compile workers don't respect inductor configs generally that get changed in the middle of execution because they warm up early. StaticCudaLauncher is especially susceptible to this because it affects triton compilation without being part of the inductor meta. So we'll pass it in via extra configs on each worker run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153382
Approved by: https://github.com/masnesral, https://github.com/jansel
Fixes https://github.com/pyro-ppl/pyro/issues/3419 which is actually a `torch` bug that can be replicated by the below code:
```
from torch import rand
from torch.distributions import MixtureSameFamily, Categorical, Binomial
max_count = 20
probs = rand(10, 5)
binom_probs = rand(10, 5)
d = MixtureSameFamily(Categorical(probs=probs), Binomial(max_count, binom_probs))
d.log_prob(d.sample())
```
which results in:
```
Traceback (most recent call last):
File "test.py", line 11, in <module>
d.log_prob(d.sample())
File "pytorch\torch\distributions\mixture_same_family.py", line 168, in log_prob
self._validate_sample(x)
File "pytorch\torch\distributions\distribution.py", line 315, in _validate_sample
valid = support.check(value)
^^^^^^^^^^^^^^^^^^^^
File "pytorch\torch\distributions\constraints.py", line 307, in check
(value % 1 == 0) & (self.lower_bound <= value) & (value <= self.upper_bound)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (10) must match the size of tensor b (5) at non-singleton dimension 1
```
### Fix explanation (only for cases when the component distribution contains parameters with batch dimensions)
- The failure is due to sample validation taking place before padding in `MixtureSameFamily.log_prob`, and hence the fix is to pad before doing sample validation.
- The fix itself does not alter the calculations at all. It only affects the sample validation process.
- The failure does not occur with the component distribution set to the `Normal` distribution, as its support is not defined elementwise (even though the validation itself is elementwise).
- I've split the `test_mixture_same_family_log_prob` test into two tests based on the `Normal` and `Binomial` distributions.
- Initially, the `Binomial` version of the test did not fail, but this was due to the component distribution having equal batch dimensions of (5, 5) so I changed it to (10, 5).
### Updated fix explanation (for all cases)
- The previous fix introduced a bug in sample shape validation (which was previously correct) because the padding took place before the sample validation.
- The updated fix corrects the support to reflect the fact that the support of `MixtureSameFamily` is equal to the support of its component distribution with the first event dimension removed.
- This issue was already anticipated in the [code](331423e5c2/torch/distributions/mixture_same_family.py (L127)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151317
Approved by: https://github.com/albanD, https://github.com/fritzo
By computing matmuls of only one random non-zero batch on CPU
This reduces test runtime from 11 minutes to 14 sec
```
% python3 test/test_mps.py -v -k test_large_bmm_
test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok
test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok
----------------------------------------------------------------------
Ran 2 tests in 27.495s
```
TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed
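A rough sketch of the test strategy (illustrative, not the actual test code): only one randomly chosen batch is non-zero, so the CPU reference matmul only needs to be computed for that single slice.
```python
import torch

def check_large_bmm_one_batch(B=64, M=256, K=256, N=256, dtype=torch.float16):
    i = torch.randint(B, (1,)).item()          # the single non-zero batch
    a = torch.zeros(B, M, K, dtype=dtype)
    b = torch.zeros(B, K, N, dtype=dtype)
    a[i] = torch.rand(M, K, dtype=dtype)
    b[i] = torch.rand(K, N, dtype=dtype)
    out = torch.bmm(a, b)                      # would run on the "mps" device in the real test
    ref = torch.mm(a[i].float(), b[i].float()) # CPU reference for that one batch only
    torch.testing.assert_close(out[i].float(), ref, rtol=2e-2, atol=2e-2)

check_large_bmm_one_batch()
```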
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562
Approved by: https://github.com/Skylion007, https://github.com/clee2000
When we do `torch.compile(module)`, we eventually end up returning a new
`OptimizedModule` instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) for the compiled module.
`OptimizedModule` also inherits `nn.Module.__call__`, and thus
has its own hook logic. This is useful for torchao, which injects module
forward hooks to run in eager for quantization purposes.
However, this might create unexpected behavior for global module hooks,
because `torch.compile(module)` causes the hook to fire one extra time
for `OptimizedModule`, when compared to eager.
To preserve BC, we simply emit a warning for this behavior, and let
users decide what to do. This is reasonable because the global module
hooks are documented to be used for debugging/profiling purposes only.
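A small illustration of the behavior being warned about (a sketch; `backend="eager"` is used only to keep it self-contained):
```python
import torch
import torch.nn as nn

calls = []
# Global module hook: fires for every nn.Module.__call__, including OptimizedModule's.
handle = nn.modules.module.register_module_forward_hook(
    lambda mod, inp, out: calls.append(type(mod).__name__)
)

m = nn.Linear(4, 4)
opt_m = torch.compile(m, backend="eager")
opt_m(torch.randn(2, 4))
# In eager, only "Linear" would be recorded; with torch.compile(module) the hook
# also fires for the wrapping OptimizedModule, i.e. one extra time.
print(calls)
handle.remove()
```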
Fixes #149502
Differential Revision: [D74611716](https://our.internmc.facebook.com/intern/diff/D74611716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305, https://github.com/zou3519
As titled, this PR improves the device selection logic when the user did not set the device before calling the DeviceMesh constructor; as a device manager, DeviceMesh should try to set the device for the user in a sensible way.
The set_device behavior before this PR:
* If the user calls init_process_group to init a world process group, we assume the user has already called set_device, and we don't set the device for them.
* If the user does not init a world process group themselves, we init one for the user and follow a heuristic to set the device.
This is OK, but sometimes the set_device heuristic doesn't work well (e.g. if the user uses TORCH_CUDA_VISIBLE_DEVICES).
So this PR improves the device selection logic to:
* If the default CUDA context is already initialized by the time we init DeviceMesh, we assume the user has run some CUDA operation before and therefore has already selected the device.
* Otherwise, we check whether the env vars from the launcher (i.e. torchrun) contain "LOCAL_RANK" and "WORLD_SIZE"; if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISIBLE_DEVICES issue.)
* If neither applies, we warn the user about the situation and fall back to the old heuristic (see the sketch below).
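A minimal sketch of the LOCAL_RANK branch above (not the actual DeviceMesh code; the helper name is illustrative):
```python
import os
import torch

def maybe_set_device_from_launcher():
    # Mirrors the described logic: if the CUDA context is already initialized,
    # assume the user picked a device; otherwise prefer the launcher-provided
    # LOCAL_RANK (e.g. from torchrun) to bind this process to its device.
    if torch.cuda.is_initialized():
        return
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is not None and "WORLD_SIZE" in os.environ:
        torch.cuda.set_device(int(local_rank))
    # otherwise: warn and fall back to the old heuristic (omitted here)
```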
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
This PR adds a tensor LR variant for the CPU Adagrad(fused=True).
I copied the behavior from the tensor LR variant of CPU Adam(fused=True), where the `lr.item()` is cast to a double and passed in the default function.
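Usage-wise, the variant added here allows something like the following (a sketch; values are arbitrary):
```python
import torch

params = [torch.nn.Parameter(torch.randn(4, 4)) for _ in range(2)]
# Tensor LR with the fused CPU Adagrad; internally lr.item() is cast to a double,
# mirroring the fused CPU Adam behavior described above.
opt = torch.optim.Adagrad(params, lr=torch.tensor(0.01), fused=True)

loss = sum((p * p).sum() for p in params)
loss.backward()
opt.step()
```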
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153078
Approved by: https://github.com/janeyx99
In #117066, shutdown of the rendezvous was added if a worker shuts down. This is incorrect, because the rendezvous is actually shutdown in [this file](fa6f9eb2be/torch/distributed/launcher/api.py (L290)) but should not be shutdown if a signal is received. See also [this pull request](https://github.com/pytorch/pytorch/pull/67749).
#124819 then tried to remediate the situation by fixing the faulty shutdown for the restart case. But this is only triggered if the agent restarts the training, but not if the shutdown of the rendezvous happened before.
Removing both these changes restores the original behavior. The rendezvous should only be shutdown if a run completes or fails, not for a single worker leaving.
Fixes #150916
Fixes #147064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152525
Approved by: https://github.com/kiukchung
Summary: I forgot to remove this unused field in D73809989.
Test Plan: `buck test 'fbcode//mode/opt' fbcode//caffe2/test:fbonly -- --exact 'caffe2/test:fbonly - test_compilation_metrics_logger_in_sync (caffe2.test.fb.test_fb.TestFBOnly)'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153413
Approved by: https://github.com/c00w
Adds create_graph support if you don't compile or compile only with torch.compile(backend="eager").
Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, where its double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
Toggling on `torch._dynamo.config.compiled_autograd = True` was causing export to error (optimize_assert didn't have `rebuild_ctx` defined). Separately, add a way to `rebuild_ctx` for `optimize_assert`, since it is a public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153193
Approved by: https://github.com/jansel
Support xpu tf32 matmul using torch.backends.mkldnn.allow_tf32; we will discuss in the future whether we need a new API to control matmul only.
~~Support xpu tf32 matmul using torch.set_float32_matmul_precision. For conv, check https://github.com/pytorch/pytorch/pull/137570
We decide not following torch.backends.cuda.matmul.allow_tf32 because this API actually calls setAllowTF32CuBLAS to set matmul_precison to high. We also avoid other related tf32 changes (i.e. in inductor) by not introducing new API.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144240
Approved by: https://github.com/EikanWang
By invalidating all variables created during the loop, except for the contents of the iterator cache (as stores can happen inside the reduction loop), and clearing the `IteratorRangeEntry` codegen cache.
This results in the following kernel for `x / x.sum()` when x has 2048 elements and the max thread group size is 1024:
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device half* out_ptr1,
    constant half* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[32];
    float tmp_acc_1 = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp0 = static_cast<float>(in_ptr0[r0_0]);
        tmp_acc_1 += tmp0;
    }
    auto tmp1 = c10::metal::threadgroup_sum(tmp_acc_0, tmp_acc_1, r0_index * 1, 1024);
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        auto tmp2 = static_cast<float>(in_ptr0[r0_0]);
        auto tmp3 = tmp2 / tmp1;
        out_ptr1[r0_0] = static_cast<half>(tmp3);
    }
}
```
Fixes compilation errors reported while running `GPUTests.test_pattern_matcher_multi_user_mps` and `GPUTests.test_weight_norm_bwd_mps`.
Fixes https://github.com/pytorch/pytorch/issues/152155
Some inductor tests are still failing, though; the variable invalidation needs further refinement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153362
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/jansel
Summary:
To add PT2 information to memory snapshot we piggyback off of the Kineto implementation using record_function similar to adding the user annotations. To do this we add the following:
1. Stack implementation that we instantiate to keep track of which compile context stack we are currently in (top element of the stack). The stack will be per device and thread-local since different threads of a process can be in different compile contexts at a given time. For this reason, we do not need to add mutexes to our stack impl since no two threads will touch a given stack
2. RecordFunction hooks to properly pipe the correct events to the compile context stack. These hooks are similar to the annotation ones in the fact that we just register them lazily and DO NOT unregister them. This is done out of convenience. In the future, we should save the handles and unregister them to minimize overhead after profiling is finished. As of now, we are registering this at the FUNCTION scope which is wide; however, we treat any function that does not start with "Torch-Compiled Region" as a no-op so we anticipate the difference in performance to be negligible during and after profiling. We also hide this feature behind a flag set to off on default so existing jobs will be unaffected
3. Piping for compile context to pickle output
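Conceptually, the per-thread stack from (1) behaves like this Python sketch (the real implementation is in C++; names are illustrative):
```python
import threading

class CompileContextStack:
    # One stack per thread (and, in the real code, per device): different threads
    # never touch the same stack, so no mutex is needed.
    _tls = threading.local()

    @classmethod
    def _stack(cls):
        if not hasattr(cls._tls, "stack"):
            cls._tls.stack = []
        return cls._tls.stack

    @classmethod
    def push(cls, name):   # e.g. when a "Torch-Compiled Region" record-function fires
        cls._stack().append(name)

    @classmethod
    def pop(cls):          # on the matching exit callback
        if cls._stack():
            cls._stack().pop()

    @classmethod
    def current(cls):      # the compile context attached to an allocation, if any
        s = cls._stack()
        return s[-1] if s else None
```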
Test Plan:
In D74039793, we add CompileContext to the visualizer and we see the following {F1977654658}
Differential Revision: D74028214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152707
Approved by: https://github.com/eqy
Summary:
Previously, we only supported non-scaling quantization, which may lead to overflow. Here we add support for scaling quantization and set it as the default.
We quantize activation nodes based on `size_in_mb`; the default value is 100, i.e., any node of at least 100 MB is quantized.
Test Plan:
### how to enable
```
torch._inductor.config.post_grad_fusion_options = {
"activation_quantization_aten_pass": {
"quant_type": "torch.float8_e5m2", -> default is this type to quantize, you can change the type
"use_scaling": False, -> default is False, if you want to use scaling verison, set it to True
"size_in_mb": 0.0, -> default is 100, you can tune the value.
"exclude_primals": False, -> whether want to exclude quantize parameters, default is False
"allowed_dtypes": "torch.float16;torch.bfloat16;torch.float32", -> dtype you consider to quant, use ";" to separate, default is torch.bfloat16
},
}
```
### toy model
```
buck2 run mode/opt //scripts/qyz/autoac:quantization
```
```
Epoch [80/200], Loss: 19227.2109
Epoch [100/200], Loss: 1353.5272
Epoch [120/200], Loss: 38630.6758
Epoch [140/200], Loss: 6239.9155
Epoch [160/200], Loss: 6039.1567
Epoch [180/200], Loss: 3994.3569
Epoch [200/200], Loss: 146.3966
```
Differential Revision: D73015996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151770
Approved by: https://github.com/Mingming-Ding
The function signature of fused_scaled_matmul_reduce_scatter was changed. This PR fixes the function signature. However when scatter_dim is 1, the two outputs are not close. We need a followup on this.
Another followup is to change fused_scaled_matmul_reduce_scatter to make those newly added arguments optional. Users shouldn't need to pass these arguments if they don't flatten the inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153286
Approved by: https://github.com/kwen2501
When codegen'ed, it looks like:
```py
post_grad_custom_pass = <object at 0x12345678>
```
Which is not runnable at all. Some logic is also trying to deepcopy the
object, and not all of these objects are deepcopy-able.
This PR skips codegenning of these passes.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153243
Approved by: https://github.com/houseroad
## Context
Suppose we have this graph like this :
```
a: "[s1 + u2, 200]"
b: "[u0, 32]"
cat: "[s1 + u2, 232]" = torch.cat([a, b], dim=1)
```
NOTE: torch.cat assumes "all tensors must either have the same shape (except in the concatenating dimension) or be a 1-D empty tensor with size (0,)."
So, we would expect u0 = s1 + u2, which is guarded on today, except it's a deferred runtime assertion since unbacked symints aren't replaced today, as Pian pointed out.
Notice how `b`'s first dimension (`u0`) is symbolically different from the first dimension of `a` and `cat` (`s1 + u2`), even though they must be equal at runtime. Today, this creates an unexpected shape mismatch when AOTI autotunes. Here's a rough illustration, where 8192 is the unbacked symint fallback value.
```
# s1 is an arbitrary integer
a = generate_example_value(size=(s1 + 8192, 200))
b = generate_example_value(size=(8192, 32))
out = generate_example_value(size=(s1 + 8192, 232))
triton_cat.run(a, b, out ...)
```
## Error
```
wrapper.py:1484: <module>: block: [443,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
...
wrapper.py:1484: <module>: block: [443,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tl.broadcast_to(tmp13, [XBLOCK]) < ks0` failed.
RuntimeError: CUDA error: device-side assert triggered
```
Differential Revision: [D74485962](https://our.internmc.facebook.com/intern/diff/D74485962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153220
Approved by: https://github.com/desertfire
DSD currently will pop tensors if these tensors are on the Meta device. This forbids the use case where users would like to let DCP directly initialize the tensors when loading.
This PR also removes test/distributed/checkpoint/e2e/test_pipeline.py, which is based on the above feature, is not realistic, and is not used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153185
Approved by: https://github.com/mori360
Follow up to @ezyang's PR #153020, but makes better use of PEP 621 to reduce redundant fields and pass metadata through to uv, setuptools, poetry, and other tooling.
* Enables modern tooling like uv sync and better support for tools like poetry.
* Also allows us to set project-wide settings that are respected by linters and IDEs (in this example we are able to centralize the minimum supported Python version).
* Currently most of the values are dynamically fetched from setuptools; eventually we can migrate all the statically defined values to pyproject.toml and they will be autopopulated into the setuptools arguments.
* This controls what additional metadata shows up on PyPI. Special URL names are listed here for rendering on PyPI: https://packaging.python.org/en/latest/specifications/well-known-project-urls/#well-known-labels
This also clearly shows us which fields will need to be migrated from setup.py to pyproject.toml over time per #152276. Static fields should be fairly easy to migrate; the dynamically built ones like requirements are a bit more challenging.
Without this, `uv sync` complains:
```
error: No `project` table found in: `pytorch/pyproject.toml`
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153055
Approved by: https://github.com/ezyang
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" for representing fp32 internal computation data types. Instead, we will directly use the algorithm name to represent it.
### Design Choice: Directly use algorithms name like "TF32", "BF16".
#### Pros
- The names are more informative. 'tf32' is more informative than a simple "high".
- Easier to extend to new algorithms like `tf32x3`
#### Cons
- "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.
### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')

### We provide 3 fp32 compute precisions that can be set:
- **"ieee"**: Not allowed to use any other internal computation data types .
- **"tf32"**: Allowed to use tf32 as internal computation data types.
- **"bf16"**: Allowed to use bf16 as internal computation data types.
- **"none"**: Precision's are not set. Can be override by its father node.
### Overriding Precision Settings
A child node can be overridden by its parent node if it is set to the default.
For current default settings:
```
backend = generic, op = all, precision setting = none
backend = cuda, op = all, precision setting = none
backend = cuda, op = conv, precision setting = tf32
backend = cuda, op = rnn, precision setting = tf32
backend = cuda, op = matmul, precision setting = none
backend = mkldnn, op = all, precision setting = none
backend = mkldnn, op = conv, precision setting = none
backend = mkldnn, op = rnn, precision setting = none
backend = mkldnn, op = matmul, precision setting = none
```
- If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
- If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16".
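A short usage sketch of the override behavior (API names as given in this description; the resolved values are what the rules above imply):
```python
import torch

# Setting the backend-level precision overrides op-level settings that are
# still at their default ("none").
torch.backends.mkldnn.fp32_precision = "bf16"
print(torch.backends.mkldnn.matmul.fp32_precision)  # expected to resolve to "bf16"

# An explicitly set child is not overridden by its parent:
torch.backends.mkldnn.conv.fp32_precision = "ieee"
print(torch.backends.mkldnn.conv.fp32_precision)    # "ieee"
```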
### Backward Compatible
Since the new API allows more fine-grained control, there will be some conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent a state like `torch.backends.cudnn.rnn.fp32_precision="ieee"` together with `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
- If the user only uses the previous APIs, they will work as previously expected.
- If the user uses the **new** API to change the state to one that is **un-representable** by the old API and then tries to access that state via the **old** API, we raise a RuntimeError and point the user to the documentation.
### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD
Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
Summary: _fold_conv_bn_qat has logic to remove the tracking stats. Currently, the relevant check covers only torch.nn.modules.batchnorm.BatchNorm2d. As a result, the tracking stats are not properly removed when the 1D variant is used. This diff fixes that.
Test Plan:
Run N7113483 without this fix.
{F1977726982}
```
bento kernel build sensorml
```
Re-run with local version of kernel, containing this diff:
{F1977727151}
Notice that now, num_batches is removed.
Differential Revision: D74269649
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152982
Approved by: https://github.com/andrewor14, https://github.com/yushangdi
Error out if K=0 in one of the grouped gemms to avoid hangs in #152668
Also, adds meta function for _scaled_grouped_mm (TODO: do the same for _grouped_mm, unless it's done already)
One weird thing I'm seeing: when running all grouped_gemm tests, I'm erroring out with
```
File "/data/users/ngimel/pytorch/torch/_inductor/graph.py", line 1246, in call_function
out = lowerings[target](*args, **kwargs) # type: ignore[index]
File "/data/users/ngimel/pytorch/torch/_inductor/lowering.py", line 445, in wrapped
out = decomp_fn(*args, **kwargs)
File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 444, in tuned_scaled_grouped_mm
if is_nonzero and can_use_triton_kernel(mat_a, mat_b, offs, bias):
File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 375, in can_use_triton_kernel
offs is not None
File "/home/ngimel/.conda/envs/pytorch_monarch/lib/python3.10/site-packages/sympy/core/relational.py", line 516, in __bool__
raise TypeError("cannot determine truth value of Relational")
torch._inductor.exc.InductorError: LoweringException: TypeError: cannot determine truth value of Relational
```
which is weird, there's no relational that sympy has to evaluate in `offs is not None`, and when running this test separately (`test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_use_torch_compile_True_cuda`) it passes. I suspect some autotuning cache has to be reset between runs, but don't know what to look for.
Edit: that error is "fixed" by setting `dynamic=False`; now, with the correct meta function, something's wrong with dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153226
Approved by: https://github.com/kwen2501
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925
A reland of https://github.com/pytorch/pytorch/pull/152125.
added a try-except around the justknob internally. Also added more documentation
Currently, AOTI only generate runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.
Also factored out the run time assertion logic to a separate function.
We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from input graphs because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes we would immediately turn all of these assertions into no-ops: when we evaluated their expressions, we would see that, because we had a deferred runtime assert in the ShapeEnv, we already know "oh, of course this expression is True".
One example is below:
```
class Model(torch.nn.Module):
def forward(self, a, b, c):
nz = torch.nonzero(a)
ones = a.new_ones([nz.size(0), b.size(0)])
torch._check(ones.size(0) >= 1)
equals = torch.add(ones, c)
return equals
torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that a and nonzero have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See `test_unbacked_equals_input_size_runtime_assertion` in test_aot_inductor.
In addition to the Inductor generated runtime asserts, we also need the runtime asserts from the input graph, because some derived runtime asserts are not generated in Inductor. One example is below:
```
class Model(torch.nn.Module):
def forward(self, x):
y = x.reshape(100, -1).clone()
y = y + 1
return y
dynamic_shapes = {
"x": {0: torch.export.Dim.DYNAMIC},
}
x.shape[0] needs to be a multiple of 100.
```
See `test_aoti_runtime_asserts_backed_symint` in test_aot_inductor.
Example:
```
def forward(self):
arg0_1: "f32[s35]";
arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)
#
mod: "Sym(Mod(s35, 100))" = sym_size_int % 100; sym_size_int = None
eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0; mod = None
_assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'"); eq_2 = _assert_scalar = None
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]); arg0_1 = None
clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view); view = None
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1); clone = None
return (add_6,)
```
Generated cpp code:
```
auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
auto arg0_1 = std::move(inputs[0]);
auto arg0_1_size = arg0_1.sizes();
int64_t s35 = arg0_1_size[0];
inputs.clear();
auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchinductor_dynamic_shapes -- -r test_unbacked_floordiv_simplify
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1 buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_sym_i64_input_codegen_cuda
TORCHINDUCTOR_SCALAR_ASSERTS_FULL=1 buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r test_unbacked_equals_input_size
```
Differential Revision: D74361799
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153182
Approved by: https://github.com/henrylhtsang
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.
This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.
However - it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output. In this case we shouldn't cache at all because what would that really mean?
So this PR changes the caching behavior so that if there's a symbol in the output which doesn't come in some way from the input then we refuse to cache that op.
Added a test which checks for this case.
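A concrete instance of the problematic case (a sketch, not the test added here): `torch.nonzero` has no symbols in its input, but each call produces a fresh unbacked symbol for the output size, so the two calls must not share a cached result.
```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.experimental.symbolic_shapes import ShapeEnv

with FakeTensorMode(shape_env=ShapeEnv()):
    x = torch.empty(8)          # fake tensor with a fully static shape
    a = torch.nonzero(x)
    b = torch.nonzero(x)
    # Each call allocates a new unbacked symbol (u0, u1, ...) for the number of
    # non-zero elements; returning a cached `a` for the second call would be wrong.
    print(a.shape, b.shape)
```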
While in there I also did a couple other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again.
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache and add some comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153034
Approved by: https://github.com/masnesral, https://github.com/tugsbayasgalan
This patch fixes a corner case of the `torch.isin` decomposition when both
inputs are scalars. This pattern showed up from #141196.
Fixes #141196.
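For reference, the failing pattern, reduced from the test shown in the traceback below; compiling it hit the `view()` error before this patch and returns `tensor(True)` after it:
```python
import torch

@torch.compile(fullgraph=True)
def f():
    x = torch.tensor(0)      # 0-dim "scalar" tensor
    return torch.isin(x, x)  # both arguments are scalars

print(f())
```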
Error stack before this patch:
```
File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12503, in test_scalar_isin_decomposition
res = opt_f()
^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 691, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1618, in _call_user_compiler
raise BackendCompilerFailed(
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/output_graph.py", line 1593, in _call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/__init__.py", line 2365, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_inductor/compile_fx.py", line 2317, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/backends/common.py", line 106, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1179, in aot_module_simplified
compiled_fn = AOTAutogradCache.load(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 923, in load
compiled_fn = dispatch_and_compile()
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 1164, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 576, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/aot_autograd.py", line 826, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 180, in aot_dispatch_base
fw_module, updated_flat_args, maybe_subclass_meta = aot_dispatch_base_graph( # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2199, in _trace_inner
t = dispatch_trace(
^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_compile.py", line 51, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1223, in dispatch_trace
graph = tracer.trace(root, concrete_args) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_dynamo/eval_frame.py", line 872, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/_symbolic_trace.py", line 850, in trace
(self.create_arg(fn(*args)),),
^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1278, in wrapped
out = f(*tensors) # type:ignore[call-arg]
^^^^^^^^^^^
File "<string>", line 1, in <lambda>
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 720, in inner_fn
outs = fn(*args)
^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 419, in _functionalized_f_helper
f_outs = fn(*f_args)
^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 81, in inner_fn
outs = fn(*args)
^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 902, in functional_call
out = PropagateUnbackedSymInts(mod).run(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 171, in run
self.env[node] = self.run_node(node)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7387, in run_node
result = super().run_node(n)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 240, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/interpreter.py", line 320, in call_function
return target(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1326, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_subclasses/functional_tensor.py", line 511, in __torch_dispatch__
outs_unwrapped = func._op_dk(
^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/utils/_stats.py", line 27, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 1428, in __torch_dispatch__
return proxy_call(self, func, self.pre_dispatch, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 797, in proxy_call
r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/fx/experimental/proxy_tensor.py", line 2358, in maybe_handle_decomp
out = CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5108, in isin
return isin_default(elements, test_elements, invert=invert)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ryanguo99/repos/pytorch/torch/_decomp/decompositions.py", line 5137, in isin_default
x = elements.view(*elements.shape, *((1,) * test_elements.ndim))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: view() received an invalid combination of arguments - got (), but expected one of:
* (torch.dtype dtype)
* (tuple of ints size)
While executing %isin : [num_users=1] = call_function[target=torch.isin](args = (%x, %x), kwargs = {})
GraphModule: class GraphModule(torch.nn.Module):
def forward(self):
# File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12498 in f, code: x = torch.tensor(0)
x: "i64[][]" = torch.tensor(0)
# File: /home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py:12499 in f, code: return torch.isin(x, x)
isin: "b8[][]" = torch.isin(x, x); x = None
return (isin,)
Original traceback:
File "/home/ryanguo99/repos/pytorch/test/dynamo/test_misc.py", line 12499, in f
return torch.isin(x, x)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153216
Approved by: https://github.com/williamwen42, https://github.com/peterbell10
PR #151968 adds `reorder_for_minimizing_partition` for the minimal number of partitions. If reordering two nodes cannot reduce the number of partitions, `reorder_for_minimizing_partition` should maintain the relative order of these two nodes and rely on other reorder passes for some nice features, such as shorter liveness duration or less peak memory. In an extreme case, when all nodes are on gpu and can be cudagraphed, `reorder_for_minimizing_partition` should not reorder any nodes.
This PR improves `reorder_for_minimizing_partition` to enforce the invariant that the relative order of nodes within the same graph partition is maintained. To do so, we record the index of each node in the input `nodes: list[BaseSchedulerNode]` and use a heap to pop the node with the smallest index, so we always schedule the node with the smallest index within the same graph partition, which respects the invariant. The previous implementation tried to use a queue to achieve this but failed, because node_N at the end may rely on node_1 at the start, such that node_N is added to the queue as soon as node_1 is scheduled.
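A minimal sketch of the index-stable heap ordering (the partition-minimizing preference itself is elided; `successors`/`num_deps` stand in for the scheduler's dependency tracking):
```python
import heapq

def stable_ready_order(nodes, successors, num_deps):
    # Record each node's original index; the heap always yields the ready node
    # with the smallest original index, so relative order is preserved whenever
    # reordering would not help.
    index = {n: i for i, n in enumerate(nodes)}
    remaining = dict(num_deps)
    heap = [(index[n], n) for n in nodes if remaining.get(n, 0) == 0]
    heapq.heapify(heap)
    order = []
    while heap:
        _, node = heapq.heappop(heap)
        order.append(node)
        for succ in successors.get(node, ()):
            remaining[succ] -= 1
            if remaining[succ] == 0:
                heapq.heappush(heap, (index[succ], succ))
    return order
```
Because the heap keys are unique original indices, ties never fall through to comparing the node objects themselves.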
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153111
Approved by: https://github.com/eellison
Graph partition analyzes read_writes to get partition input names. However, a weak dep is a fake dependency and is not actually read or written. So we should not include weak deps in graph partition input names.
The following test failure is fixed by removing weak dependency from partition_input_names:
`PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py TestTorchDeviceTypeCUDA.test_params_invalidated_with_grads_invalidated_between_unscale_and_step_Adam_cuda_float32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152863
Approved by: https://github.com/eellison
Summary: HF loading when there is no metadata is an edge case for some users. We were previously calling safe_open(filename) to get the keys in the safetensors file, but this doesn't work with fsspec when models have a backend other than the local fs (i.e. hf, s3, etc.). This diff updates the code to open the file with fsspec.open() and then use safetensors.deserialize() to get the keys.
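Roughly, the key-listing path becomes something like the following sketch (the exact return format of `deserialize()` and the example path are assumptions):
```python
import fsspec
from safetensors import deserialize

def list_safetensors_keys(path: str) -> list[str]:
    # fsspec handles non-local backends (hf://, s3://, ...), unlike
    # safe_open(filename), which expects a local file path.
    with fsspec.open(path, "rb") as f:
        raw = f.read()
    # deserialize() parses the safetensors payload; we only need the tensor names.
    return [name for name, _ in deserialize(raw)]

# e.g. list_safetensors_keys("hf://some-org/some-model/model.safetensors")  # hypothetical path
```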
Test Plan: unit test and e2e test reading from hf
Differential Revision: D74181513
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152856
Approved by: https://github.com/joecummings
# Motivation
fixes https://github.com/pytorch/pytorch/issues/153022
The root cause is that the XPU backend registers the convolution op using `m.impl`, which bypasses the device guard logic typically added by the code generation system. This can lead to unexpected behavior if the current device isn't explicitly set.
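Before this fix, a user-level workaround was to make the ambient device match the tensors' device explicitly (a sketch; the actual fix adds the missing device-guard handling on the C++ side):
```python
import torch

device = torch.device("xpu:1")
torch.xpu.set_device(device)  # ensure the current device matches the tensors used below
```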
# Additional Context
run the following script
```python
import torch
import torchvision.models as models
torch.manual_seed(0)
model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)
device = torch.device('xpu:1') # 'xpu:0'
model = model.to(device=device, dtype=torch.float16)
data = data.to(device, dtype=torch.float16)
with torch.no_grad():
ret = model(data)
print(ret)
print("Execution finished")
```
The output is
```bash
-9.2102e-02, -7.7588e-01, -1.4111e+00, -9.2383e-01, 6.4551e-01,
-6.0730e-03, -7.8271e-01, -1.1904e+00, -4.1602e-01, 3.2715e-02,
-4.9854e-01, -6.3623e-01, -8.5107e-01, -6.8555e-01, -9.4434e-01,
-8.8672e-01, -6.7969e-01, -6.9824e-01, -2.8882e-01, 2.0312e+00]],
device='xpu:1', dtype=torch.float16)
Execution finished
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153067
Approved by: https://github.com/albanD, https://github.com/EikanWang
A typical `bmm` kernel in Helion needs to pass in symint shapes to `torch.baddbmm`. Currently `self.expand((dim1, dim2, dim3))` in baddbmm runs unconditionally and it doesn't work with symint shapes (it raises the following error):
```
Traceback (most recent call last):
File "/home/willfeng/local/helion_yf225/helion/_compiler/type_propagation.py", line 699, in propagate_call
CheckForIndexCalls.retry_call(self.value, proxy_args, proxy_kwargs),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/helion_yf225/helion/_compiler/tile_index_proxy.py", line 104, in retry_call
return fn(*proxy_args, **proxy_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/utils/_stats.py", line 27, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1338, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1986, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1450, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 2645, in _dispatch_impl
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_ops.py", line 806, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/willfeng/local/pytorch/torch/_meta_registrations.py", line 2172, in meta_baddbmm
self = self.expand((dim1, dim2, dim3))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /home/willfeng/local/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd_0.cpp:5025: SymIntArrayRef expected to contain only concrete integers
```
This PR changes it so that we don't run `expand()` when not necessary, which makes the Helion use case (i.e. no broadcasting) work.
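The change amounts to making the meta kernel's expand conditional, roughly along these lines (a simplified sketch, not the exact `meta_baddbmm` diff):
```python
import torch

def meta_baddbmm_sketch(self, batch1, batch2):
    # Only materialize the broadcast when `self` is not already the right shape,
    # so the no-broadcast case (as in Helion) never calls expand() on symint shapes.
    dim1, dim2, dim3 = batch1.size(0), batch1.size(1), batch2.size(2)
    if tuple(self.shape) != (dim1, dim2, dim3):
        self = self.expand((dim1, dim2, dim3))
    return self.new_empty((dim1, dim2, dim3))

out = meta_baddbmm_sketch(torch.empty(2, 3, 5), torch.empty(2, 3, 4), torch.empty(2, 4, 5))
print(out.shape)  # torch.Size([2, 3, 5])
```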
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153112
Approved by: https://github.com/jansel
Flex Attention may have symints in subgraph inputs and outputs. Existing code implicitly captures these symints but does not explicitly store them in TritonTemplateBuffer. This leads to errors when analyzing symints used in Flex Attention as a TritonTemplateBuffer. This PR fixes the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152878
Approved by: https://github.com/drisspg
Summary:
We should generate the kernel for const graph and main graph separately.
The reason is that when we run autotuning, we would create separate
kernel calls and we should make sure that main graph also contains the
runner.
Test Plan:
python test/inductor/test_aot_inductor.py -k test_autotune_with_constant_folding
Differential Revision: [D74347765](https://our.internmc.facebook.com/intern/diff/D74347765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153040
Approved by: https://github.com/angelayi
This diff hardens StaticCudaLauncher in the event a cubin file gets deleted under us. We store the raw cubin on the static cuda launcher, and reload it as needed. On cold start, this can happen if the cubin file is created by triton, and gets deleted before we can load the kernel on the parent process.
We don't want to store the entire cubin both on disk and in memory for caching purposes, so we delete it before caching the data. In the unfortunate/unlikely event that we can't load/find the necessary file on warm start, we skip the stored triton launcher and fall back to regular triton.
This comes at a cost to worker memory, but it's not more memory than regular triton workers already take, so it should be okay.
Tests:
- Make test_static_cuda_launcher always delete the cubin path and reload it
Fixes #153030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153064
Approved by: https://github.com/oulgen, https://github.com/jansel
The reason why we did this before is because that's how our older
autograd.Function x Dynamo interaction worked, but we've since adopted
newer designs that don't actually need the autograd.Function.ctx proxied
into the graph.
We still need an fx.Proxy for the autograd.Function.ctx object, so
whenever we do need one, I create it via discard_graph_changes.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152621
Approved by: https://github.com/oulgen
Fixes #122757
## Test Result
```python
import torch
model_output = torch.randn(10, 5).cuda()
labels = torch.randint(0, 5, (10,)).cuda()
weights = torch.randn(5)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
loss = loss_fn(input=model_output, target=labels)
print(loss)
Traceback (most recent call last):
File "/home/zong/code/pytorch/../loss2.py", line 17, in <module>
loss = loss_fn(input=model_output, target=labels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 1297, in forward
return F.cross_entropy(
^^^^^^^^^^^^^^^^
File "/home/zong/code/pytorch/torch/nn/functional.py", line 3494, in cross_entropy
return torch._C._nn.cross_entropy_loss(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got weight is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_nll_loss_forward)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150750
Approved by: https://github.com/malfet
Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize.
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```
Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB Down: 42MiB (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining 0/4 6.7s exec time total
Command: test. Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
# E2E
### how to enable (you can override the dtype; if nothing is given, the default is fp8)
```
post_grad_fusion_options={
"activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
},
```
Differential Revision: D70522237
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148380
Approved by: https://github.com/Mingming-Ding, https://github.com/Hahu803
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925
Currently, AOTI only generate runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.
Also factored out the run time assertion logic to a separate function.
We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from input graphs because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes we would immediately turn all of these assertions into no-ops: when we evaluated their expressions, we would see that, because we had a deferred runtime assert in the ShapeEnv, we already know "oh, of course this expression is True".
One example is below:
```
class Model(torch.nn.Module):
def forward(self, a, b, c):
nz = torch.nonzero(a)
ones = a.new_ones([nz.size(0), b.size(0)])
torch._check(ones.size(0) >= 1)
equals = torch.add(ones, c)
return equals
torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that
a and nonzero have the same shape would be evaluated to True after we resolve
unbacked bindings using the ShapeEnv.
See test_unbacked_equals_input_size_runtime_assertion in test_aot_inductor.
In addition to the Inductor generated runtime asserts, we also
need the runtime asserts from the input graph, because some derived
runtime asserts are not generated in Inductor. One example is
below:
```
class Model(torch.nn.Module):
def forward(self, x):
y = x.reshape(100, -1).clone()
y = y + 1
return y
dynamic_shapes = {
"x": {0: torch.export.Dim.DYNAMIC},
}
x.shape[0] needs to be a multiple of 100.
```
See test_aoti_runtime_asserts_backed_symint in test_aot_inductor.
Example:
```
def forward(self):
arg0_1: "f32[s35]";
arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)
#
mod: "Sym(Mod(s35, 100))" = sym_size_int % 100; sym_size_int = None
eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0; mod = None
_assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'"); eq_2 = _assert_scalar = None
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]); arg0_1 = None
clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view); view = None
# File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1); clone = None
return (add_6,)
```
Generated cpp code:
```
auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
auto arg0_1 = std::move(inputs[0]);
auto arg0_1_size = arg0_1.sizes();
int64_t s35 = arg0_1_size[0];
inputs.clear();
auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
```
Differential Revision: D73596786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152125
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
Fixes an error message in test/test_optim.py
Current behavior: If running the test with Adagrad, the error message reads: "SGD does not currently support capturable".
Fix: The error message now correctly reads: "Adagrad does not currently support capturable".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153076
Approved by: https://github.com/janeyx99
Although Dim.AUTO covers the case where a user marks more axes as dynamic than the model actually needs, it silently falls back to STATIC where DYNAMIC would fail, which makes debugging harder.
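For context, a minimal sketch of the contrast using `torch.export` (the toy model and shapes below are made up for illustration):
```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        # broadcasting against a fixed-shape tensor pins x to shape (4, 3)
        return x + torch.ones(4, 3)

x = torch.randn(4, 3)
# Dim.AUTO: export succeeds, but dim 0 silently specializes to a static 4
export(M(), (x,), dynamic_shapes={"x": {0: Dim.AUTO}})
# Dim.DYNAMIC: export raises on the specialization instead, surfacing the issue
# export(M(), (x,), dynamic_shapes={"x": {0: Dim.DYNAMIC}})
```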
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153065
Approved by: https://github.com/justinchuby
Differential Revision: D74218923
Running on A100 seems to result in precision loss from decompose_k. This was root caused to the fp16/bf16 reduction setting, which establishes a less precise baseline than decompose_k, as decompose_k uses the bmm.dtype overload for fp32 output.
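For reference (this toggle is not part of the PR's diff), the reduction setting in question is the reduced-precision matmul reduction flag; disabling it gives a baseline whose accumulation precision is comparable to decompose_k's fp32 output:
```
import torch

# Disallow reduced-precision reductions so the fp16/bf16 cuBLAS baseline
# accumulates in fp32, comparable to the decompose_k path's fp32 output.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False
```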
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152897
Approved by: https://github.com/eellison
Tests serialization for RANGE_ITERATOR_MATCH; includes no non-test changes.
This PR handles iterator exhaustion issues by utilizing the janky solution from #152865; it passes a function to generate kwargs and `frame_state.f_locals` is updated with fresh iterators through a second kwarg generation pass after initial tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152872
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730, #152865
Tests serialization for TUPLE_ITERATOR_LEN; includes no non-test changes.
Passing a tuple iterator as input results in the iterator being exhausted during testing. I threw together a super janky workaround via accepting a func for kwarg generation and replacing `frame_state.f_locals` with newly-generated kwargs to get fresh iterators, but insights into a better approach are welcome!
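A minimal sketch of the workaround's shape, with made-up names (the real change lives in the serialization test harness):
```
# The test accepts a kwargs factory instead of concrete kwargs, so each pass
# can be handed fresh iterators rather than an already-exhausted one.
def make_kwargs():
    return {"it": iter((1, 2, 3))}

def run_compiled_and_eager(fn):
    compiled_out = fn(**make_kwargs())  # first pass exhausts its iterator
    eager_out = fn(**make_kwargs())     # second pass gets a fresh iterator
    return compiled_out, eager_out
```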
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152865
Approved by: https://github.com/jansel
ghstack dependencies: #152725, #152727, #152728, #152730
The issue was that
- symbol ids appeared out of order w.r.t. the order of the forward inputs
```
def forward(arg0 # [(s3 - 1) + s4, 32], arg1 #[(s3 - 1)] ..)
```
- this caused codegen to fail because it expects all the base symbols `s3, s4` to have been codegen-ed already.
- the fix: we can skip codegen-ing sympy expressions with multiple symbols, e.g. `(s3 - 1) + s4`, because `s3` and `s4` will be codegen-ed by other inputs.
```
# for example
s3 = arg1.size(0) + 1
s4 = argN.size(0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152579
Approved by: https://github.com/jingsh, https://github.com/desertfire
When we do `torch.compile(mod)`, we eventually end up returning a new
module instance, whose `forward` method is the result of
`torch.compile(mod.__call__)`, meaning it already captures all the extra
logic (e.g., hook firing) from the default `torch.nn.Module.__call__`.
As a result we can't reuse the inherited default `__call__` as is,
because we'd end up running the logic twice.
This patch makes the returned `OptimizedModule` override the default
`__call__`, and directly calls into its compiled `forward` method.
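A rough sketch of the shape of the fix (simplified; this is not the actual `OptimizedModule` source):
```
import torch

class OptimizedModuleSketch(torch.nn.Module):
    def __init__(self, mod: torch.nn.Module):
        super().__init__()
        self._orig_mod = mod
        # compiling mod.__call__ already bakes in hook firing and the rest of
        # the default nn.Module.__call__ logic
        self.forward = torch.compile(mod.__call__)

    def __call__(self, *args, **kwargs):
        # bypass nn.Module.__call__ so that logic does not run a second time
        return self.forward(*args, **kwargs)
```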
Fixes #149502
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152740
Approved by: https://github.com/anijain2305
This PR updates `GuardsStatePickler.reducer_override()` in `torch/_dynamo/guards.py` to handle reconstruction of traceable wrapper subclasses. It's intended to work recursively and handle any level of subclass instance nesting (e.g. subclass instances that contain subclass instances, etc.)
This PR tests the guard on several traceable wrapper tensor subclasses:
* `LocalSubclass`: used to ensure the correct error message is thrown when the subclass is not defined globally
* `torch.testing._internal.two_tensor.TwoTensor`: defines None for its extra metadata
* `SubclassWithMeta`: stores non-trivial extra metadata
* `SubclassWithCustomMetadataGuard`: stores non-trivial extra metadata and defines a custom `__metadata_guard__` classmethod
* `SubclassWithSubclassInnerTensors`: used to test recursiveness; this subclass contains subclass inner tensor components
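For readers unfamiliar with the term, here is a minimal traceable wrapper tensor subclass of the kind these guards must reconstruct (a simplified sketch in the spirit of `TwoTensor`, not one of the test classes above):
```
import torch
from torch.utils._pytree import tree_map

class WrapperTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, inner):
        return torch.Tensor._make_wrapper_subclass(cls, inner.shape, dtype=inner.dtype)

    def __init__(self, inner):
        self.inner = inner

    def __tensor_flatten__(self):
        # inner tensor attribute names + extra metadata (None here)
        return ["inner"], None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, meta, outer_size, outer_stride):
        return WrapperTensor(inner_tensors["inner"])

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        unwrap = lambda t: t.inner if isinstance(t, WrapperTensor) else t
        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))
        wrap = lambda t: WrapperTensor(t) if isinstance(t, torch.Tensor) else t
        return tree_map(wrap, out)
```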
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152626
Approved by: https://github.com/jansel
# Sub-PRs
These PRs contain refactors from the main one. They should be reviewed and merged first.
- https://github.com/pytorch/pytorch/pull/150458
- https://github.com/pytorch/pytorch/pull/152391
- https://github.com/pytorch/pytorch/pull/152587
# Feature
The goals of this PR are twofold.
## Goal 1: Introduce Wrapper IR as an intermediate step in wrapper codegen.
In addition to Triton/C++/Halide kernels, Inductor also generates "wrapper" code which allocates memory and calls the kernels. Originally, this wrapper code was fairly standard Python which resembled a user-written PyTorch program. Over time, various wrapper code generators have been added to accommodate things like AOTInductor, which prefers C++ code for static compilation. This complexity has bled into other parts of the codebase, as we now need if/else statements to choose between Python and C++ macros. (See an example [here](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/ir.py#L5515-L5522).) Since most of these code generation steps are conceptually identical across target languages, it seems reasonable to refactor them into some kind of intermediate representation which can be shared between the various backends. This might also make it easier to develop out-of-tree backends which cannot put their own macros in core Inductor components.
This PR takes some initial steps to formalize Inductor's wrapper codegen by generalizing the existing Memory Planning IR into a fully fledged Wrapper IR. This is pretty much identical to the existing Memory Planning IR, but it supports a richer set of ops for things like kernel definitions and calls. This refactor could help encapsulate wrapper codegen. Ideally, we don't need to worry about direct Python/C++ codegen in the main compiler files such as `ir.py`, and can instead defer these to classes like `PythonWrapperCodegen` and `CppWrapperCpu`, which operate on the Wrapper IR.
## Goal 2: Convert Wrapper IR into FX IR.
One of the main benefits of Wrapper IR is to enable more diverse Inductor backends. This PR introduces a converter from Wrapper IR into [FX IR](https://pytorch.org/docs/stable/fx.html), which is the intermediate representation most commonly used in PyTorch graph compilers. The purpose of this is to enable out-of-tree backends to consume Inductor's output in FX IR, which would hopefully make Inductor easier to leverage in novel compilers, hardware accelerators, etc.
It's not trivial to generate Python or C++ code which Inductor can compile and run, and doing so may require changes to other core Inductor files, for the reasons outlined in the previous section. The goal of supporting FX output is to enable something like `torch.compile`'s [custom backend](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html) system, in which an out-of-tree backend can receive an optimized FX graph from Inductor, and compile and run it however it likes.
The typical users of this feature would likely not be part of PyTorch, and may or may not support running a kernel in eager mode. However, they can understand what `torch.empty_strided` means, compile and run Triton kernels, etc. So we just need to present them with an FX graph saying what code Inductor wants to run, which should be easier to analyze and transform in a third party system than Python or C++ source.
Since FX IR is fairly stable, this mechanism should hopefully isolate third-party backends, hardware accelerators, etc. from the implementation details of Inductor, and vice versa.
# Current status
Things that seem to work:
- Converted a lot of the most common Python codegen lines to Wrapper IR lines.
- Handled the following cases, in addition to what was already in the Memory Planning IR:
- Comments
- Triton kernels
- Extern/fallback kernels
- Freeing tensors (`del buf0`)
- MultiOutput
- Graph outputs
- ReinterpretView / StorageBox, for both call args and outputs.
- FX conversion asserts that the program only contains Wrapper IR lines, and not strings of Python/C++ code.
- Prototype FX converter which can handle some of the most common use cases.
- Defining Triton kernels, and putting them in a side table using TorchDynamo's existing [utilities](https://dev-discuss.pytorch.org/t/higher-order-operators-2023-10/1565).
- Calling wrapped Triton kernels.
- Calling extern kernels and certain types of fallback kernels.
- Support both `extern_kernels.*` and `aten.*`.
- Support multi-output kernels like `torch.topk`.
- Graphs with multiple inputs/outputs.
- Training i.e. calling `Tensor.backward()` in a compiled function.
- Graph breaks (training).
- Run the `torch.fx.GraphModule` on GPU using the standard `__call__` method. This makes it easy to test the correctness of FX codegen.
Things that don't work:
- Both Wrapper IR and Wrapper -> FX coverage are currently best effort. There are still features which aren't captured as Wrapper IR lines, and fall back to plain strings. This representation is functionally correct but probably not rich enough to achieve the goals outlined in the previous sections.
- Fallback kernels seem like the most difficult thing to fully cover, since they each define their own Python/C++ macros that would need to be converted to FX.
- Size/alignment asserts are currently disabled via the config file. It's possible to generate FX IR for these, but it seems reasonable to defer these sanity checks to a later PR.
- CommBuffers and distributed communication are not yet supported. An earlier version of this PR attempted to implement this by calling `empty_strided_p2p`. However, building and testing distributed support seems non-trivial, so it's probably better to defer this.
# Out-of-tree compilers
With this PR, out of tree backends will be able to do further compilation on the FX graphs by subclassing `WrapperFxCodegen` and overriding the `compile_graph` function. This follows the same API as torch.compile's [custom backends](https://pytorch.org/docs/stable/torch.compiler_custom_backends.html), where the user simply returns a callable running the graph. The callable need not be a method of `GraphModule` or any other PyTorch class. See an example below.
```
from torch._inductor.codegen.wrapper_fxir import WrapperFxCodegen
class MyCustomBackend(WrapperFxCodegen):
def compile_graph(self, gm):
# Add 1 to the graph's outputs
def compiled_fn(*args):
return [x + 1 for x in gm.graph.forward(*args)]
return compiled_fn
```
# Example FX graphs
This section contains some example FX graphs generated by Inductor. The correctness of these graphs was verified against eager mode by calling the corresponding `GraphModule`.
Here's an FX graph calling a basic Triton kernel. Notice how outputs are allocated with `torch.empty_strided`, and the Triton kernel is called by reference to Dynamo's triton side table.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((8,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, in_ptr1: %arg0_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
return (buf0,)
```
Here's a more complicated graph that calls a `torch.addmm` extern kernel.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=2] = placeholder[target=arg1_1]
%buf0 : [num_users=3] = call_function[target=torch.empty_strided](args = ((), ()), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(1,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg1_1, out_ptr0: %buf0, xnumel: 1, r0_numel: 129, XBLOCK: 1}})
%buf2 : [num_users=2] = call_function[target=torch.empty_strided](args = ((129, 1), (1, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
%addmm : [num_users=0] = call_function[target=torch.addmm](args = (%buf0, %arg0_1, %arg1_1), kwargs = {alpha: 1, beta: 1, out: %buf2})
%delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
return (buf2,)
```
Here's a graph which indexes into a tuple using `operator.getitem`. This is necessary to use the output of the `torch.topk` operation.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%buf0 : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%arg0_1, 2), kwargs = {})
%buf1 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 0), kwargs = {})
%buf2 : [num_users=2] = call_function[target=operator.getitem](args = (%buf0, 1), kwargs = {})
%delete : [num_users=0] = call_function[target=torch._inductor.codegen.wrapper_fxir.delete](args = (%buf0,), kwargs = {})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf1, xnumel: 2, XBLOCK: 2}})
%triton_kernel_wrapper_mutation_1 : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 1, constant_args_idx: 1, grid: [(2,)], tma_descriptor_metadata: {}, kwargs: {in_out_ptr0: %buf2, xnumel: 2, XBLOCK: 2}})
return (buf1, buf2)
```
Here's a graph that reinterprets an output tensor using `torch.as_strided`. This is one way to handle Inductor's `ReinterpretView` op.
```
graph():
%arg0_1 : [num_users=1] = placeholder[target=arg0_1]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((2, 4), (4, 1)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [(8,)], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg0_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: 8, XBLOCK: 8}})
%buf0_view_buf0_0 : [num_users=1] = call_function[target=torch.as_strided](args = (%buf0, (8,), (1,), 0), kwargs = {})
return (buf0_view_buf0_0,)
```
Here's a graph with dynamic shapes. This one is a little bit funky. Inductor provides a graph input for each shape symbol, which we map to a placeholder, in this example `s6`. Then, shape expressions in the generated code can refer to the symbol `s6`. The size hint for `s6` is stored in `node.meta["val"]` where `node` is the placeholder defining it. This works out in the generated python code because the placeholder defines a Python variable with the name `s6`.
```
graph():
%s6 : [num_users=0] = placeholder[target=s6]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ((s6,), (1,)), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((-s6)//8)), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s6, XBLOCK: 8}})
return buf0
```
Here's another graph, this time with dynamic shapes and strides. The grid expression is more complex since the numel is a product of dimensions.
```
graph():
%s10 : [num_users=0] = placeholder[target=s10]
%arg1_1 : [num_users=1] = placeholder[target=arg1_1]
%arg2_1 : [num_users=1] = placeholder[target=arg2_1]
%buf0 : [num_users=2] = call_function[target=torch.empty_strided](args = ([s10, s10], [s10, 1]), kwargs = {dtype: torch.float32, device: cuda:0})
%triton_kernel_wrapper_mutation : [num_users=0] = call_function[target=torch.ops.higher_order.triton_kernel_wrapper_mutation](args = (), kwargs = {kernel_idx: 0, constant_args_idx: 0, grid: [[-(((s10**2)//(-64))), 1, 1]], tma_descriptor_metadata: {}, kwargs: {in_ptr0: %arg2_1, in_ptr1: %arg1_1, out_ptr0: %buf0, xnumel: s10**2, XBLOCK: 64}})
return buf0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146942
Approved by: https://github.com/jansel
# Motivation
Fix https://github.com/pytorch/pytorch/issues/152301
When XPU is not available, calling `torch.xpu.is_bf16_supported()` still returns `True`, which is inconsistent with the expected behavior (should be False).
# Solution
Aligning with other backends, add `including_emulation` to `torch.xpu.is_bf16_supported` and (as sketched below):
- return `False` if XPU is not available;
- return `True` if `including_emulation` is True;
- return `torch.xpu.get_device_properties().has_bfloat16_conversions` if `including_emulation` is False, i.e., whether the device can generate SPIR-V code for bf16.
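A sketch of the resulting logic (the signature and default value are assumed here, not quoted from the diff):
```
import torch

def is_bf16_supported(including_emulation: bool = True) -> bool:
    if not torch.xpu.is_available():
        return False
    if including_emulation:
        return True
    # otherwise report whether the device can generate SPIR-V code for bf16
    return torch.xpu.get_device_properties().has_bfloat16_conversions
```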
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152317
Approved by: https://github.com/EikanWang
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/`
Existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore only contains the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into an in-memory class with c10 types so that it can be consumed by the runtime later.
Test Plan:
Added test under `test/cpp/nativert/test_tensor_meta.cpp`
Differential Revision: D73820548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475
Approved by: https://github.com/albanD
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:
https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554
shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 \* 384 searches - one for each of the broadcasted outputs.
**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.
After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
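An illustrative (made-up) pattern of the problematic fusion; the actual repro is in the gist above:
```
import torch

def f(values, boundaries, scale):
    # 2**15 binary searches...
    idx = torch.bucketize(values, boundaries)
    # ...broadcast to [2**15, 384]: if bucketize is fused into this kernel,
    # every one of the broadcasted outputs redoes a binary search
    return idx.unsqueeze(-1) * scale

values = torch.randn(2**15)
boundaries = torch.linspace(-3, 3, 128)
scale = torch.randn(384)
out = torch.compile(f)(values, boundaries, scale)
```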
**Retry**
Original PR (https://github.com/pytorch/pytorch/pull/152644) was reverted due to internal bisect blaming this change, but the bisect was a false positive (and is marked as such)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152858
Approved by: https://github.com/aakhundov
After the CI change from 12.4 -> 12.6 around mid-March, the foreach tests have been flaky and hard to repro due to nondeterminism. Per @davidberard98's suggestion, let's try to add a synchronize before checking profiler results to see whether this fixes the flake! The hope is that the 48 currently open foreach flaky issues will close from this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152857
Approved by: https://github.com/davidberard98
Summary:
1. Adding `input` field to `_adapt_flat_args` function
2. In `process_forward_inputs`, `reorder_kwargs` will now do nothing if no kwargs are provided (previously would error)
3. Pass `args` as input to `_adapt_flat_args`
These changes are made to update the InputAdapter
see more context in D73811508
Test Plan: see D73811508
Differential Revision: D73945419
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152575
Approved by: https://github.com/angelayi
Make Functorch interpreters serializable most of the time, so that we can save the guards on functorch states.
## Test Cases:
0. torch.compile() without functorch layers present. Guard should fail with any layer being pushed.
1. torch.compile() nested in vmap.
2. torch.compile() nested in grad.
3. torch.compile() nested in jvp + vmap
4. torch.compile() nested functionalize
5. torch.compile() nested in vmap + grad
Differential Revision: [D74008787](https://our.internmc.facebook.com/intern/diff/D74008787/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152616
Approved by: https://github.com/zou3519
ghstack dependencies: #152615
There's an "are we compiling" check in SAC, which we rely on to know when to propagate recompute tags during tracing.
This check was a bit brittle and missed cases where input ops accept lists of tensors - I updated it to check whether a `FunctionalTensorMode` is active, which should be a 100% reliable way to know if AOTDispatcher is in the middle of running.
There is a long-standing followup here around unifying `torch.compiler.is_compiling()` to work in all cases. We should probably just update it to always check if FakeMode/FunctionalMode are active and use it there. This has a bit of BC risk though so I opted for the more local fix to SAC.
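Roughly, the new check asks whether a `FunctionalTensorMode` is on the dispatch-mode stack; a hedged sketch (the real code may use different internal helpers):
```
from torch._subclasses.functional_tensor import FunctionalTensorMode
from torch.utils._python_dispatch import _get_current_dispatch_mode_stack

def _is_under_aot_dispatch() -> bool:
    # AOTDispatcher traces with a FunctionalTensorMode active, so its presence
    # signals that we are compiling rather than running eagerly.
    return any(isinstance(m, FunctionalTensorMode)
               for m in _get_current_dispatch_mode_stack())
```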
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152195
Approved by: https://github.com/soulitzer
By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align the following output with CPU (if the division were performed in fp32, the result would be truncated to 25):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24., 3.]])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
### Summary
Recreating #151990 to mitigate easyCLA failure
The `compute_global_tensor_shape` util function takes in the local tensor shape, device mesh,
and placements. We all-gather the shapes from the shards and construct the global shape
according to the placement type, as sketched below.
Note: currently only implemented for the placement types Shard and Replicate; StridedShard is a TODO.
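A hedged sketch of the per-placement logic (the helper name and gather mechanism here are illustrative, not the util's actual signature):
```
from torch.distributed.tensor import Replicate, Shard

def global_shape_from_local_shapes(local_shapes, placement):
    """local_shapes: each rank's local shape along one device-mesh dimension."""
    if isinstance(placement, Replicate):
        # every rank holds the full tensor on this mesh dim
        return list(local_shapes[0])
    if isinstance(placement, Shard):
        out = list(local_shapes[0])
        # concatenate the shard sizes along the sharded dim
        out[placement.dim] = sum(s[placement.dim] for s in local_shapes)
        return out
    raise NotImplementedError(f"unsupported placement {placement}")
```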
### Test
`pytest test/distributed/tensor/test_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152751
Approved by: https://github.com/XilunWu
This is my suggestion for resolving #152087
This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to different devices like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` would lead to the current behavior, i.e., the model is loaded to device `cuda`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093
Approved by: https://github.com/desertfire
Summary:
Change set:
1. A ShardedTensor could initially have size 0; the current check does not pass when the size is 0, so that case is handled here.
2. When we call `ShardedTensor._init_from_local_shards`, it assumes all the metadata is correct and all_gathers to double-check. In the new case, the metadata can be all zero-sized while the tensor has an actual size, so we need the ability to recalculate the local/global metadata from the local tensor by all_gathering the information.
Test Plan: I don't see an associated UT; I have tested this with the diff stack, D73274786.
Differential Revision: D73903933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152583
Approved by: https://github.com/q10, https://github.com/fduwjj
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.
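Conceptually, decompose_k splits the K dimension into chunks, runs a batched matmul, and reduces over the chunks in fp32; a hedged illustration (this is not the Inductor template, and the `k_splits` value is arbitrary):
```
import torch

def decompose_k_mm(a, b, k_splits=4):
    M, K = a.shape
    _, N = b.shape
    assert K % k_splits == 0
    a_ = a.reshape(M, k_splits, K // k_splits).transpose(0, 1)  # [k_splits, M, K/k]
    b_ = b.reshape(k_splits, K // k_splits, N)                  # [k_splits, K/k, N]
    # accumulate each partial product in fp32, then reduce over the splits
    return torch.bmm(a_.float(), b_.float()).sum(dim=0)

a = torch.randn(64, 4096, dtype=torch.bfloat16)
b = torch.randn(4096, 64, dtype=torch.bfloat16)
torch.testing.assert_close(decompose_k_mm(a, b), a.float() @ b.float(), rtol=1e-3, atol=1e-3)
```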
Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM
* Add for addmm
* Enable for Inference and AOTI
Below are the results of running TritonBench on split-K shapes, comparing aten performance against pt2_triton, which now autotunes over decompose_k. On average we see a >10% speedup over aten, and for some shapes over 3x the performance of the previously best Triton mm:
<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />
TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />
We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.
Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
This PR:
- cleans up some existing comments that don't make sense anymore
- hooks the "custom_op_default_layout_constraint" back up (it seems to have broken)
- cleans up the "lazy registration path" which seems to never get hit
anymore
- adds dislike_padding to nodes that require exact strides
Test Plan:
- tests + CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148104
Approved by: https://github.com/shunting314, https://github.com/eellison
Summary:
This PR fixes a bug in the Triton kernel invocation path where the `workspace_tensor` was inserted before the unpacked `extra_args` list in the final kernel argument list. This broke the expected ordering of arguments when dynamic shape size hints are emitted.
When dynamic shapes are used, `extra_args` contains both size hint arguments and grid arguments. The kernel expects the argument list to follow the order: **size hints → workspace tensor → grid args**. But previously, the `workspace_tensor` was inserted before unpacking `extra_args`, resulting in: **workspace tensor → size hints → grid args**, which is incorrect.
This fix constructs the workspace tensor earlier, allowing it to be slotted in after the size hints and before the grid arguments, restoring the expected argument layout.
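A toy illustration of the ordering change (all names here are illustrative, not the actual codegen variables):
```
size_hint_args = ["xnumel_hint", "rnumel_hint"]
workspace_tensor = "workspace_0"
grid_args = ["grid_0", "grid_1"]

# before: the workspace tensor was inserted ahead of the unpacked extra_args
broken = [workspace_tensor, *size_hint_args, *grid_args]
# after: size hints -> workspace tensor -> grid args
fixed = [*size_hint_args, workspace_tensor, *grid_args]
```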
Test Plan:
contbuild and OSS CI
Reviewers: paulzhan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152660
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg
**Context**:
bucketize is relatively expensive, computationally. So it's not always profitable to fuse it if it means doing extra computation. For example, this repro:
https://gist.github.com/davidberard98/7fd6af7e6291787c246c705945a25554
shows a slowdown from 56us (eager) to ~100us (torch.compile-d): instead of computing 2\*\*15 binary searches, the fused version does 2\*\*15 \* 384 searches - one for each of the broadcasted outputs.
**Solution**:
Realize the output of bucketize (and searchsorted, which also uses inductor's ops.bucketize). If there's an opportunity to do non-broadcasted fusions, the scheduler can still apply such fusions later on.
After this PR, instead of a slowdown, we see an improvement from 56us (eager) to 33us (compiled).
Differential Revision: [D74036850](https://our.internmc.facebook.com/intern/diff/D74036850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152644
Approved by: https://github.com/benjaminglass1, https://github.com/eellison
This PR enables fp8 distributed tests on MI300.
To test the added feature, we ran the distributed.tensor.parallel.test_micro_pipeline_tp test; all tests passed successfully and none were skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151977
Approved by: https://github.com/jeffdaily
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.
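One possible mapping of the replacement, summarizing the description above (a sketch, not the exact refactor):
```
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true

def definitely_true(expr):
    # both return False when expr cannot be decided without guarding
    return guard_or_false(expr)

def definitely_false(expr):
    # guard_or_true falls back to True on undecidable expressions, so its
    # negation is only True when expr is known to be False
    return not guard_or_true(expr)
```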
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
In some internal frameworks, on second attempts the actual code is copied to a different path than on previous attempts, but it is still the same code. PGO will not work in those cases because state entries used to be identified by (filepath, function name, line number).
After this PR they are identified by (hash(filepath), function name, line number). This way PGO will work for those jobs on future attempts, and re-compilations of static versions will be avoided.
Sometimes we do not have access to the source code (the file does not exist); this seems to happen mostly when we re-trace a compiled function, but it can happen in general.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152628
Approved by: https://github.com/oulgen
When we use mark_unbacked, the graph has an unbacked input symint. Right now, deferred runtime assertions that use it are never generated.
This PR changes that: in the forward graph we now consider those symbols and generate the corresponding runtime assertions. We still ignore them for backward, which is not ideal.
We generate a runtime assertion by emitting it once all of the unbacked symbols it uses have been seen.
We previously skipped placeholders because for backward we have a wacky approach where we ignore input-defined unbacked symbols, assume that assertions using them were already emitted in forward, and try to emit all other runtime assertions again. See Note [Backwards runtime asserts].
Doing that, we end up emitting only the runtime assertions that depend on things defined solely in backward, but we could miss checks that span inputs defined in both forward and backward (i.e., one symbol defined in forward and passed as an input to backward, and another defined in backward). This is not ideal; a better approach could be something like https://github.com/pytorch/pytorch/pull/151919, but it requires more work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152231
Approved by: https://github.com/aorenste