pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 00:20:18 +01:00

Author	SHA1	Message	Date
Kurt Mohler	1e3600b528	[MPS] Move `logaddexp/logaddexp2` to Metal and support complex (#166670 ) NOTE: Complex inputs are only supported in `logaddexp`. Since `logaddexp2` does not support complex inputs for CPU, it is not enabled for MPS in this PR either. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166670 Approved by: https://github.com/malfet	2025-10-31 16:15:02 +00:00
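For context on the operation ported in the commit above: `logaddexp` computes log(exp(a) + exp(b)) and `logaddexp2` computes log2(2^a + 2^b). The sketch below shows a common numerically stable way to evaluate both for real inputs, using only Python's `math` module. It is an illustration under stated assumptions, not the Metal kernel from the PR; the helper names are hypothetical and complex inputs are not handled.

```python
import math


def logaddexp(a: float, b: float) -> float:
    """Stable log(exp(a) + exp(b)) for real inputs (illustrative helper)."""
    if a == b:
        # Also covers equal infinities, where a - b would be NaN.
        return a + math.log(2.0)
    hi, lo = max(a, b), min(a, b)
    # Shift by the larger argument so exp() never overflows.
    return hi + math.log1p(math.exp(lo - hi))


def logaddexp2(a: float, b: float) -> float:
    """Stable log2(2**a + 2**b) for real inputs (illustrative helper)."""
    if a == b:
        return a + 1.0
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(2.0 ** (lo - hi)) / math.log(2.0)
```

Evaluating `math.log(math.exp(1000.0) + math.exp(1000.0))` directly overflows, whereas `logaddexp(1000.0, 1000.0)` stays finite, which is the reason for the shift.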
Xuan Zhang	fee7624bd6	[PT2] set choice handler in config (#166607 ) Summary: We were setting the custom inductor choice using `torch._inductor.virtualized.V.set_choices_handler(CustomInductorChoices())`. However, this leads to inconsistent behavior, even for jobs that are submitted back to back. In this diff, we pass in the choice handler via an inductor config and overwrite the default behavior when the config is provided. This solves the inconsistent behavior. Test Plan: see D85785892 (internal only) Differential Revision: D85785879 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166607 Approved by: https://github.com/eellison	2025-10-31 15:40:05 +00:00
Jeff Daily	24e94e021a	[ROCm][CI] create ROCm 7.1 magma tarball (#166693 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166693 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-31 15:20:00 +00:00
Xuehai Pan	69be99ee51	Remove manually synced arch versions in `tools/nightly.py` (#166616 ) Discussed with @atalman offline. To reduce duplicate changes and reduce the number of files to change when updating arch versions. ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/166616 Approved by: https://github.com/ezyang	2025-10-31 15:11:28 +00:00
Nikita Vedeneev	034e951b0c	[CUDA][cuBLASLt] addmm -- extend bias fusions to cases with (1 by n) shapes (#166307 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166307 Approved by: https://github.com/eqy	2025-10-31 14:30:41 +00:00
Justin Chu	160ab53dd5	Update weight tensor initialization in RMSNormalization (#166550 ) Ensure a >1d tensor as weight for ORT compatibility. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166550 Approved by: https://github.com/titaiwangms	2025-10-31 14:29:27 +00:00
PyTorch MergeBot	5bcfdae71d	Revert "Make PT2 compile backprop through custom op without autograd key a hard error (#166367 )" This reverts commit `4acc66f119`. Reverted https://github.com/pytorch/pytorch/pull/166367 on behalf of https://github.com/atalman due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/166367#issuecomment-3473150269))	2025-10-31 13:44:05 +00:00
PyTorch MergeBot	4e8ba37ce3	Revert "[BE] Move GreenContext implementation details to cpp (#166462 )" This reverts commit `5d288bc3f7`. Reverted https://github.com/pytorch/pytorch/pull/166462 on behalf of https://github.com/atalman due to Sorry, Reverting. Failure: test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_greencontext_carveout_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/18962393091/job/54154156892) [HUD commit link](`85b035ca9c`) ([comment](https://github.com/pytorch/pytorch/pull/166462#issuecomment-3473060299))	2025-10-31 13:20:48 +00:00
PyTorch MergeBot	26534e9809	Revert "[GraphPartition] cache get_free_symbol_uses (#166338 )" This reverts commit `a6b1ef1717`. Reverted https://github.com/pytorch/pytorch/pull/166338 on behalf of https://github.com/atalman due to Failure: test/nn/test_convolution.py::TestConvolutionNN::test_conv3d_overflow_values [GH job link](https://github.com/pytorch/pytorch/actions/runs/18961173726/job/54149112920) [HUD commit link](`a6b1ef1717`) ([comment](https://github.com/pytorch/pytorch/pull/166338#issuecomment-3472980329))	2025-10-31 12:57:56 +00:00
PyTorch MergeBot	657f8c3e21	Revert "Fix torch.full with dynamic tensor fill_value in torch.compile (#166554 )" This reverts commit `32066772b3`. Reverted https://github.com/pytorch/pytorch/pull/166554 on behalf of https://github.com/atalman due to Failure: test/nn/test_pooling.py::TestPoolingNNDeviceTypeCPU::test_max_pool_nan_inf_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18959368975/job/54144148546) [HUD commit link](`32066772b3`) ([comment](https://github.com/pytorch/pytorch/pull/166554#issuecomment-3472976911))	2025-10-31 12:55:31 +00:00
Mwiza Kunda	b0831930ed	[inductor] Mark / restrict tests that only work if ATen is used for matmul (#166518 ) These tests only work if max_autotune=False (default), which for matmul means falling back to ATen. This PR just documents / makes that transparent. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166518 Approved by: https://github.com/eellison	2025-10-31 12:29:06 +00:00
arkadip-maitra	c01636e1bc	Fixes the sparse tensor issue (#163535 ) Fixes #148324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163535 Approved by: https://github.com/janeyx99	2025-10-31 11:48:31 +00:00
fengqing.lu	fd68d409ad	[xpu][feature] Integrate OneDNN SDPA training forward/backward into XPU OVERRIDEABLE Backend (#162454 ) This is the second PR split from https://github.com/pytorch/pytorch/pull/156272 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162454 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg	2025-10-31 11:20:38 +00:00
Wang, Chuanqi	0d3a4f7155	[CD] Enable Inductor performance test for xpu (#166289 ) Add Dynamo benchmark performance tests for XPU backend Pull Request resolved: https://github.com/pytorch/pytorch/pull/166289 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-10-31 10:52:07 +00:00
Xuehai Pan	108bb224f7	[pytree] add `treespec_{leaf,tuple,dict}` functions for args_spec modification (#160843 ) The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class. Changes: 1. Add function `treespec_leaf()` to replace `LeafSpec()`. 2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `args` / `*kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class. 3. Change `len(spec.children_specs)` to `spec.num_children`. 4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`. ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843 Approved by: https://github.com/mlazos	2025-10-31 10:33:16 +00:00
Yuanyuan Chen	fc8ac1216c	[4/N] Remove unused loop variables in tests (#166690 ) This PR removes unused loop variables in tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166690 Approved by: https://github.com/justinchuby, https://github.com/mlazos	2025-10-31 10:20:48 +00:00
Yuanyuan Chen	030de07aff	[2/N] Use 'is' in callable comparisons (#166685 ) It is generally advised to use `is/is not` for comparisons against torch functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166685 Approved by: https://github.com/xmfan, https://github.com/mlazos	2025-10-31 08:08:07 +00:00
Jazlyn Li	7d67a41db4	make FXConverter.generate use V.fake_mode instead of _detect_fake_mode_from_gm (#166591 ) Summary: FXConverter configures _node_metadata_hook passing in `fake_mode` explicitly, which is relevant for cases down the line like `_generate_triton_call` that inserts a `triton_kernel_wrapper_mutation` node. This `fake_mode` is obtained from `_detect_fake_mode_from_gm`, which can be different from the inductor-set `V.fake_mode`. For example, while `V.fake_mode` is not None, `_detect_fake_mode_from_gm` can be None for a parent graph containing only a submodule which has no input args and only constants ``` parent graph(): %sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {}) return (getitem,) submodule graph(): %randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cuda, pin_memory: False}) %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {}) return (add,) ``` This discrepancy is a problem: it makes `_node_metadata_hook` try running inputs in a different "fake_mode" or no fake_mode when the rest of lowering uses `V.fake_mode`. In some cases where input is placed on a custom non-gpu device, it can even complain with "requires device to be started" or tensor device mismatch. So this diff updates FXConverter.generate to use `V.fake_mode` which is populated by inductor properly. Test Plan: added a test `test_const_folded_subgraph` in `test_fxir_backend.py`, this test: - creates a graph module that calls a subgraph with no inputs and containing only const-foldable ops - const folds the subgraph - runs FXConverter.generate, expecting the `fake_mode` used for code generation to be not None On the prior implementation when `_detect_fake_mode_from_gm` was used, this test would fail as fake_mode would be `None`.
With this change, the test passes, `fake_mode` is properly collected from `V.fake_mode` which is not None. Differential Revision: D85767475 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166591 Approved by: https://github.com/blaine-rister, https://github.com/mlazos, https://github.com/eellison	2025-10-31 05:52:07 +00:00
Minjang Kim	85b035ca9c	[nativert] Downcast triton double arguments to floats (#166620 ) This diff tries to fix a limitation in Sigmoid + Triton interaction, where float arguments are not correctly passed. NativeRT passes float arguments as double, while triton kernels were reading as a float, resulting in wrong values. --- ## Limitations in (de)serialization In triton, float arguments to a kernel are encoded as "fp32" ([code](https://github.com/triton-lang/triton-cpu/blob/main-merged/python/triton/runtime/jit.py#L310-L326)): ``` elif isinstance(arg, float): return ("fp32", None) ``` But it seems that torch export serde uses double ([code](`d2eff5d454/torch/_export/serde/export_schema.thrift (L149)`)) because Thrift only has the double type: ``` union Argument { 10: bool as_none; 20: TensorArgument as_tensor; 30: list<TensorArgument> as_tensors; 50: i64 as_int; 70: list<i64> as_ints; 80: double as_float; ===> actually double ... ``` `TritonKernel` constructor loads attributes from a node, where `Constant` represents the variant type. And it only has `double` ([code](`d2eff5d454/torch/nativert/graph/Graph.h (L86)`)): ``` using Constant = std::variant< None, int64_t, std::vector<int64_t>, double, ===> triton float is loaded as double ``` So, NativeRT passes float arguments (originally in Triton) as double to triton kernels. But all of the triton backends (nvidia, amd and cpu) are reading them as float because the signature still says `fp32`. D84423898 was the current workaround: wrapping float arguments with tensors. ## The Fix Fixing the thrift definition isn't viable because Thrift only supports the double type. It's also possible to fix this on the triton side: it can downcast from double to float. But I would need to fix all backends. Instead, I think this diff would be the most effective way: when building `TritonKernel`, downcast the values to float right after loading the double arguments.
Test Plan: ``` buck test fbcode//mode/opt-amd-gpu fbcode//caffe2/test:test_export -- ``` Differential Revision: D85747160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166620 Approved by: https://github.com/XueningXu	2025-10-31 03:52:20 +00:00
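The mismatch described in the commit above can be illustrated with Python's standard `struct` module. This is only a sketch of why a value stored as a double cannot be read back as an fp32 without an explicit narrowing conversion; the helper names are hypothetical, little-endian layout is assumed, and how the bytes are actually misread in NativeRT depends on the backend's calling convention.

```python
import struct


def reinterpret_double_as_float(value: float) -> float:
    """Read the first 4 bytes of a little-endian double as if they were an fp32."""
    raw = struct.pack("<d", value)          # 8-byte IEEE 754 double
    return struct.unpack("<f", raw[:4])[0]  # misread: half the bytes, wrong layout


def downcast_double_to_float(value: float) -> float:
    """Round the double to the nearest fp32 value, as a narrowing cast would."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


scale = 0.5
misread = reinterpret_double_as_float(scale)   # bits reused without conversion
downcast = downcast_double_to_float(scale)     # explicit double -> float conversion
```

Here `misread` is `0.0` because the low four bytes of the double `0.5` are all zero, while `downcast` preserves `0.5`. Values that are not exactly representable in fp32, such as `0.1`, come back slightly rounded after the downcast, which is the expected cost of the narrowing.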
William Wen	267d0197bf	[dynamo] fix error_on_graph_break bug where non-empty checkpoint results in unwanted graph break resumption (#166586 ) Fixes https://github.com/pytorch/pytorch/issues/166589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166586 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166476, #166477	2025-10-31 03:36:27 +00:00
William Wen	1dec8a67a8	[dynamo, nested graph breaks] add disable_nested_graph_breaks decorator/context manager (#166477 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166477 Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007 ghstack dependencies: #166476	2025-10-31 03:36:27 +00:00
William Wen	797cd80b26	[dynamo, nested graph breaks] codegen dead nested cells correctly (#166476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166476 Approved by: https://github.com/Lucaskabela	2025-10-31 03:36:27 +00:00
PyTorch MergeBot	7d39401fa0	Revert "[BE][Typing][Dynamo] Type misc files in `torch/_dynamo/variables/` (#166569 )" This reverts commit `f1e4c42b6e`. Reverted https://github.com/pytorch/pytorch/pull/166569 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/166569#issuecomment-3471180280))	2025-10-31 03:31:01 +00:00
Simon Layton	e3ae0594d1	Add CUDA MXFP4 scaled mm support via. FBGEMM (#166526 ) Summary: * Pull in `f4f4bf16` from FBGemm to provide MXFP4 support for CUDA * Add testing Test Plan: Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/166526 Approved by: https://github.com/drisspg, https://github.com/ngimel	2025-10-31 03:17:27 +00:00
Lucas Kabela	f1e4c42b6e	[BE][Typing][Dynamo] Type misc files in `torch/_dynamo/variables/` (#166569 ) Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/` This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...) ### Coverage report: ``` mypy torch_dynamo/variables --linecount-report /tmp/coverage_log ``` Compare before to after - we go from 3826 to 7221 lines covered Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569 Approved by: https://github.com/williamwen42	2025-10-31 02:57:59 +00:00
Sun, Jiayi	d3e511f07c	[Inductor] support masked vectorization for the tail_loop for fp8 datatype (#163324 ) Summary: Support masked vectorization for the tail_loop for fp8 datatype. Example: ``` import torch def fn( x, scale, zero_point, quant_min, quant_max, dtype, ): x = torch.ops.quantized_decomposed.dequantize_per_tensor( x, scale, zero_point, quant_min, quant_max, dtype, ) x = torch.relu(x) x = torch.ops.quantized_decomposed.quantize_per_tensor( x, scale, zero_point, quant_min, quant_max, dtype ) return x quant_min = -128 quant_max = 127 dtype = torch.float8_e4m3fn x = torch.clamp(torch.randn((1, 7, 7, 9), dtype=torch.float32) * 100, quant_min, quant_max).to(dtype) zero_point = 100 scale = 0.01 with torch.no_grad(): compiled_fn = torch.compile(fn) compiled_fn(x, scale, zero_point, quant_min, quant_max, dtype) ``` Generated code: - Before ``` cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn', 'at::Float8_e4m3fn'], r''' #include <torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const at::Float8_e4m3fn* in_ptr0, at::Float8_e4m3fn* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L))) { auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); auto tmp1 = at::vec::convert<float>(tmp0); auto tmp2 = static_cast<float>(100.0); auto tmp3 = at::vec::Vectorized<float>(tmp2); auto tmp4 = tmp1 - tmp3; auto tmp5 = static_cast<float>(0.01); auto tmp6 = at::vec::Vectorized<float>(tmp5); auto tmp7 = tmp4 * tmp6; auto tmp8 = (tmp7); auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0)); auto tmp10 = tmp9 * tmp3; auto tmp11 = tmp10.round(); auto tmp12 = tmp11 + tmp3; auto tmp13 = static_cast<float>(-128.0); auto tmp14 = at::vec::Vectorized<float>(tmp13); auto tmp15 = at::vec::maximum(tmp12, 
tmp14); auto tmp16 = static_cast<float>(127.0); auto tmp17 = at::vec::Vectorized<float>(tmp16); auto tmp18 = at::vec::minimum(tmp15, tmp17); auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18); tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L))) { for (int64_t x0_tail = static_cast<int64_t>(432L);x0_tail < static_cast<int64_t>(441L); x0_tail++) { auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)]; auto tmp1 = c10::convert<float>(tmp0); auto tmp2 = static_cast<float>(100.0); auto tmp3 = float(tmp1 - tmp2); auto tmp4 = static_cast<float>(0.01); auto tmp5 = float(tmp3 * tmp4); auto tmp6 = c10::convert<float>(tmp5); auto tmp7 = std::max(tmp6, decltype(tmp6)(0)); auto tmp8 = float(tmp7 * tmp2); auto tmp9 = std::nearbyint(tmp8); auto tmp10 = float(tmp9 + tmp2); auto tmp11 = static_cast<float>(-128.0); auto tmp12 = max_propagate_nan(tmp10, tmp11); auto tmp13 = static_cast<float>(127.0); auto tmp14 = min_propagate_nan(tmp12, tmp13); auto tmp15 = c10::convert<at::Float8_e4m3fn>(tmp14); out_ptr0[static_cast<int64_t>(x0_tail)] = tmp15; } } } } } } ''') async_compile.wait(globals()) del async_compile class Runner: def __init__(self, partitions): self.partitions = partitions def recursively_apply_fns(self, fns): new_callables = [] for fn, c in zip(fns, self.partitions): new_callables.append(fn(c)) self.partitions = new_callables def call(self, args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1)) buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn) # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1 cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0) del arg0_1 return (buf0, ) ``` - After ``` cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn', 'at::Float8_e4m3fn'], r''' #include 
<torch/csrc/inductor/cpp_prefix.h> extern "C" void kernel(const at::Float8_e4m3fn* in_ptr0, at::Float8_e4m3fn* out_ptr0) { { for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L)) { { if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L))) { auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); auto tmp1 = at::vec::convert<float>(tmp0); auto tmp2 = static_cast<float>(100.0); auto tmp3 = at::vec::Vectorized<float>(tmp2); auto tmp4 = tmp1 - tmp3; auto tmp5 = static_cast<float>(0.01); auto tmp6 = at::vec::Vectorized<float>(tmp5); auto tmp7 = tmp4 * tmp6; auto tmp8 = (tmp7); auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0)); auto tmp10 = tmp9 * tmp3; auto tmp11 = tmp10.round(); auto tmp12 = tmp11 + tmp3; auto tmp13 = static_cast<float>(-128.0); auto tmp14 = at::vec::Vectorized<float>(tmp13); auto tmp15 = at::vec::maximum(tmp12, tmp14); auto tmp16 = static_cast<float>(127.0); auto tmp17 = at::vec::Vectorized<float>(tmp16); auto tmp18 = at::vec::minimum(tmp15, tmp17); auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18); tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16)); } if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L))) { auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L)); auto tmp1 = at::vec::convert<float>(tmp0); auto tmp2 = static_cast<float>(100.0); auto tmp3 = at::vec::Vectorized<float>(tmp2); auto tmp4 = tmp1 - tmp3; auto tmp5 = static_cast<float>(0.01); auto tmp6 = at::vec::Vectorized<float>(tmp5); auto tmp7 = tmp4 * tmp6; auto tmp8 = (tmp7); auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0)); auto tmp10 = tmp9 * tmp3; auto tmp11 = tmp10.round(); auto tmp12 = tmp11 + tmp3; auto tmp13 = static_cast<float>(-128.0); auto tmp14 = at::vec::Vectorized<float>(tmp13); auto tmp15 = 
at::vec::maximum(tmp12, tmp14); auto tmp16 = static_cast<float>(127.0); auto tmp17 = at::vec::Vectorized<float>(tmp16); auto tmp18 = at::vec::minimum(tmp15, tmp17); auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18); tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L)); } } } } } ''') async_compile.wait(globals()) del async_compile class Runner: def __init__(self, partitions): self.partitions = partitions def recursively_apply_fns(self, fns): new_callables = [] for fn, c in zip(fns, self.partitions): new_callables.append(fn(c)) self.partitions = new_callables def call(self, args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1)) buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn) # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1 cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0) del arg0_1 return (buf0, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163324 Approved by: https://github.com/Xia-Weiwen, https://github.com/mingfeima, https://github.com/jansel	2025-10-31 02:53:56 +00:00
Andy (An) Wang	d3be06cbdc	[MTIAGraph][Pytorch][2/n] Add binding for Python to C++, and hook for Pytorch to Fbcode (#165963 ) Summary: This diff is the binding and hook layer for MTIA Graph, including 1. binding between Python and C++ 2. hook between Pytorch and mtia fbcode <img width="1780" height="754" alt="image" src="https://github.com/user-attachments/assets/31e24e5b-8324-42d8-8d3b-59536bc18340" /> [Doc](https://docs.google.com/document/d/1Q3xdZAIqhBvuy2HxGDfJyXVmxYXUEeYSZSwsp7bcJF8/edit?tab=t.osb46a42t6wb#heading=h.ayp9tkk08x00) Test Plan: Will be tested in the python implementation which will use the binding and hook Differential Revision: D84457757 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165963 Approved by: https://github.com/malfet, https://github.com/albanD	2025-10-31 02:52:51 +00:00
Jeff Daily	1129605415	[ROCm][CI] create ROCm 7.1 images for binary builds (#166665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166665 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-31 02:52:37 +00:00
Boyuan Feng	a6b1ef1717	[GraphPartition] cache get_free_symbol_uses (#166338 ) Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs. `ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)` I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried to aten fallback for torchtitan which results in 10k+ aten nodes. When processing the 600-th node, it takes seconds to `get_free_symbol_uses()` for 1 node. Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times. `ee7434be82/torch/_inductor/ir.py (L4541-L4543)` This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338 Approved by: https://github.com/eellison	2025-10-31 02:50:10 +00:00
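The slowdown described in the commit above is typical of a recursive query over a graph with shared dependencies: each shared node is revisited once per path that reaches it. A minimal sketch of the caching idea follows, assuming a hypothetical `Node` class and a free function `free_symbol_uses`; neither is Inductor's actual IR or API, and the cache is only valid while nodes are not mutated.

```python
from typing import Dict, FrozenSet, List


class Node:
    """Hypothetical IR node: owns some symbols and reads from other nodes."""

    def __init__(self, symbols: List[str], inputs: List["Node"]) -> None:
        self.symbols = frozenset(symbols)
        self.inputs = inputs


calls = 0  # counts how many times a node's symbol set is actually computed


def free_symbol_uses(node: Node, cache: Dict[int, FrozenSet[str]]) -> FrozenSet[str]:
    """Collect symbols used by `node` and everything it depends on, with caching."""
    global calls
    key = id(node)
    if key in cache:
        return cache[key]
    calls += 1
    result = set(node.symbols)
    for inp in node.inputs:
        result |= free_symbol_uses(inp, cache)
    cache[key] = frozenset(result)
    return cache[key]


# Chain of "diamonds": every node reads the previous node twice, so an
# uncached traversal would double its work at each level.
node = Node(["s0"], [])
for level in range(1, 21):
    node = Node([f"s{level}"], [node, node])

cache: Dict[int, FrozenSet[str]] = {}
symbols = free_symbol_uses(node, cache)
```

With the cache, each of the 21 nodes is computed once; without it, the same traversal would perform 2^21 - 1 computations on this graph.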
Nikita Shulga	12577064dd	[MPS] Fix crash when max/min ops called for complex types (#166214 ) Raise an exception, as it's meaningless and results in segfault otherwise: ``` % python -c "import torch;torch.rand(10, dtype=torch.cfloat, device='mps').amax()" (mpsFileLoc): /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:176:0: error: 'mps.reduction_max' op operand #0 must be tensor of mps native type values, but got 'tensor<10xcomplex<f32>>' (mpsFileLoc): /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:176:0: note: see current operation: %2 = "mps.reduction_max"(%arg0, %1) <{keep_dims, propagate_nans}> : (tensor<10xcomplex<f32>>, tensor<1xsi32>) -> tensor<1xcomplex<f32>> (mpsFileLoc): /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:176:0: error: 'mps.reduction_max' op operand #0 must be tensor of mps native type values, but got 'tensor<10xcomplex<f32>>' (mpsFileLoc): /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:176:0: note: see current operation: %2 = "mps.reduction_max"(%arg0, %1) <{keep_dims, propagate_nans}> : (tensor<10xcomplex<f32>>, tensor<1xsi32>) -> tensor<1xcomplex<f32>> /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:1347: failed assertion `original 
module failed verification' zsh: abort python -c ``` To be tested by `test_ops.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/166214 Approved by: https://github.com/dcci, https://github.com/kulinseth, https://github.com/Skylion007 ghstack dependencies: #166272	2025-10-31 02:37:20 +00:00
Tianren Gao	24b6eb7727	[Inductor] Enable Custom op Autotune Decompositions and Parameter Tuning (#164212 ) This PR introduces CustomOp autotuning. It allows users to provide a CustomOpConfig: (1) to register multiple (optional) decomposition implementations for custom operations and (2) to register the parameter tuning knobs and values they want to tune for the decompositions, so that Inductor automatically selects the best-performing variant through its autotune benchmarking. Example: ```python register_custom_op_autotuning( custom_op=my_attention_op, configs=[ CustomOpConfig(attention_impl, head_dim=32, method='chunked'), CustomOpConfig(attention_impl, head_dim=64, method='tiled'), CustomOpConfig(head_dim=128), # no decompositions ], input_gen_fns={ "query": lambda fake: torch.randn_like(fake, device='cuda'), "key": lambda fake: torch.randn_like(fake, device='cuda'), "value": lambda fake: torch.randn_like(fake, device='cuda'), } ) ``` CustomOpConfig: Each CustomOpConfig defines exactly one autotuning variant with specific parameter values and an optional decomposition implementation with PyTorch aten ops. Users can register their own tuning knobs and optional decomposition functions for the same custom operation. The system automatically benchmarks all variants to select the best-performing one. If no decomposition is provided in the config, the CustomOp's default implementation will be used. Custom Input Generation: Users can provide custom input generators via an optional `input_gen_fns` to control how synthetic inputs are created during benchmarking. This enables more realistic performance testing by generating inputs that match expected data distributions and characteristics for each tensor argument. More examples with autotune logs: 1. Allow users to register customOp decompositions with tuning parameters for autotuning.
Example usage: ```python from torch._inductor.kernel.custom_op import CustomOpConfig, register_custom_op_autotuning def decompose_k_implementation(a: torch.Tensor, b: torch.Tensor, k_splits: int = 4) -> torch.Tensor: """Matrix multiply with k-way decomposition.""" # Implementation...with k_splits @torch.library.custom_op("my_lib::decompose_k", mutates_args=()) def test_decompose_k_op( a: torch.Tensor, b: torch.Tensor, k_splits: int ) -> torch.Tensor: return decompose_k_implementation(a, b, k_splits) # Register autotuning with different k_splits values register_custom_op_autotuning( custom_op=test_decompose_k_op, configs=[ CustomOpConfig(decompose_k_implementation, k_splits=2), CustomOpConfig(decompose_k_implementation, k_splits=32), CustomOpConfig(decompose_k_implementation, k_splits=64), CustomOpConfig(k_splits=128), # can make decomposition optional, then use default impl test_decompose_k_op CustomOpConfig(k_splits=256) ], input_gen_fns={ "a": lambda fake: torch.randn_like(fake, device='cuda') * 0.1, "b": lambda fake: torch.randn_like(fake, device='cuda') * 0.1, } ) ``` Example result: ``` {"num_choices": 6, "num_triton_choices": 0, "best_kernel": "test_decompose_k_autotuned_fallback_default", "best_time": 0.09980800002813339} AUTOTUNE test_decompose_k_autotuned(256x65536, 65536x1024) strides: [65536, 1], [1024, 1] dtypes: torch.float16, torch.float16 test_decompose_k_autotuned_fallback_default 0.0998 ms 100.0% test_decompose_k_autotuned_decompose_k_implementation_k_splits_2_0 0.1096 ms 91.0% CustomOp decompose_k_implementation_k_splits_2 test_decompose_k_autotuned_decompose_k_implementation_k_splits_32_1 0.1277 ms 78.2% CustomOp decompose_k_implementation_k_splits_32 test_decompose_k_autotuned_decompose_k_implementation_k_splits_64_2 0.1454 ms 68.6% CustomOp decompose_k_implementation_k_splits_64 test_decompose_k_autotuned_decompose_k_implementation_k_splits_128_3 0.1536 ms 65.0% CustomOp decompose_k_implementation_k_splits_128 
test_decompose_k_autotuned_decompose_k_implementation_k_splits_256_4 0.2084 ms 47.9% CustomOp decompose_k_implementation_k_splits_256 ``` 2. Allow user to tune parameter knob by passing the parameter and values in the CustomOpConfig. Example ```python def mlp_variants(input_tensor, gate_weight, up_weight, down_weight, method): """MLP implementation with different computational approaches.""" if method == 0: # Standard separate matmuls # ... implementation elif method == 1: # Batched approach with torch.mm # ... implementation elif method == 2: # Fused weights approach # ... implementation @torch.library.custom_op("my_lib::mlp_op", mutates_args=()) def mlp_op( input_tensor: torch.Tensor, gate_weight: torch.Tensor, up_weight: torch.Tensor, down_weight: torch.Tensor, method: int, ) -> torch.Tensor: return mlp_variants( input_tensor, gate_weight, up_weight, down_weight, method=method ) register_custom_op_autotuning( custom_op=mlp_op, configs=[ CustomOpConfig(method=0), CustomOpConfig(method=1), CustomOpConfig(method=2), # method=0 is the default fallback in the original op ], input_gen_fns={ "input_tensor": lambda fake: torch.randn_like(fake, device='cuda') * 0.1, "gate_weight": lambda fake: torch.randn_like(fake, device='cuda') * 0.05, # ... 
other input generators } ) ``` Example result: ``` AUTOTUNE test_mlp_autotuned(4x32x512, 512x1024, 512x1024, 1024x256) test_mlp_autotuned_mlp_variants_method_2 0.0181 ms 100.0% CustomOp mlp_variants_method_2 test_mlp_autotuned_mlp_variants_method_1 0.0185 ms 97.8% CustomOp mlp_variants_method_1 test_mlp_autotuned_mlp_default_fallback_method_0 0.0198 ms 91.4% CustomOp fallback ``` ### Test Suite (`test/inductor/test_custom_op_autotune.py`) * RMSNorm autotuning: Tests different RMSNorm implementations with dynamic input shapes * MLP autotuning: Tests different MLP decomposition and tuning "method" parameter * DecomposeK: Tests different k_splits values for matrix multiplication decomposition with k dim split * Multi-parameter tuning: Tests configs with multiple tuning parameters (scale_mode, chunk_size) ### Next Step: - Enable Max-autotune with user passed in max-autotune config. https://github.com/pytorch/pytorch/pull/165526/files - Support inline epilogue fusion for selected best customop decomposition with surrounding elementwise ops. https://github.com/pytorch/pytorch/pull/165952/files - Support customop autotune considering fusion with multiTemplateBuffer. WIP Pull Request resolved: https://github.com/pytorch/pytorch/pull/164212 Approved by: https://github.com/zou3519	2025-10-31 02:28:00 +00:00
Amal Dev Haridevan	32066772b3	Fix torch.full with dynamic tensor fill_value in torch.compile (#166554 )

Fixes #166253

## Summary
When `torch.full` is called with a 0-D tensor as `fill_value` inside a `torch.compile`'d function, the value was being incorrectly cached, causing subsequent calls with different values to return the first value.

## Root Cause
The Dynamo handler for `torch.full` was calling `aten._local_scalar_dense` to convert tensor fill_values to Python scalars at compile time, which baked the value into the compiled graph as a constant.

## Solution
Modified the Dynamo handler to decompose `torch.full(size, tensor_fill_value)` into `empty(size).fill_(tensor_fill_value)` when `fill_value` is a `TensorVariable`, keeping the fill value dynamic in the compiled graph.

## Testing
Added test case that verifies torch.full works correctly with dynamic tensor fill_values across multiple calls and dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166554
Approved by: https://github.com/Lucaskabela	2025-10-31 00:56:02 +00:00
Nikita Shulga	47f0024310	[CI][BE] Factor out repeated test code (#166481 ) Into `_run_single_arg_fwd` Pull Request resolved: https://github.com/pytorch/pytorch/pull/166481 Approved by: https://github.com/Skylion007	2025-10-31 00:52:50 +00:00
Yuanyuan Chen	98d640bb11	Remove AT_USE_HIPSPARSE_GENERIC_API (#166393 ) This macro is not used in OSS anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166393 Approved by: https://github.com/ezyang	2025-10-31 00:49:09 +00:00
Nikita Shulga	5d288bc3f7	[BE] Move GreenContext implementation details to cpp (#166462 ) - Remove all complex defines logic from the header - Make GreenContext constructor private, as it should only be created via the static method as singleton - Delete unused `getContext` and `getGreenContext` methods - Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo - Suppress `-Wunused-private-field` if GreenContext is not available Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462 Approved by: https://github.com/ngimel, https://github.com/eqy	2025-10-31 00:48:01 +00:00
William Wen	bfb47ec50e	[dynamo] support tracing new typing union syntax X \| Y (#166599 ) To do in a followup - I think there's an approach to reconstruct typing variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166599 Approved by: https://github.com/SherlockNoMad, https://github.com/anijain2305, https://github.com/Skylion007	2025-10-30 23:59:27 +00:00
Prachi Gupta	7a0cd8ed09	[ROCm] Disable `__builtin_amdgcn_rcpf` for gfx90a (#166454 ) Improves accuracy for some failing tests. test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py::TestClipGradNormWorldSize4::test_clip_grad_norm_2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/18930221123/job/54046876467) [HUD commit link](`f20bf77874`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166454 Approved by: https://github.com/jerrymannil, https://github.com/jeffdaily	2025-10-30 23:39:00 +00:00
angelayi	984e64b2cd	[inductor] Fix constant folder (#166655 )

Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1351999569843522/ where the resulting graph of constant folder uses a sym node which has been created later.

Graph diff: https://www.internalfb.com/intern/diffing/?paste_number=2014609054

Before:
```
%full_65 : [num_users=1] = call_function[target=torch.ops.aten.full.default](args = ([%sym_size_int_47, 768], 1), kwargs = {dtype: torch.int64, layout: torch.strided, device: cuda:0, pin_memory: False})
%select_18 : [num_users=1] = call_function[target=torch.ops.aten.select.int](args = (%full_65, 1, 0), kwargs = {})
%mul_2792 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%select_18, 0), kwargs = {})
%embedding_4 : [num_users=1] = call_function[target=torch.ops.aten.embedding.default](args = (%_uv__surface_embeddings_weight, %mul_2792), kwargs = {})
```

After:
```
%full_65 : [num_users=1] = call_function[target=torch.ops.aten.full.default](args = ([%sym_size_int_47, 768], 1), kwargs = {dtype: torch.int64, layout: torch.strided, device: cuda:0, pin_memory: False})
%full_default_1 : [num_users=1] = call_function[target=torch.ops.aten.full.default](args = ([%sym_size_int_150], 0), kwargs = {dtype: torch.int64, layout: torch.strided, device: cuda:0, pin_memory: False})
%embedding_4 : [num_users=1] = call_function[target=torch.ops.aten.embedding.default](args = (%_uv__surface_embeddings_weight, %full_default_1), kwargs = {})
...
%sym_size_int_150 : [num_users=7] = call_function[target=torch.ops.aten.sym_size.int](args = (%view_193, 0), kwargs = {})
```

I couldn't figure out a small repro for this :/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166655
Approved by: https://github.com/eellison	2025-10-30 22:51:28 +00:00
Pian Pawakapan	b9bcb37f40	[DebugMode] store stringify args by default (#166347 ) DebugMode currently stores dispatch call args & kwargs, which is all intermediate tensors and more. This quickly OOMed on GPU when trying to debug some torchtitan / llama 8b models. This defaults to storing the stringified version, adding a flag `DebugMode(store_original_args=True)` if users want to store the original args as-is (and for BC). Pull Request resolved: https://github.com/pytorch/pytorch/pull/166347 Approved by: https://github.com/yushangdi	2025-10-30 22:12:23 +00:00
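The trade-off described in this commit can be sketched in plain Python. The `CallLog` class below is hypothetical and is not the `DebugMode` implementation; only the flag name `store_original_args` is borrowed from the commit message. It shows why storing a string form by default avoids keeping the original argument objects alive.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CallLog:
    """Hypothetical call recorder, for illustration only."""

    # Mirrors the flag named in the commit message; everything else is assumed.
    store_original_args: bool = False
    records: list = field(default_factory=list)

    def record(self, op_name: str, *args: Any, **kwargs: Any) -> None:
        if self.store_original_args:
            # Keeps references to the arguments, so they cannot be freed.
            self.records.append((op_name, args, kwargs))
        else:
            # Keeps only text, so the arguments can be garbage collected.
            self.records.append((op_name, repr(args), repr(kwargs)))


default_log = CallLog()
default_log.record("add", [1, 2, 3], alpha=2)

original_log = CallLog(store_original_args=True)
original_log.record("add", [1, 2, 3], alpha=2)
```

With the default setting each record holds only strings; with `store_original_args=True` the record holds the argument objects themselves, which is what can exhaust memory when the arguments are large intermediate tensors.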
Chien-Chin Huang	7e3b9d105e	[CP][BE][2/2] Refactor the code structure (#166501 ) Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keep the import so that the existing users of CP won't be affected. Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with 3000+ LoC change and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC. This is the second PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166501 Approved by: https://github.com/Skylion007 ghstack dependencies: #166456	2025-10-30 22:07:07 +00:00
Artem Kuzmitckii	45c3f02d69	[ROCm] moved gfx1100 back to experimental status for AOTriton (#166397 ) This follows the AOTriton commit `8625c4faee`. These changes were missed in the 0.11b release: https://github.com/pytorch/pytorch/pull/161754 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166397 Approved by: https://github.com/jeffdaily	2025-10-30 21:43:01 +00:00
eellison	f5543e3741	[wip] fix searchsorted non dense (#165064 ) Fix for https://github.com/pytorch/pytorch/issues/163528 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165064 Approved by: https://github.com/benjaminglass1, https://github.com/mlazos	2025-10-30 21:21:24 +00:00
Nichols A. Romero	5fc2c7a2a1	[ROCm][inductor] More configs for pointwise kernels. (#166470 ) This config improves performance by 250% on some kernels that contain `t1.atomic_add(...)`. Again, we conditionalize for ROCm/HIP, so there is no impact to NV. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166470 Approved by: https://github.com/PaulZhang12, https://github.com/mlazos, https://github.com/eellison, https://github.com/jansel	2025-10-30 21:20:12 +00:00
zhudada	7692fa09cd	[Code Clean] Clean asserts in torch/ao/quantization/fx/* (#165420 ) Replace assert statements with explicit if/raise patterns in: - torch/ao/quantization/fx/* (177 errors) Partially fixes #164878 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165420 Approved by: https://github.com/RohitRathore1, https://github.com/fffrog, https://github.com/albanD	2025-10-30 20:53:36 +00:00
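The assert-to-if/raise pattern mentioned in this commit looks roughly like the sketch below. The function names and the `scale` check are invented for illustration and are not taken from `torch/ao/quantization/fx`. One common motivation for the pattern, not stated in the commit itself, is that `assert` statements are removed when Python runs with `-O`.

```python
def check_scale_with_assert(scale: float) -> float:
    # Before: the check disappears under `python -O`,
    # and a failure surfaces as a bare AssertionError.
    assert scale > 0, "scale must be positive"
    return scale


def check_scale_explicit(scale: float) -> float:
    # After: the check always runs and raises a specific exception type.
    if scale <= 0:
        raise ValueError(f"scale must be positive, got {scale}")
    return scale
```

Callers can then catch `ValueError` specifically instead of relying on `AssertionError`.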
Eddie Yan	df71b70727	[cuDNN][conv] Re-enable cuDNN for 3D convolutions (fixed in 9.15+) (#166480 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166480 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-10-30 20:47:20 +00:00
Maggie Moss	80ba6e458f	Add warning when users have incomplete setup for type checking (#166603 )

Looking for feedback on this approach. Received user reports of spurious pyrefly errors for users using hg instead of git. I think this was due to the fact that when using a venv and git, `make setup-env` installs requirements and pulls from a nightly torch wheel, which is needed for pyrefly to type check properly.

Initial documentation for `make setup-env` I found here: https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#developing-pytorch

Testing:
```
hg clone --git ssh://git@github.com/pytorch/pytorch.git
conda create -n pytorch_env python=3.10 # (or manually create venv instead of using script)
cd pytorch
pip install -r requirements.txt
pip install -r requirements-build.txt
lintrunner init
# check how many pyrefly errors - 15,709 errors (11,693 ignored)
lintrunner
# confirm error message / warning appears
>>> General linter failure: Warning (PYREFLY) nightly-wheel-not-run pytorch-nightly.pth not found. You may need to run make setup-env or make setup-env-conda to install nightly binaries and type stubs.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166603
Approved by: https://github.com/aorenste	2025-10-30 20:37:44 +00:00
Yuanyuan Chen	0d50e5d8d4	[3/N] Fix unused loop variables (#166509 ) This PR removes unused loop variables in tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166509 Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007	2025-10-30 20:13:51 +00:00
Simon Layton	99b05d1b78	Better 1x128, 128x128 error handling on non-Hopper (#166639 ) Summary: Blockwise 1x128 and 128x128 scaling is only available on CUDA >= 12.9 and only on Hopper GPUs. Attempting to run on B200 would give a hard-to-debug `CUBLAS_STATUS_NOT_SUPPORTED`. Add a more helpful `NotImplementedError` to catch this case. Also more explicitly disable ROCm builds for relevant methods, based on lack of support per [hipBLASlt docs](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/reference/datatypes.html#_CPPv4N28hipblasLtMatmulMatrixScale_t40HIPBLASLT_MATMUL_MATRIX_SCALE_VEC128_32FE). Test Plan: Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/166639 Approved by: https://github.com/drisspg	2025-10-30 20:13:06 +00:00
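The kind of early check this commit describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the code added in the PR: the function name is hypothetical, and the CUDA version and compute capability are passed in as tuples rather than queried from a device. The only facts used are those in the commit message (CUDA >= 12.9, Hopper only) plus Hopper's compute capability of 9.0.

```python
def check_blockwise_scaling_supported(
    cuda_version: tuple[int, int],
    compute_capability: tuple[int, int],
) -> None:
    """Hypothetical guard: fail early with a readable error message."""
    if cuda_version < (12, 9):
        raise NotImplementedError(
            "Blockwise 1x128 / 128x128 scaling requires CUDA >= 12.9, "
            f"got {cuda_version[0]}.{cuda_version[1]}"
        )
    # Hopper GPUs report compute capability 9.0.
    if compute_capability != (9, 0):
        raise NotImplementedError(
            "Blockwise 1x128 / 128x128 scaling is only available on Hopper "
            f"(SM 9.0), got SM {compute_capability[0]}.{compute_capability[1]}"
        )
```

Raising `NotImplementedError` before the library call gives the user an actionable message instead of a low-level status code.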
Eddie Yan	f911d64750	[CUDA] xFail `max-autotune` grouped gemm tests on devices with insufficient SM count (#165921 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165921 Approved by: https://github.com/ngimel	2025-10-30 20:05:07 +00:00
Yuanyuan Chen	52db60170d	Enable verify_dynamo on Python 3.13 (#166497 ) Dynamo now supports Python 3.13. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166497 Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42	2025-10-30 19:52:32 +00:00


95265 Commits