pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 00:20:18 +01:00

Author	SHA1	Message	Date
Boyuan Feng	dfebdcab86	[GraphPartition] cache get_free_symbol_uses (#166338 ) Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs. `ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)` I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried an aten fallback for torchtitan, which results in 10k+ aten nodes. When processing the 600th node, a single `get_free_symbol_uses()` call for one node takes seconds. Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times. `ee7434be82/torch/_inductor/ir.py (L4541-L4543)` This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338 Approved by: https://github.com/eellison	2025-10-31 21:24:05 +00:00
Wang, Chuanqi	b09fb481e0	[CD] Upgrade GCC version to 13 for XPU build (#162474 ) Follow #152426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162474 Approved by: https://github.com/zxiiro, https://github.com/atalman	2025-10-31 21:15:37 +00:00
Nikita Shulga	4e7232c5da	[MPS] Fix `smooth_l1_loss` backward for fp16 (#166687 ) And enable fp16 implementation for CPU, which simplifies OpInfo definitions for the op Pull Request resolved: https://github.com/pytorch/pytorch/pull/166687 Approved by: https://github.com/Skylion007 ghstack dependencies: #166214	2025-10-31 21:13:46 +00:00
PyTorch MergeBot	93a70c717a	Revert "Add CUDA MXFP4 scaled mm support via. FBGEMM (#166526 )" This reverts commit `e3ae0594d1`. Reverted https://github.com/pytorch/pytorch/pull/166526 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/166526#issuecomment-3474907536))	2025-10-31 21:10:28 +00:00
Yuanyuan Chen	d97144d31e	[5/N] Remove unused loop variables in tests (#166716 ) This PR removes unused loop variables in tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166716 Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007	2025-10-31 20:47:57 +00:00
William Wen	e4043884c7	[dynamo, 3.14] fix segfault due to improper create_call_function_ex (#166678 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166678 Approved by: https://github.com/malfet	2025-10-31 20:44:53 +00:00
Lucas Kabela	4a7bc1d522	[BE][Typing][Dynamo] Type misc files in `torch/_dynamo/variables/` (#166569 ) Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/` This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...) ### Coverage report: ``` mypy torch_dynamo/variables --linecount-report /tmp/coverage_log ``` Compare before to after - we go from 3826 to 7221 lines covered Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569 Approved by: https://github.com/williamwen42, https://github.com/Skylion007	2025-10-31 20:42:27 +00:00
Nicolas De Carli	8209a0506b	[Pytorch] Enable aarch64 convert autovec only on clang (#166739 ) Summary: We've noted issues with modern GCC versions. Until further investigation is carried out, we'll leave the code enabled only on clang. Test Plan: CI Differential Revision: D85968395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166739 Approved by: https://github.com/mcfi, https://github.com/Skylion007, https://github.com/robert-hardwick	2025-10-31 20:22:33 +00:00
William Wen	70aeb49198	[dynamo] clarify graph break handling/logging in symbolic_convert (#166587 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166587 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166476, #166477, #166586	2025-10-31 20:13:16 +00:00
Nikita Shulga	cf9a834f39	[BE] Move GreenContext implementation details to cpp (#166462 ) - Remove all complex defines logic from the header - Make GreenContext constructor private, as it should only be created via the static method as singleton - Delete unused `getContext` and `getGreenContext` methods - Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo - Suppress `-Wunused-private-field` if GreenContext is not available Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462 Approved by: https://github.com/ngimel, https://github.com/eqy	2025-10-31 20:11:02 +00:00
Yuanyuan Chen	856a7a5298	Add missing device to namedtensor tests (#166717 ) This PR passes unused `device` argument to tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166717 Approved by: https://github.com/Skylion007	2025-10-31 20:04:41 +00:00
Camyll Harajli	ef8d97efcf	fix broken nn_convolution test (#166666 ) Summary: Broken by oss diff during oncall by third party contributor Test Plan: buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test:nn_convolution -- --run-disabled Differential Revision: D85899891 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166666 Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/Skylion007	2025-10-31 19:59:50 +00:00
Fadi Arafeh	d2be06f673	[cpu][fix] Update ACL version to fix crashes with tensor sizes > 2^31-1 (#165904 ) ---- - Updates Arm Compute Library (ACL) to v52.6.0 - v52.6.0 contains https://github.com/ARM-software/ComputeLibrary/pull/1201 which fixes crashes with tensors of sizes > 2^31-1 fixes: #165654 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165904 Approved by: https://github.com/malfet	2025-10-31 19:37:26 +00:00
James Wu	08f4535378	Refactor AOTAutogradCacheEntry into AOTAutogradResult (#166656 ) This PR refactors the name AOTAutogradCacheEntry into AOTAutogradResult, and BundledAOTAutogradCacheEntry into BundledAOTAutogradResult. It also moves all corresponding files to a new file, `aot_autograd_result`, which is analogous to `output_code.py` from Inductor. Having all these be called cache entries made sense when all we used them for was caching. But with AOT compile using BundledAOTAutogradCacheEntry, we want a more generalized naming structure. This is a no-op change, and all existing tests should pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166656 Approved by: https://github.com/zhxchen17 ghstack dependencies: #166650	2025-10-31 18:54:09 +00:00
James Wu	30157d30f0	Add regional aot eager support to AOTAutogradCacheEntry (#166650 ) This PR does two things: - It genericizes `BundledAOTAutogradCacheEntry` to support any outputcode, not just CompiledFxGraphs - It adds a brand new OutputCode for the `aot_eager_regional_inductor` backend, i.e. a graph module that has regional inductor components in it. This allows BundledAOTAutogradCache to just integrate nicely with inductor out of the box, but more importantly, it allows the result of aot_autograd to be fully serializable when using `aot_eager_regional_inductor`. This will allow us to AOT precompile cases where we have an eager graph that has scooped up inductor bits. It's a bit unfortunate that the naming makes BundledAOTAutogradCacheEntry sound like its primary use is for caching, but really the more common use is going to be as an AOTAutogradOutput. It may be worth revisiting how to refactor/rename these in a later PR: - AOTAutogradCacheEntry -> AOTAutogradResult - BundledAOTAutogradCacheEntry -> BundledAOTAutogradResult Pull Request resolved: https://github.com/pytorch/pytorch/pull/166650 Approved by: https://github.com/zhxchen17	2025-10-31 18:54:09 +00:00
IvanKobzarev	b470e59c38	partitioner option to ignore partitioner_tag for abstract usage (#166725 ) Partitioner functionality is appealing to use in different scenarios (E.g. Autoparallel) We have special logic about "partitioner_tag" from meta that is only needed for forward/backward split. Adding optional argument to avoid it and do only generic split based on inputs/outputs. Potentially we want to make `_extract_graph_with_inputs_outputs` without underscore :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166725 Approved by: https://github.com/bdhirsh	2025-10-31 18:50:02 +00:00
PyTorch MergeBot	85b85f6c2c	Revert "[pytree] add `treespec_{leaf,tuple,dict}` functions for args_spec modification (#160843 )" This reverts commit `108bb224f7`. Reverted https://github.com/pytorch/pytorch/pull/160843 on behalf of https://github.com/atalman due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/160843#issuecomment-3474354428))	2025-10-31 18:31:32 +00:00
Nicolas De Carli	b71966f67b	[PyTorch] Improve aarch64 performance of bfloat16 ops - retry (#166028 ) (#166641 ) Summary: PR allows compiler to better optimize some bfloat16-based operations, when run on NEON. Retrying to land the code, after noting that these expressions became available in recent compiler versions. Current CI benchmark binary_test.py will measure affected codepaths. Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2: Before: bfloat16 add: 250.503us bfloat16 sub: 245.674us bfloat16 neg: 113.945us bfloat16 abs: 115.953us bfloat16 reciprocal: 262.602us After: bfloat16 add: 203.862us ---> 23% higher throughput bfloat16 sub: 201.526us ---> 22% higher throughput bfloat16 neg: 68.416us ---> 67% higher throughput bfloat16 abs: 71.003us ---> 63% higher throughput bfloat16 reciprocal: 177.834us ---> 48% higher throughput Test Plan: Correctness: buck2 test mode/opt //caffe2/test:test_ops buck2 test mode/opt //caffe2/test:torch Performance: buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test Reviewed By: mcfi Differential Revision: D85809843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166641 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-10-31 18:21:04 +00:00
Scott Wolchok	0947765eb9	Cache even more work for return_and_correct_aliasing (#166365 ) Yet another pass found even more work we can move to be done only once. This seems to knock a few microseconds off the DTensor dispatch fast path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166365 Approved by: https://github.com/bdhirsh	2025-10-31 18:03:05 +00:00
Jeff Daily	239e7b541a	[ROCm][CI] upgrade nightly wheels to ROCm 7.1 (#166730 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166730 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-31 17:30:47 +00:00
Justin Chu	ffaa6578b7	Revise deprecation warning for ONNX exporter (#166692 ) Updated deprecation warning for ONNX export to reflect the current state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166692 Approved by: https://github.com/titaiwangms	2025-10-31 17:23:55 +00:00
Jane Xu	365ed62f61	Document LibTorch ABI more, add README to headeronly (#166661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166661 Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD	2025-10-31 17:18:13 +00:00
PyTorch MergeBot	fcc1063566	Revert "[BE][Typing][Dynamo] Type misc files in `torch/_dynamo/variables/` (#166569 )" This reverts commit `aa9c96af04`. Reverted https://github.com/pytorch/pytorch/pull/166569 on behalf of https://github.com/Lucaskabela due to Lintrunner not fixed due to race condition at landing ([comment](https://github.com/pytorch/pytorch/pull/166569#issuecomment-3474012637))	2025-10-31 16:59:33 +00:00
Jazlyn Li	121235956b	update Node.is_impure check if subgraph contains impure ops (#166609 ) Summary: ## Context when `const_fold.split_const_subgraphs` sees a `call_module` node that is a GraphModule, by the existing implementation it can mark this node as const-foldable when it shouldn't. For example, a parent graph contains a `call_module` to a subgraph that has no inputs but contains impure ops inside. ``` parent graph(): %sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {}) return (getitem,) submodule graph(): %randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cpu, pin_memory: False}) %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {}) return (add,) ``` when `submodule` graph is fed to const_fold.split_const_subgraph, it would come out unmodified since randn is impure. But if the `submodule` is called by a `parent` graph, when `parent` is fed to const_fold.split_const_subgraph, it would come out folded. ``` parent after fold graph(): %_fx_const_folded_attrs : [num_users=1] = get_attr[target=_FX_CONST_FOLDED_ATTRS] return (_fx_const_folded_attrs,) ``` This is because the `node.is_impure()` check inside `const_fold.split_const_subgraph` falls through, leading the call_module node to be marked as pure. 
## Fix We can update `fx.node.Node.is_impure` function to check for ops inside a call_module node with an additional `subgraph_has_impure_ops` check: - if a call_module node calls a GraphModule, - check any call_function nodes are impure ops - recursively check any call_module nodes that call GraphModule If the call_module subgraph has impure ops, return True to `is_impure` Test Plan: added tests to test_fx_const_fold.py Differential Revision: D85798483 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166609 Approved by: https://github.com/blaine-rister	2025-10-31 16:58:18 +00:00
Lucas Kabela	aa9c96af04	[BE][Typing][Dynamo] Type misc files in `torch/_dynamo/variables/` (#166569 ) Provides type coverage to ~3000 LOC and 200 methods in `torch/_dynamo/variables/` This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...) ### Coverage report: ``` mypy torch_dynamo/variables --linecount-report /tmp/coverage_log ``` Compare before to after - we go from 3826 to 7221 lines covered Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569 Approved by: https://github.com/williamwen42	2025-10-31 16:56:50 +00:00
Jeff Daily	c3b71d5499	[ROCm][CI] remove relaxed tolerance for tf32 tests (#166478 ) Instead of relaxing tolerances for certain unit tests that exercise TF32 on MI300, skip the tests until hipblaslt accuracy is improved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166478 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com> Co-authored-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>	2025-10-31 16:15:42 +00:00
Kurt Mohler	1e3600b528	[MPS] Move `logaddexp/logaddexp2` to Metal and support complex (#166670 ) NOTE: Complex inputs are only supported in `logaddexp`. Since `logaddexp2` does not support complex inputs for CPU, it is not enabled for MPS in this PR either. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166670 Approved by: https://github.com/malfet	2025-10-31 16:15:02 +00:00
Xuan Zhang	fee7624bd6	[PT2] set choice handler in config (#166607 ) Summary: We were setting the custom inductor choice using `torch._inductor.virtualized.V.set_choices_handler(CustomInductorChoices())`. However, this leads to inconsistent behaviors, even for jobs that are submitted back to back. In this diff, we pass in the choice handler via an inductor config and overwrite the default behavior when the config is provided. This solves the inconsistent behavior. Test Plan: see D85785892 (internal only) Differential Revision: D85785879 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166607 Approved by: https://github.com/eellison	2025-10-31 15:40:05 +00:00
Jeff Daily	24e94e021a	[ROCm][CI] create ROCm 7.1 magma tarball (#166693 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166693 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-31 15:20:00 +00:00
Xuehai Pan	69be99ee51	Remove manually synced arch versions in `tools/nightly.py` (#166616 ) Discussed with @atalman offline. To reduce duplicate changes and reduce the number of files to change when updating arch versions. ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/166616 Approved by: https://github.com/ezyang	2025-10-31 15:11:28 +00:00
Nikita Vedeneev	034e951b0c	[CUDA][cuBLASLt] addmm -- extend bias fusions to cases with (1 by n) shapes (#166307 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166307 Approved by: https://github.com/eqy	2025-10-31 14:30:41 +00:00
Justin Chu	160ab53dd5	Update weight tensor initialization in RMSNormalization (#166550 ) Ensure a >1d tensor as weight for ORT compatibility. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166550 Approved by: https://github.com/titaiwangms	2025-10-31 14:29:27 +00:00
PyTorch MergeBot	5bcfdae71d	Revert "Make PT2 compile backprop through custom op without autograd key a hard error (#166367 )" This reverts commit `4acc66f119`. Reverted https://github.com/pytorch/pytorch/pull/166367 on behalf of https://github.com/atalman due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/166367#issuecomment-3473150269))	2025-10-31 13:44:05 +00:00
PyTorch MergeBot	4e8ba37ce3	Revert "[BE] Move GreenContext implementation details to cpp (#166462 )" This reverts commit `5d288bc3f7`. Reverted https://github.com/pytorch/pytorch/pull/166462 on behalf of https://github.com/atalman due to Sorry, Reverting. Failure: test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_greencontext_carveout_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/18962393091/job/54154156892) [HUD commit link](`85b035ca9c`) ([comment](https://github.com/pytorch/pytorch/pull/166462#issuecomment-3473060299))	2025-10-31 13:20:48 +00:00
PyTorch MergeBot	26534e9809	Revert "[GraphPartition] cache get_free_symbol_uses (#166338 )" This reverts commit `a6b1ef1717`. Reverted https://github.com/pytorch/pytorch/pull/166338 on behalf of https://github.com/atalman due to Failure: test/nn/test_convolution.py::TestConvolutionNN::test_conv3d_overflow_values [GH job link](https://github.com/pytorch/pytorch/actions/runs/18961173726/job/54149112920) [HUD commit link](`a6b1ef1717`) ([comment](https://github.com/pytorch/pytorch/pull/166338#issuecomment-3472980329))	2025-10-31 12:57:56 +00:00
PyTorch MergeBot	657f8c3e21	Revert "Fix torch.full with dynamic tensor fill_value in torch.compile (#166554 )" This reverts commit `32066772b3`. Reverted https://github.com/pytorch/pytorch/pull/166554 on behalf of https://github.com/atalman due to Failure: test/nn/test_pooling.py::TestPoolingNNDeviceTypeCPU::test_max_pool_nan_inf_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18959368975/job/54144148546) [HUD commit link](`32066772b3`) ([comment](https://github.com/pytorch/pytorch/pull/166554#issuecomment-3472976911))	2025-10-31 12:55:31 +00:00
Mwiza Kunda	b0831930ed	[inductor] Mark / restrict tests that only work if ATen is used for matmul (#166518 ) These tests only work if max_autotune=False (default), which for matmul means falling back to ATen. This PR just documents / makes that transparent. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166518 Approved by: https://github.com/eellison	2025-10-31 12:29:06 +00:00
arkadip-maitra	c01636e1bc	Fixes the sparse tensor issue (#163535 ) Fixes #148324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163535 Approved by: https://github.com/janeyx99	2025-10-31 11:48:31 +00:00
fengqing.lu	fd68d409ad	[xpu][feature] Integrate OneDNN SDPA training forward/backward into XPU OVERRIDEABLE Backend (#162454 ) This is the second PR split from https://github.com/pytorch/pytorch/pull/156272 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162454 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg	2025-10-31 11:20:38 +00:00
Wang, Chuanqi	0d3a4f7155	[CD] Enable Inductor performance test for xpu (#166289 ) Add Dynamo benchmark performance tests for XPU backend Pull Request resolved: https://github.com/pytorch/pytorch/pull/166289 Approved by: https://github.com/EikanWang, https://github.com/atalman	2025-10-31 10:52:07 +00:00
Xuehai Pan	108bb224f7	[pytree] add `treespec_{leaf,tuple,dict}` functions for args_spec modification (#160843 ) The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class. Changes: 1. Add function `treespec_leaf()` to replace `LeafSpec()`. 2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `args` / `*kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class. 3. Change `len(spec.children_specs)` to `spec.num_children`. 4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`. ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843 Approved by: https://github.com/mlazos	2025-10-31 10:33:16 +00:00
Yuanyuan Chen	fc8ac1216c	[4/N] Remove unused loop variables in tests (#166690 ) This PR removes unused loop variables in tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166690 Approved by: https://github.com/justinchuby, https://github.com/mlazos	2025-10-31 10:20:48 +00:00
Yuanyuan Chen	030de07aff	[2/N] Use 'is' in callable comparisons (#166685 ) It is generally advised to use `is/is not` for comparisons against torch functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166685 Approved by: https://github.com/xmfan, https://github.com/mlazos	2025-10-31 08:08:07 +00:00
Jazlyn Li	7d67a41db4	make FXConverter.generate use V.fake_mode instead of _detect_fake_mode_from_gm (#166591 ) Summary: FXConverter configures _node_metadata_hook passing in `fake_mode` explicitly, which is relevant for cases down the line like `_generate_triton_call` that inserts a `triton_kernel_wrapper_mutation` node. This `fake_mode` is obtained from `_detect_fake_mode_from_gm`, which can be different from inductor set `V.fake_mode`. For example, while `V.fake_mode` is not None, `_detect_fake_mode_from_gm` can be None for a parent graph containing only a submodule which has no input args and only constants ``` parent graph(): %sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {}) return (getitem,) submodule graph(): %randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cuda, pin_memory: False}) %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {}) return (add,) ``` This discrepancy is a problem: it makes `_node_metadata_hook` try running inputs in a different "fake_mode" or no fake_mode when the rest of lowering uses `V.fake_mode`. In some cases where input is placed on custom non-gpu device, it can even complain with "requires device to be started" or tensor device mismatch. So this diff updates FXConverter.generate to use `V.fake_mode` which is populated by inductor properly. Test Plan: added a test `test_const_folded_subgraph` in `test_fxir_backend.py`, this test: - creates a graph module that calls a subgraph with no inputs and containing only const-foldable ops - const fold the subgraph - run FXConverter.generate, expect `fake_mode` used to code-generate is not None On the prior implementation when `_detect_fake_mode_from_gm` was used, this test would fail as fake_mode would be `None`. 
With this change, the test passes, `fake_mode` is properly collected from `V.fake_mode` which is not None. Differential Revision: D85767475 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166591 Approved by: https://github.com/blaine-rister, https://github.com/mlazos, https://github.com/eellison	2025-10-31 05:52:07 +00:00
Minjang Kim	85b035ca9c	[nativert] Downcast triton double arguments to floats (#166620 ) This diff tries to fix a limitation in Sigmoid + Triton interaction, where float arguments are not correctly passed. NativeRT passes float arguments as double, while triton kernels were reading as a float, resulting in wrong values. --- ## Limitations in (de)serialization In triton, float arguments to a kernel are encoded as "fp32" ([code](https://github.com/triton-lang/triton-cpu/blob/main-merged/python/triton/runtime/jit.py#L310-L326)): ``` elif isinstance(arg, float): return ("fp32", None) ``` But it seems that torch export serde uses double ([code](`d2eff5d454/torch/_export/serde/export_schema.thrift (L149)`)) because Thrift only has the double type: ``` union Argument { 10: bool as_none; 20: TensorArgument as_tensor; 30: list<TensorArgument> as_tensors; 50: i64 as_int; 70: list<i64> as_ints; 80: double as_float; ===> actually double ... ``` `TritonKernel` constructor loads attributes from a node, where `Constant` represents the variant type. And it only has `double` ([code](`d2eff5d454/torch/nativert/graph/Graph.h (L86)`)): ``` using Constant = std::variant< None, int64_t, std::vector<int64_t>, double, ===> triton float is loaded as double ``` So, NativeRT passes float arguments (originally in Triton) as double to triton kernels. But, all of the triton backends (nvidia, amd and cpu) are reading them as float because the signature still says `fp32`. D84423898 was the current workaround: wrapping float arguments with tensors. ## The Fix Fixing the thrift definition isn't viable because Thrift only supports double type. It's also possible to fix on the triton side: it can downcast from double to float. But I needed to fix all backends. Instead, I think this diff would be the most effective way: when building `TritonKernel`, have downcasted float values, right after loading double arguments. 
Test Plan: ``` buck test fbcode//mode/opt-amd-gpu fbcode//caffe2/test:test_export -- ``` Differential Revision: D85747160 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166620 Approved by: https://github.com/XueningXu	2025-10-31 03:52:20 +00:00
William Wen	267d0197bf	[dynamo] fix error_on_graph_break bug where non-empty checkpoint results in unwanted graph break resumption (#166586 ) Fixes https://github.com/pytorch/pytorch/issues/166589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166586 Approved by: https://github.com/Lucaskabela ghstack dependencies: #166476, #166477	2025-10-31 03:36:27 +00:00
William Wen	1dec8a67a8	[dynamo, nested graph breaks] add disable_nested_graph_breaks decorator/context manager (#166477 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166477 Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007 ghstack dependencies: #166476	2025-10-31 03:36:27 +00:00
William Wen	797cd80b26	[dynamo, nested graph breaks] codegen dead nested cells correctly (#166476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/166476 Approved by: https://github.com/Lucaskabela	2025-10-31 03:36:27 +00:00
PyTorch MergeBot	7d39401fa0	Revert "[BE][Typing][Dynamo] Type misc files in `torch/_dynamo/variables/` (#166569 )" This reverts commit `f1e4c42b6e`. Reverted https://github.com/pytorch/pytorch/pull/166569 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/166569#issuecomment-3471180280))	2025-10-31 03:31:01 +00:00
Simon Layton	e3ae0594d1	Add CUDA MXFP4 scaled mm support via. FBGEMM (#166526 ) Summary: * Pull in `f4f4bf16` from FBGemm to provide MXFP4 support for CUDA * Add testing Test Plan: Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/166526 Approved by: https://github.com/drisspg, https://github.com/ngimel	2025-10-31 03:17:27 +00:00
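
The `[nativert]` commit above (#166620) describes a value stored as a 64-bit double being read by a kernel whose signature says `fp32`. The following is a minimal sketch of that failure mode and of the downcast remedy. It uses only Python's `struct` module and is not the NativeRT or Triton code. It assumes the argument is handed over as raw little-endian bytes, and the helper names `read_as_float32` and `downcast_to_float32` are hypothetical.

```python
import struct


def read_as_float32(raw: bytes) -> float:
    """Interpret the first 4 bytes of a buffer as a little-endian float32."""
    return struct.unpack("<f", raw[:4])[0]


def downcast_to_float32(value: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


value = 0.5

# Mismatch: the caller stores 8 bytes of double, the reader consumes 4 as float32.
as_double = struct.pack("<d", value)
misread = read_as_float32(as_double)

# Remedy sketched here: downcast first, then store 4 bytes of float32.
as_float = struct.pack("<f", downcast_to_float32(value))
correct = read_as_float32(as_float)

print(f"stored as double, read as fp32: {misread}")
print(f"downcast first, read as fp32:   {correct}")
```

For `0.5`, the low four bytes of the double encoding are all zero, so the mismatched read yields `0.0`, while the downcast path round-trips to `0.5`. Values that float32 cannot represent exactly would instead be rounded by the downcast.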


95291 Commits