Previously, we would completely skip building and calling any resume function if the leaf frame's resume instruction was RETURN_VALUE/RETURN_CONST. Now, we only skip building/calling resume functions for frames that are resuming on RETURN_VALUE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165808
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013, #166015
This `patch.dict(counters, ...)` appears to be ancient code that doesn't seem to be doing anything. It causes issues in nested graph breaks because the patch cleanup clears out the record of the nested graph break. Removing the patch to see whether it's needed at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166015
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166013
Summary:
as title
- Some IR nodes are created during `finalize_multi_template_buffers()` in the Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.
- Extract an `assign_origin_node` function
Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_deferred_triton_kernels
```
Differential Revision: D83979975
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164746
Approved by: https://github.com/mlazos
Fixes #163929
Fixes argmin/argmax operations to return correct logical indices instead of physical memory offsets when applied to transposed/permuted tensors. When `argmin()` or `argmax()` was called on a transposed tensor, Inductor returned physical memory indices instead of logical row-major indices, producing incorrect results that don't match eager mode behavior.
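A minimal repro sketch (assumes a Triton-capable GPU; the actual regression test added in the PR may differ):
```python
import torch

# argmax over a transposed, non-contiguous view must report the logical row-major
# index (2 for the max element 9.0 below), not its physical storage offset (1).
def f(x):
    return x.t().argmax()

if torch.cuda.is_available():
    x = torch.tensor([[1.0, 9.0, 3.0], [4.0, 5.0, 6.0]], device="cuda")
    print(f(x).item(), torch.compile(f)(x).item())  # both should print 2 after this fix
```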
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165983
Approved by: https://github.com/shunting314
# why
- enable users to control which choices get used on which inputs
- reduce lowering time and pin kernel selection by selecting the choices for the inputs
# what
- a new InductorChoices subclass that implements a lookup table (sketched below)
- a README explaining the usage
- corresponding testing
- currently only supports templates that go through `V.choices.get_template_configs`
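A hypothetical sketch of such a subclass (names, signatures, and how it gets installed are illustrative, not the shipped API; see the README added by this PR for the real usage):
```python
from torch._inductor.choices import InductorChoices

class LookupTableChoices(InductorChoices):
    """Hypothetical sketch: pin template choices per input key via a lookup table."""

    def __init__(self, table):
        super().__init__()
        # Maps a key describing the inputs (e.g. shapes/dtypes) to pinned template configs.
        self.table = table

    @staticmethod
    def _make_key(*args, **kwargs):
        # Illustrative key; the real implementation derives it from the kernel inputs.
        return repr((args, sorted(kwargs.items())))

    def get_template_configs(self, *args, **kwargs):
        key = self._make_key(*args, **kwargs)
        if key in self.table:
            return self.table[key]  # pinned choices: skip re-selection for these inputs
        return super().get_template_configs(*args, **kwargs)
```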
# testing
```
python3 -bb -m pytest test/inductor/test_lookup_table.py -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164978
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
Summary: Fallback kernels are created with flattened constant args and an `unflatten` utility to unflatten them when needed. This PR applies it in the FXConverter to preserve the original structure.
Test Plan: added new CI tests
Differential Revision: D85347589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166144
Approved by: https://github.com/blaine-rister
**Summary:** When operations are done on partial placements, the sharding logic incorrectly determines whether we should redistribute the tensor to Replicate. By delaying the redistribution, we do the operation first and only then the partial reduction, which leads to incorrect results for max, min, gradient norm clipping, and more. We solve this by setting reduction_linear to False when there is a Partial placement, forcing the redistribution before completing the op.
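A plain-tensor illustration of why the op must wait for the redistribution (no DTensor setup needed; the per-rank local tensors below are hypothetical):
```python
import torch

# With a Partial(sum) placement, the logical tensor is the sum of per-rank local tensors.
# max() does not commute with that sum, so running it on the partial shards first is wrong.
local_rank0 = torch.tensor([1.0, 5.0])
local_rank1 = torch.tensor([4.0, -3.0])

correct = (local_rank0 + local_rank1).max()    # redistribute (sum) first, then max -> 5.0
wrong = local_rank0.max() + local_rank1.max()  # op on partial shards, then reduce  -> 9.0
print(correct.item(), wrong.item())
```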
**Test Cases**
1. pytest test/distributed/tensor/test_math_ops.py -k test_partial_reduction_ops
2. pytest test/distributed/tensor/test_math_ops.py -k test_matching_partial_reduction_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165962
Approved by: https://github.com/wconstab
**Summary:** The first thing I did was increase the world size to 8, because test_3d_with_tp_dp_pp wouldn't actually exercise fully_shard with tp = 2 and pp = 2, leaving dp = 1. The second was refactoring the tests that use single- and multi-stage schedules so that their logic is largely shared: I use the multi-stage logic from test_replicate_pp_grad to determine the start and end indices for a partial model, and set virtual_stage to 1 when using single-stage schedules. Even if this approach isn't approved, the multi-stage schedule logic in test_3d_with_tp_dp_pp and test_replicate_pp should be changed, as the current logic is incorrect.
**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165701
Approved by: https://github.com/H-Huang
This diff moves export run_decompositions to use aot_export_joint_with_descriptors instead of aot_export_module. Doing so, I ran into two main bugs:
1) aot_export_joint_with_descriptors doesn't correctly pass in the record_nn_module_stack flag that is needed to populate nn_module_stack when switching the internal tracer.
2) When creating a symint from negative inputs, we need to pass positive=False. This didn't matter before because aot_autograd directly returned integer inputs instead of creating symints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165931
Approved by: https://github.com/zhxchen17
For https://github.com/pytorch/pytorch/issues/114850, we are porting some ATen unit tests to Intel GPU. We enable Intel GPU with the following methods while trying our best to keep the original code style:
1. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests.
2. Added allow_xpu=True for supported test classes in test parametrization.
3. Used torch.accelerator to extend CUDA-specific tests to XPU where needed (see the sketch below).
4. Enabled 'xpu' for some test paths.
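For item 3, a rough sketch of the device-agnostic pattern (assumes the `torch.accelerator` API is available; the actual tests rely on the shared decorators in `torch.testing._internal`):
```python
import torch

# Pick the current accelerator (CUDA, XPU, ...) instead of hard-coding "cuda",
# falling back to CPU when no accelerator is present.
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
x = torch.randn(4, device=device)
print(x.device)
```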
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165405
Approved by: https://github.com/guangyey, https://github.com/ezyang
Summary:
Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also win back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simplify the code.
We do have a more ambitious universe-based change here: https://github.com/pytorch/pytorch/pull/165680, but it needs more discussion and would be BC-breaking. We might eventually merge that PR, but probably not now; this change is not BC-breaking and will help concatenation and 2D integration with concatenate.
cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci
imported-using-ghimport
Test Plan: Imported from OSS
Differential Revision: D85526705
Pulled By: fduwjj
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166264
Approved by: https://github.com/XilunWu
This makes it so that `GraphModule.recompile()` will also recompile any submodules that are themselves graph modules, which allows us to pass all existing regional inductor tests without skipping.
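A rough sketch of the new behavior (illustrative; the actual change lives inside `GraphModule.recompile()` itself):
```python
import torch.fx as fx

def recompile_recursive(gm: fx.GraphModule) -> None:
    # Recompile child GraphModules first, then the parent, so nested
    # regional-inductor submodules pick up their graph edits too.
    for child in gm.children():
        if isinstance(child, fx.GraphModule):
            recompile_recursive(child)
    gm.recompile()
```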
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166002
Approved by: https://github.com/oulgen
ghstack dependencies: #165996
The redistribute tests extensively exercise various sharding schemes and redistribution between them. These tests uncovered more edge cases that were not supported by the local tensor, primarily different flavors of uneven sharding. To handle these cases, this change implements missing functional collectives and adds support for the uneven sharding case where the sharding group (ranks) is larger than the size of the dimension being sharded. In the latter case the "missing" shards are represented by zero-sized tensors so that the rest of the local tensor machinery can stay oblivious to this special case.
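A rough single-process analogy of the zero-sized-shard convention (DTensor's actual sharding logic differs in details):
```python
import torch

# Splitting a dimension of size 3 across a group of 4 ranks leaves the last
# rank with an empty local shard rather than a special-cased "missing" shard.
x = torch.arange(3)
shards = torch.tensor_split(x, 4)
print([tuple(s.shape) for s in shards])  # [(1,), (1,), (1,), (0,)]
```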
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166081
Approved by: https://github.com/ezyang
At a high level, after this fix we get the following nice tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/bobren/54a57665-7dcc-41e0-8ca7-df01393cd4aa/custom/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
As seen in this doc, we were previously simply dropping asserts post-dynamo: https://docs.google.com/document/d/1nRQwvw_gWL0_9T3VKb5Ly3_tNI1fgqG9WtryeD6qaZI/edit?tab=t.0
The fixes are a couple of things:
1) Actually run the runtime-assertion FX graph pass on subgraphs.
2) Reset the fake-mode unbacked memo across speculate_subgraph invocations. The memos break runtime-assertion insertion because calls like nonzero end up not allocating new unbacked symints and hence not populating pending_unbacked, which then results in incorrect unbacked_bindings on FX nodes in subgraphs.
This is a first step in hardening runtime asserts across all phases of the compiler (eager, aot_eager, inductor, etc.). I will continue kicking the tires and fixing bugs until we get runtime assert generation in a good place. One obvious next step: the test case added in this PR fails when compiled with inductor with the following error (NB: it fails before this PR as well):
```
File "/data/users/bobren/a/pytorch/torch/_inductor/ir.py", line 659, in get_dtype
return self.dtype
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AttributeError: 'ShapeAsConstantBuffer' object has no attribute 'dtype'
target: cond
args[0]: Eq(Mod(s77, 4), 0)
args[1]: Subgraph(name='true_graph_0', graph_module=<lambda>(), graph=<torch._inductor.graph.SubgraphLowering object at 0x7fbcbb11e110>)
args[2]: Subgraph(name='false_graph_0', graph_module=<lambda>(), graph=<torch._inductor.graph.SubgraphLowering object at 0x7fbcbb21cf70>)
args[3]: (s77, TensorBox(StorageBox(
ComputedBuffer(name='buf0', layout=FlexibleLayout('cuda:0', torch.float32, size=[s77, s77], stride=[s77, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.float32, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7fbcbb2f37f0>, ranges=[s77, s77]))
)))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165893
Approved by: https://github.com/zou3519
The gatherKthValue kernel had a race condition where multiple threads could write to the same output location without synchronization when duplicate k-th values exist, resulting in non-deterministic output.
Changes:
- aten/src/ATen/native/cuda/Sorting.cu: Use atomicMin with shared memory to deterministically find minimum index. Add early termination and remove redundant inRange checks. (We have to cast the index to `int32_t`, but this is already assumed to fit earlier in the kernel.)
- aten/src/ATen/native/cuda/Sorting.cpp: Remove non-deterministic alert since kthvalue is now deterministic on CUDA.
- torch/__init__.py: Remove kthvalue from non-deterministic operations list and remove kthvalue example from use_deterministic_algorithms() docstring.
- test/test_torch.py: Remove test_nondeterministic_alert_kthvalue since kthvalue no longer raises alerts on CUDA.
Benefits:
- Deterministic: always returns the minimum index when duplicates exist (see the sketch below)
- Potential performance improvement on large arrays with repetitions
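A small illustration of the new tie-breaking behavior (a sketch; requires a CUDA device):
```python
import torch

if torch.cuda.is_available():
    x = torch.tensor([3.0, 1.0, 1.0, 2.0], device="cuda")
    # Indices 1 and 2 both hold the 1st-smallest value; with the atomicMin-based
    # kernel the returned index is deterministically the smaller one (1).
    value, index = torch.kthvalue(x, 1)
    print(value.item(), index.item())
```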
Test Results:
- All existing PyTorch tests pass (test_kthvalue)
- Custom determinism tests confirm consistent results
- Custom CUDA vs CPU correctness validated across 50+ scenarios
- Custom performance benchmarks show improvements with no visible regressions
Addresses #165227
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165762
Approved by: https://github.com/ngimel, https://github.com/eqy
# Problem
Inductor implicitly upcasts certain rank-0 kernel arguments from float16 to float32. Currently, this happens only on the `"cpu"` device, which appears to be related to float16 support in CPU Triton. However, it can also affect the behavior of GPU kernels when a model contains tensors from multiple devices. Upcasting may be undesirable on some platforms, so users can typically disable it with the `config.triton.codegen_upcast_to_fp32` flag. However, this flag was not respected by the rank-0 kernel argument codepath.
Through an improbable series of events, float32 upcasting caused an internal model to fail compilation on MTIA. (Internal reviewers see T242444110.)
# Fix
If `config.triton.codegen_upcast_to_fp32` evaluates to `False`, cast the kernel argument to the original dtype.
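A minimal sketch of opting out of the upcast (the CI test is more involved, mixing CPU and GPU tensors so a GPU kernel receives the rank-0 argument):
```python
import torch
import torch._inductor.config as inductor_config

# With upcasting disabled, a rank-0 float16 argument should be passed to the
# generated kernel at its original dtype instead of being promoted to float32.
inductor_config.triton.codegen_upcast_to_fp32 = False

def f(x, scale):
    return x * scale  # `scale` is a rank-0 float16 tensor

x = torch.randn(8, dtype=torch.float16)
scale = torch.tensor(2.0, dtype=torch.float16)
out = torch.compile(f)(x, scale)
```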
# Test plan
Added a new CI test checking for the downcast iff the config flag is false. The test mixes GPU and CPU tensors to generate a GPU kernel with the implicit float32 upcast and explicit float16 downcast.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166118
Approved by: https://github.com/jfix71, https://github.com/jansel, https://github.com/kundaMwiza
Apparently the mul tests in test_sparse were disabled. The dense representation (i.e., when nnz is not a scalar) was broken on MPS. This PR fixes it and enables the tests in test_sparse.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166164
Approved by: https://github.com/malfet
- pass important config values directly into the class
- migrate those configs from `test_configs` to another class
- add an (off-by-default) config to enable it inside inductor, instead of requiring a custom post pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166130
Approved by: https://github.com/bdhirsh
Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also win back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simplify the code.
We do have a more ambitious universe-based change here: https://github.com/pytorch/pytorch/pull/165680, but it needs more discussion and would be BC-breaking. We might eventually merge that PR, but probably not now; this change is not BC-breaking and will help concatenation and 2D integration with concatenate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166003
Approved by: https://github.com/Skylion007, https://github.com/fegin
Summary:
Part of an effort to extract some important error logs (e.g. [#157996](https://github.com/pytorch/pytorch/pull/157996)) that were `tee`'d to `stdout` and `stderr`.
The general idea is to:
- Duplicate the `tee`s on `stdout` and `stderr` to separate files, `filtered_stdout.log` and `filtered_stderr.log`, respectively.
- In these files, as their names suggest, only log lines matching a customizable filter.
- Later on, in another PR, append the contents of these files to the reply file.
Outline of changes in this PR:
- Enhance `TailLog` to be able to 1) stream to a file, and 2) only write when the line matches the passed filter (a sketch of the filtering idea follows the list).
- Add `filtered_stdout` and `filtered_stderr` to `LogsDest` and have `LogsSpecs` `reify` them.
- In `start_processes()` and `PContext`, add params `duplicate_stdout_filters` and `duplicate_stderr_filters` to filter and write the duplicated stream to the files above. When no filters are passed in, no duplicated streams are created.
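A minimal, standalone sketch of the line-filtering idea (illustrative only; the real implementation lives in `TailLog` and streams incrementally rather than filtering a list):
```python
import re

def filter_lines(lines, patterns):
    # Keep only lines matching at least one of the given regex filters,
    # mirroring what the duplicated filtered_stdout/stderr files receive.
    compiled = [re.compile(p) for p in patterns]
    return [line for line in lines if any(p.search(line) for p in compiled)]

print(filter_lines(
    ["INFO: starting", "ERROR: CUDA out of memory", "INFO: done"],
    [r"ERROR", r"Traceback"],
))  # ['ERROR: CUDA out of memory']
```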
Test Plan:
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:api_test
```
```
Buck UI: https://www.internalfb.com/buck2/f5c6b7da-217d-4a0b-872a-c7cd3d05587f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4222124951617688
Network: Up: 398B Down: 44MiB (reSessionID-a489a961-b602-45be-b851-3490ebb7a26a)
Analyzing targets. Remaining 0/200
Executing actions. Remaining 0/12856 0.1s exec time total
Command: test. Finished 1 local
Time elapsed: 17:37.9s
Tests finished: Pass 52. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
$ buck test 'fbcode//mode/opt' caffe2/test/distributed/elastic/multiprocessing:tail_log_test
```
```
Buck UI: https://www.internalfb.com/buck2/d6d5c1c1-db98-4d9c-b608-7ba6fbb5e3ee
Test UI: https://www.internalfb.com/intern/testinfra/testrun/13510798985149262
Network: Up: 94KiB Down: 417MiB (reSessionID-27b46fba-d31c-4c04-8ede-a506454e6922)
Analyzing targets. Remaining 0/3 536 actions, 555 artifacts declared
Executing actions. Remaining 0/186 1:05.5s exec time total
Command: test. Finished 7 local, 1 remote, 115 cache (93% hit) 37.0s exec time cached (56%)
Time elapsed: 1:11.5s
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D80188995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160712
Approved by: https://github.com/fduwjj
Summary:
_dynamo_graph_capture_for_export in its current form has compatibility issues with the main torch.compile() path, even though we reuse fullgraph_capture as the bytecode tracer. The reason is that we flip on many export-specific flags and even trace a wrapped function, which causes divergence from torch.compile() again.
This PR instead creates a new implementation of dynamo_graph_capture_for_export which relies 100% on fullgraph capture plus post-processing of CaptureOutput, so that we can avoid the inversion of phases in the PT2 compiler stack.
This also benefits the precompile workflow, since we want a feature that only accepts pytree inputs and ships portable Python wrappers in the package. In other words, I think the code here is sharable between export and precompile for exporting portable graphs.
Test Plan:
```
===================================================================== test session starts =====================================================================
platform linux -- Python 3.12.11, pytest-7.3.2, pluggy-1.6.0
rootdir: /data/users/zhxchen17/pytorch
configfile: pytest.ini
plugins: xdoctest-1.1.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, rerunfailures-14.0, flakefinder-1.1.0, cpp-2.3.0, anyio-4.10.0
collected 9 items
Running 9 items in this shard
test/distributed/tensor/test_dtensor_export.py ........x [100%]
================================================================ 8 passed, 1 xfailed in 11.42s ================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165562
Approved by: https://github.com/tugsbayasgalan
Refactors `DebugMode.operators` into more structured `_DebugCall` objects instead of (op, args, kwargs, call_depth) tuples. This is useful going forward for attaching more information (e.g. output info, call metadata).
This is BC-breaking, but `_OpCall` and `_RedistributeCall` get an `__iter__` method so previous tuple-style usage still works (see the sketch below).
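A simplified sketch of the BC shim (the real `_OpCall`/`_RedistributeCall` classes carry more fields):
```python
class _OpCall:
    def __init__(self, op, args, kwargs, call_depth):
        self.op, self.args, self.kwargs, self.call_depth = op, args, kwargs, call_depth

    def __iter__(self):
        # Keeps old tuple-style unpacking working: op, args, kwargs, depth = call
        return iter((self.op, self.args, self.kwargs, self.call_depth))

op, args, kwargs, depth = _OpCall("aten::add", (1, 2), {}, 0)
```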
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165376
Approved by: https://github.com/yushangdi
This PR allows GraphPickler to pickle aot_eager graph modules that have regional inductor bits in them, with a few exceptions:
- FlexAttentionBackward isn't marked cacheable, so those tests don't work immediately since we're not sure how to serialize it. But it's safe to serialize/cache, so the next PR fixes those unit tests.
- It seems that when reloading a GraphPickled object, we don't recompile subgraphs. Will investigate this in a future PR
All unit tests in test_regional_inductor are parameterized so that we try serializing and deserializing the returned graph module before returning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165844
Approved by: https://github.com/oulgen
ghstack dependencies: #165843
This is required by the chunked_with_scan work, where two nested vmaps with chunk sizes > 1 are invoked, producing a scan -> vmap -> scan -> vmap chain; we need to handle both vmap(scan) and scan(vmap).
We handle vmap(scan) by turning it into scan(vmap(combine_fn)): the combine_fn no longer combines a single slice; instead we vmap over it and perform multiple combines in one step. We need to know how combine_fn propagates the batched tensor and what the batched dims of the output are. For this purpose, we use restore_vmap to give us the out_dims information.
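A small illustration of a vmapped combine_fn doing a whole batch of combines in one scan step (illustrative; the real transformation also has to track out_dims via restore_vmap):
```python
import torch
from torch.func import vmap

def combine_fn(carry, x):
    return carry + x

batched_combine = vmap(combine_fn)   # combines every batch element in a single step
carry = torch.zeros(3)               # one carry per batch element
xs = torch.ones(3)                   # one new slice per batch element
print(batched_combine(carry, xs))    # tensor([1., 1., 1.])
```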
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165580
Approved by: https://github.com/zou3519
ghstack dependencies: #165675
Fixes #165870. Follow-up to #165254.
This PR [a] removes the MPS-specific version of `lu_factor` in favor of the version in BatchedLinearAlgebra.cpp which uses `lu_factor_ex`, and [b] updates `lu_factor_ex` error codes to match expectations.
When `lu_factor` was first implemented for MPS (#99269), it bypassed the implementation in BatchedLinearAlgebra.cpp since we did not have `lu_factor_ex`. Since #144651 implements `lu_factor_ex`, we can now remove the MPS-specific wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165871
Approved by: https://github.com/kulinseth, https://github.com/albanD
This PR refactors the autocast context manager in autocast_mode.py to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping `device_supported_dtypes` is used to associate device types with their supported dtypes, and the validation logic is unified.
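A sketch of the unified check (illustrative; the real mapping lives in autocast_mode.py, covers more devices, and the dtype sets below are an assumption):
```python
import warnings
import torch

device_supported_dtypes = {
    "cuda": (torch.float16, torch.bfloat16),
    "cpu": (torch.bfloat16, torch.float16),
}

def _check_autocast_dtype(device_type: str, dtype: torch.dtype) -> bool:
    # Unified validation: warn and disable autocast when the dtype is unsupported.
    if dtype not in device_supported_dtypes.get(device_type, ()):
        warnings.warn(
            f"In {device_type} autocast, but the target dtype is not supported. Disabling autocast."
        )
        return False
    return True

print(_check_autocast_dtype("cuda", torch.float64))  # warns, prints False
```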
**The former PR #163446 was merged but reverted due to failing CI on `openreg`-related tests.**
This PR additionally modifies some test assertions slightly so that the CI tests pass. CI failed because the assertions expected the exact same error message. For example:
```
File "/var/lib/jenkins/workspace/test/cpp_extensions/open_registration_extension/torch_openreg/tests/test_autocast.py", line 9, in test_autocast_with_unsupported_type
with self.assertWarnsRegex(
AssertionError: "In openreg autocast, but the target dtype torch.float32 is not supported." does not match "In openreg autocast, but the target dtype is not supported. Disabling autocast."
```
Sorry for the inconvenience again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165221
Approved by: https://github.com/FFFrog, https://github.com/albanD
This avoids the generation of bad bytecode, which led to a really confusing error. I am not sure why we can't reconstruct cleanly; it has to do with the input being a dict, while other supported ctx managers take bools. Fixing that is for another day. Let's give a good error message for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166006
Approved by: https://github.com/yushangdi, https://github.com/SherlockNoMad