A few workspace API changes:
1. Return the outer name when creating a workspace. Most use cases do not care about the outer name, but for mix-order reduction (stacked PR) we need it to perform the next layer of reduction on the workspace tensor.
2. Allow overriding the workspace tensor dtype.
3. Allow delaying the deallocation of workspace tensors in TritonKernel.call_kernel, since they may be used after the call. The lifetime of the workspace tensors is only extended slightly; they are deallocated once the next-layer reduction is done.
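A minimal sketch of how these three knobs might look from a caller's perspective; the names (`WorkspaceSpec`, `allocate_workspace`, `defer_free`) are illustrative stand-ins, not the actual Inductor API:
```python
from dataclasses import dataclass
import torch

@dataclass
class WorkspaceSpec:
    # Illustrative stand-in for Inductor's workspace descriptor, not the real class.
    size: int
    dtype: torch.dtype = torch.float32   # (2) the dtype can now be overridden
    defer_free: bool = False             # (3) keep the buffer alive past call_kernel

def allocate_workspace(spec: WorkspaceSpec) -> tuple[str, torch.Tensor]:
    buf = torch.empty(spec.size, dtype=spec.dtype)
    outer_name = f"ws_{id(buf):x}"       # (1) the outer name is now returned to the caller
    return outer_name, buf

# The mix-order-reduction pass can then use `outer_name` to emit the
# next-layer reduction over the workspace tensor before it is freed.
name, ws = allocate_workspace(WorkspaceSpec(size=1024, dtype=torch.float64, defer_free=True))
```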
Tested with the stacked PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166204
Approved by: https://github.com/jansel
Summary:
Attempting to forward fix failures from D85405167 (PR
https://github.com/pytorch/pytorch/pull/166021)
This is Devmate's suggestion and seems to work, but I'm not sure whether it's a good idea. Devmate says the call is being resolved to at::min, which is host-only; the reason it doesn't happen in OSS is likely that `AT_PER_OPERATOR_HEADERS` is defined in OSS but not internally.
```
In file included from .../ATen/native/hip/Normalization.hip:11:
.../ATen/native/hip/Normalization.cuh:302:37: error: no matching function for call to 'min'
302 | v_[u] = input[batch][plane][min(x+u*blockDim.x, input.size(2)-1)];
| ^~~
```
Differential Revision: D85463674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166195
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/eqy
Summary:
The float32 data type has a vectorized routine that computes erf(). That function currently calls std::exp() individually for each float in the vector being processed.
We now use Sleef's vectorized routine to compute exp, improving the performance of erf.
AVX2/AVX512 also have custom erf implementations, which use Sleef to compute exp.
We've observed a throughput increase of 25% when testing on tensors containing 1M elements.
Before:
f32 erf: 3175.977us
After:
f32 erf: 2539.446us
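A minimal sketch of how a comparable measurement could be reproduced from Python, assuming a 1M-element float32 tensor on CPU; the numbers above came from the operator_benchmark run listed in the test plan:
```python
import timeit

import torch

x = torch.randn(1_000_000, dtype=torch.float32)

# Warm up once, then average the vectorized float32 erf over repeated runs.
torch.erf(x)
elapsed = timeit.timeit(lambda: torch.erf(x), number=100) / 100
print(f"f32 erf: {elapsed * 1e6:.3f}us")
```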
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Differential Revision: D85522651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166262
Approved by: https://github.com/fadara01, https://github.com/jgong5, https://github.com/aditew01
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@8d373b](8d373ba272), which includes:
- Add CONFIGURE_DEPENDS to the install_xpu_headers macro to track these headers
- Add a check to ensure P2P tensors are dense
- Switch philox_engine_inputs usage to philox_xpu_state per the XPU graph request
- Add a vectorization path for maxpool backward channels-last
- Fix the SYCL_PRINT macro so it is usable on Windows
- Eliminate an unnecessary warning when AOT is not enabled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166129
Approved by: https://github.com/EikanWang
For https://github.com/pytorch/pytorch/issues/114850, we port some ATen unit tests to Intel GPU. We enable Intel GPU with the following methods, keeping the original code style as much as possible (see the sketch after the list):
1. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests.
2. Added allow_xpu=True for supported test classes in test parameterization.
3. Used torch.accelerator to extend CUDA-specific tests to XPU where needed.
4. Enabled 'xpu' for some test paths.
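A rough sketch of the ported pattern; the test class and op are placeholders, and exact decorator signatures may differ slightly between test files:
```python
import torch
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
    onlyOn,
)
from torch.testing._internal.common_utils import TestCase, run_tests

class TestUnaryOps(TestCase):
    # Previously @onlyCUDA; now enabled on both backends.
    @onlyOn(['cuda', 'xpu'])
    def test_abs(self, device):
        # torch.accelerator covers CUDA and XPU uniformly where
        # device-agnostic code is needed.
        assert torch.accelerator.is_available()
        x = torch.randn(8, device=device)
        self.assertEqual(x.abs(), torch.abs(x))

# allow_xpu=True opts the class into XPU parameterization.
instantiate_device_type_tests(TestUnaryOps, globals(), allow_xpu=True)

if __name__ == "__main__":
    run_tests()
```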
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165405
Approved by: https://github.com/guangyey, https://github.com/ezyang
This PR removes outdated ROCm version checks and their branches. While there is no explicit mention of a minimum supported version, ROCm 6.4 is the version listed on the installation page and in the CI YAML files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166111
Approved by: https://github.com/ezyang
I'd like to propose a new module, `Out-of-tree Backend Integration`, via the `PrivateUse1` device key. Out-of-tree backend integration via the `PrivateUse1` device key has been the recommended mechanism for plugging third-party accelerator devices into PyTorch. There are already quite a few documents/tutorials on its usage, the primary one being https://docs.pytorch.org/docs/main/accelerator/index.html.
We have also seen more and more HW vendors leverage the `PrivateUse1` mechanism to support their accelerators. For example:
1. Ascend NPU
2. Microsoft MAIA
3. MooreThreads MUSA
4. Cambricon MLU
The scope of `PrivateUse1`-based out-of-tree backend integration is composed of two parts:
1. `PrivateUse1` device as an out-of-tree backend, which involves:
(a) making `PrivateUse1` a functionally complete device on par with other in-tree devices: device runtime, autograd, autocast, profiling, distributed, quantization, etc.
(b) a pluggable design that allows out-of-tree integrations to extend the functionality of `PrivateUse1`, such as a backend registration mechanism that allows user-friendly device naming, runtime extension points in both C++ and Python for third parties to plug in their runtime implementation, and a customizable tensor implementation for third parties to add extra info/functionality to the tensor and its serialization.
2. OpenReg: a test suite and documentation effort to guarantee the functional correctness of the `PrivateUse1` mechanism and to guide HW vendors toward the right implementation.
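As a concrete illustration of the registration mechanism in (1)(b), a third-party backend typically renames the `PrivateUse1` key and registers its runtime module from Python roughly as below; the `my_npu` name and the stub module are placeholders, and a real backend also needs the C++ side (allocator, device guard, kernels) that OpenReg demonstrates:
```python
import types

import torch

# Stand-in for a vendor's runtime module; a real backend implements the usual
# hooks (is_available, device_count, current_device, ...) here.
my_npu_module = types.ModuleType("torch.my_npu")
my_npu_module.is_available = lambda: True

# Give PrivateUse1 a user-friendly name so devices print as e.g. 'my_npu:0'.
torch.utils.rename_privateuse1_backend("my_npu")

# Register the runtime module so it is reachable as torch.my_npu.
torch._register_device_module("my_npu", my_npu_module)

# Generate helpers such as Tensor.is_my_npu / Tensor.my_npu() for the new name.
torch.utils.generate_methods_for_privateuse1_backend()
```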
I'm also proposing @FFFrog as the maintainer for this new module due to his continuous contributions to the design and implementation of both parts of the module. Below are the RFCs/feature proposals @FFFrog has been working on:
1. [An improvement of PrivateUse1 mechanism, facilitating third-party backend integration](https://docs.google.com/document/d/1_2EO5A2Ww3xDwqbhIvs9Nk65-jV0oNYg3XAmNUsHdAY/edit?tab=t.0#heading=h.5vt8c1vo4dc7)
2. [The interoperability Standard of Third-party Backend Integration Mechanism](9bd181e742/RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md)
3. [PyTorch Backend Accelerator Integration Verification and Guidance](f6048cbd4f/RFC-0045-PyTorch-Accelerator-Integration-Enhancements.md)
@FFFrog has contributed 240+ PRs, the majority of which are related to `PrivateUse1` (https://github.com/pytorch/pytorch/pulls?q=is%3Apr+author%3Afffrog+). He has also reviewed 50+ PRs in this area, and he is the primary author of OpenReg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165958
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/ezyang
Summary:
Since we already share a flattened tensor `_rank_map` across all meshes from the same root mesh, we can just use a flattened list of it to replace the comparison of root_mesh and flattened_mesh_list (because with the same _rank_map and layout, the mesh tensor is guaranteed to be the same). This way we can also give back the CPU overhead added in https://github.com/pytorch/pytorch/pull/164510 and further simplify the code.
We do have a more ambitious universe-based change here: https://github.com/pytorch/pytorch/pull/165680, but it needs more discussion and would be BC-breaking. We might eventually merge that PR, but probably not now; this change is not BC-breaking and will help concatenate and 2D integration with concatenate.
cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta msaroufim dcci
imported-using-ghimport
Test Plan: Imported from OSS
Differential Revision: D85526705
Pulled By: fduwjj
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166264
Approved by: https://github.com/XilunWu
This makes `GraphModule.recompile()` also recompile any submodules that are themselves graph modules, which allows us to pass all existing regional Inductor tests without skipping.
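Conceptually, the new behavior is equivalent to something like the following simplified sketch (not the actual implementation):
```python
import torch.fx as fx

def recompile_recursively(gm: fx.GraphModule) -> None:
    # Recompile nested GraphModules (e.g. regionally compiled submodules)
    # first, then regenerate the parent's forward from its own graph.
    for name, submodule in gm.named_modules():
        if submodule is not gm and isinstance(submodule, fx.GraphModule):
            submodule.recompile()
    gm.recompile()
```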
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166002
Approved by: https://github.com/oulgen
ghstack dependencies: #165996
The redistribute test extensively exercises various sharding schemes and redistribution between them. These tests uncovered more edge cases that were not supported by local tensor, primarily different flavors of uneven sharding. To handle these cases, this change implements the missing functional collectives and adds support for the uneven sharding case where the sharding group (ranks) is larger than the size of the dimension being sharded. In the latter case the "missing" shards are represented by zero-sized tensors so that the rest of the local tensor machinery can stay oblivious to this special case.
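A small sketch of the zero-sized-shard representation for the over-sharded case; the split rule here is illustrative, not the exact DTensor chunking rule:
```python
import torch

def local_shards(dim_size: int, num_ranks: int) -> list[torch.Tensor]:
    # Split `dim_size` rows across `num_ranks`; ranks beyond the dimension
    # size hold zero-sized tensors so downstream code needs no special case.
    base, rem = divmod(dim_size, num_ranks)
    sizes = [base + (1 if r < rem else 0) for r in range(num_ranks)]
    return [torch.empty(n, 4) for n in sizes]

# Sharding a dimension of size 3 across 5 ranks: shard sizes [1, 1, 1, 0, 0].
print([s.shape[0] for s in local_shards(3, 5)])
```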
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166081
Approved by: https://github.com/ezyang
Initial autotuning support for foreach kernels, giving a 4x improvement for some kernels in an internal workload. More improvements can surely be made here in the future. This removes num_warps from the kernel definition to enable autotune support in the generated wrapper code.
Before:
triton_for_fused_18.kd 🔍 | 4.986 ms | 4.986 ms | 2.493 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.098 ms | 0.098 ms | 0.049 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.036 ms | 0.036 ms | 0.018 ms | 2 |
After:
triton_for_fused_18.kd 🔍 | 1.273 ms | 1.273 ms | 0.636 ms | 2 |
triton_for_fused_6.kd 🔍 | 0.044 ms | 0.044 ms | 0.022 ms | 2 |
triton_for_fused_7.kd 🔍 | 0.024 ms | 0.024 ms | 0.012 ms | 2 |
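One simple way to exercise a foreach kernel end-to-end under Inductor; this assumes a CUDA device and uses max-autotune as an example compile mode, while the kernels in the table above come from an internal workload:
```python
import torch

xs = [torch.randn(4096, device="cuda") for _ in range(16)]
ys = [torch.randn(4096, device="cuda") for _ in range(16)]

@torch.compile(mode="max-autotune")
def foreach_add(a, b):
    # Lowers to a foreach kernel in the generated wrapper code; with this
    # change such kernels can be autotuned rather than pinned to a fixed
    # num_warps.
    return torch._foreach_add(a, b)

out = foreach_add(xs, ys)
```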
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162053
Approved by: https://github.com/mlazos, https://github.com/naromero77amd
Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
This PR partially fixes https://github.com/pytorch/torchtitan/issues/1791, as it only works with the `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1` setting.
The core of the problem: in `max-autotune` mode Inductor runs multiple benchmarks to determine the best config. If one of these benchmarks fails with `cudaErrorLaunchFailure`, all other CUDA calls within the same process will fail, including the rest of the benchmarks.
The solution: restart the child process gracefully and continue benchmarking.
Unfortunately, if autotuning is done in the main process, the whole program falls into an unrecoverable state. In that case, the only way to execute successfully would be to prevent the launch failure (ULF) in the first place.
Here is some info from [CUDA documentation](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html):
>cudaErrorLaunchFailure = 719
An exception occurred on the device while executing a kernel. ... . This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.
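A minimal way to opt into subprocess autotuning using the setting mentioned above; the env var must be set before compilation starts, and the function being compiled here is just an example:
```python
import os

# Run max-autotune benchmarking in a child process, so a cudaErrorLaunchFailure
# during one candidate config only kills the child; with this PR the child is
# restarted and benchmarking continues.
os.environ["TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC"] = "1"

import torch

@torch.compile(mode="max-autotune")
def f(a, b):
    return a @ b

out = f(torch.randn(512, 512, device="cuda"), torch.randn(512, 512, device="cuda"))
```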
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166073
Approved by: https://github.com/syed-ahmed, https://github.com/drisspg