pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Edward Z. Yang	4ae57d448c	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-03 07:33:55 +00:00
xinan.lin	81b7b16618	Reland "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 )" (#161949 ) This PR reland #161142 which is reverted to be able to revert other PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161949 Approved by: https://github.com/jansel	2025-09-02 23:43:27 +00:00
Kurt Mohler	791eff96c8	[MPS] Add `igamma/igammac` ops (#161927 ) Fixes #161725 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161927 Approved by: https://github.com/malfet	2025-09-02 20:52:02 +00:00
PyTorch MergeBot	420c52ecf3	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit `626cb7df81`. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982))	2025-09-02 20:24:01 +00:00
Edward Z. Yang	626cb7df81	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-01 23:00:21 +00:00
PyTorch MergeBot	54e275e0d8	Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 )" This reverts commit `c83cbd2f2a`. Reverted https://github.com/pytorch/pytorch/pull/161142 on behalf of https://github.com/jeanschmidt due to This PR needs to be reverted to be able to revert another PR, this is due to merge conflicts, I am sorry for this. Please feel free to rebase and merge at your earliest convenience ([comment](https://github.com/pytorch/pytorch/pull/161142#issuecomment-3242937640))	2025-09-01 17:03:50 +00:00
Isalia20	f3697b033e	[MPS] add bunch of unary funcs for sparse tensors (#161846 ) adds bunch of unary functions for sparse tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/161846 Approved by: https://github.com/malfet	2025-08-30 21:13:05 +00:00
xinan.lin	c83cbd2f2a	[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142 ) Fixes #161384, Fixes #161162, Fixes #160946, Fixes #160947, Fixes #160948 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161142 Approved by: https://github.com/jansel	2025-08-30 11:09:07 +00:00
Irakli Salia	8627a19adf	[MPS] sparse add unary funcs + add for sparse tensors (#160839 ) Adds several unary functions and add. Enables tests for unary functions in test_sparse but not enabling other tests yet, needs more ops before we fully migrate to testing SparseMPS with `test_sparse.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-30 01:09:00 +00:00
Nikita Shulga	7c30a9d7fc	[MPS] Add slow version of `kthvalue` (#161817 ) Which heavily borrows implementation logic from `topk` As this method is non-deterministic, modified the logic for cpu-ops indices comparison with just an equality statement, as by default random numbers picked for input tensor allow for quite a lot of overlaps Pull Request resolved: https://github.com/pytorch/pytorch/pull/161817 Approved by: https://github.com/dcci	2025-08-30 00:44:29 +00:00
Zhang, Liangang	3e459491b5	Enable XPU path for FlexAttention (#143553 ) [#RFC153024](https://github.com/pytorch/pytorch/issues/153024) Motivation 1. The Attention has been the critical performance bottleneck in the current LLM models, and FlexAttention is a good choice to cover the broad variants in the transformers series models. With FlexAttention, it is easy for us to enable the paged attention and fused SDPA in the transformers repo on XPU device. Besides, it also provide a candidate to process attention in LLM ecosystem libraries ., e.g., vLLM, SGLang on XPU device. 2. FlexAttention is good start point to push the intel triton based GEMM kernel to be matured. FlexAttention provide both flexattention kernel and flexdecoding kernel to cover both compute bound and memory bound GEMM computation, and different shapes should also been supported to serve LLM inference., e.g. head_dim=64, 96, 128, 256. What does this PR do? 1. Enable the device type for Flexattention kernel and UTs to ensure all important UTs pass on XPU device. 2. For E2E model inference, ensure the functionality of LLM models inference with FlexAttention to be ready. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143553 Approved by: https://github.com/EikanWang, https://github.com/drisspg Co-authored-by: Mao Yunfei <yunfei.mao@intel.com> Co-authored-by: Xingyuan Li <xingyuan.li@intel.com> Co-authored-by: majing <jing1.ma@intel.com> Co-authored-by: Xiao, Wang <wang.xiao@intel.com>	2025-08-29 23:10:58 +00:00
PyTorch MergeBot	f6368e934e	Revert "[MPS] sparse add unary funcs + add for sparse tensors (#160839 )" This reverts commit `93c5112f46`. Reverted https://github.com/pytorch/pytorch/pull/160839 on behalf of https://github.com/atalman due to test_sparse_csr.py::TestSparseCompressedCPU::test_consistency_SparseCSR_asinh_cpu_complex64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17329155095/job/49201551217) [HUD commit link](`93c5112f46`) ([comment](https://github.com/pytorch/pytorch/pull/160839#issuecomment-3238093296))	2025-08-29 19:55:39 +00:00
Irakli Salia	93c5112f46	[MPS] sparse add unary funcs + add for sparse tensors (#160839 ) Adds several unary functions and add. Enables tests for unary functions in test_sparse but not enabling other tests yet, needs more ops before we fully migrate to testing SparseMPS with `test_sparse.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-08-29 16:28:58 +00:00
Xilun Wu	627decb0ed	[DTensor] fix DTensorTestCase.destroy_pg() when device_type is "cpu" but CUDA device is available (#161015 ) Summary When `device_id` is not None, barrier() will choose the accelerator of the most pripority, which means if the test specifies to use CPU for testing while CUDA is available on the host, the barrier() will use CUDA. To avoid this and better respect `self.device_type`, we add this branch to enforce barrier() to use CPU when `self.device_type` is CPU and other accelerator is also available. Test `pytest test/distributed/tensor/test_dtensor_testbase.py` Debugging Output ``` # from init_process_group() init pg: backend=gloo, device_id = None default_pg has backend: gloo, device_types: [device(type='cuda'), device(type='cpu')] # from barrier() barrier: device_ids = [10], devices = [], device = None, PG=[device(type='cuda'), device(type='cpu')] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/161015 Approved by: https://github.com/tianyu-l	2025-08-29 12:47:11 +00:00
Nikita Shulga	2042d2174a	[MPS] Migrate round unary op to Metal (#161712 ) And actually use the right function, as [`torch.round`](https://docs.pytorch.org/docs/stable/generated/torch.round.html) doesn't use `std::round`, but rather `std::rint`, which can be easily seen by running something like ```python import torch print(torch.arange(-3., 3., step=.5, device='mps').round()) print(torch.arange(-3., 3., step=.5, device='mps').cpu().round()) ``` Before this change it printed ``` tensor([-3., -3., -2., -2., -1., -1., 0., 1., 1., 2., 2., 3.], device='mps:0') tensor([-3., -2., -2., -2., -1., -0., 0., 0., 1., 2., 2., 2.]) ``` But after this change results match Pull Request resolved: https://github.com/pytorch/pytorch/pull/161712 Approved by: https://github.com/dcci	2025-08-28 16:45:07 +00:00
xinan.lin	bae01479c3	[Inductor UT] Re-enable test_torchinductor_opinfo.py on XPU. (#161477 ) The PR #160222 replaced @skipCUDAIf with @requires_cuda_and_triton in test_torchinductor_opinfo.py, which caused the CI jobs for other devices to skip this large test suite. We attempted to revert #160222 but ran into conflicts. I then opened #160936 to revert the changes from #160222, but that resulted in CPU CI job timeouts. I also filed issue #161132 for assistance, but haven’t received a response yet. To minimize the impact, this PR re-enables the test suite on XPU first. I will continue to seek help on re-enabling it for CPU afterwards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161477 Approved by: https://github.com/jansel	2025-08-28 03:29:21 +00:00
drisspg	30edac5da6	Updates to CuTe DSL template renderer (#161117 ) # Summary This adds a few more render functions available to template writers, specifically get_output and modification. The reasons why are more clear in the next PR in this stack. <img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" /> Majority of the new cod is around the OpOverrides for CuTe DSL. It is alot to test and most of the actual testing I have been doing is via score_mods to the flash_attention at the next layer of this stack. A bunch of score mods that me and Claude came up with , that exercise the actual ops. ``` Py def causal_mask(score, b, h, q_idx, kv_idx): """Causal attention mask.""" return torch.where(q_idx >= kv_idx, score, float("-inf")) def relative_bias(score, b, h, token_q, token_kv): """Relative position bias.""" return score + torch.abs(token_q - token_kv) def relative_bias_v2(score, b, h, token_q, token_kv): """Relative position bias with factor of 2.""" return score + 2 * torch.abs(token_q - token_kv) def times_two(score, b, h, q_idx, kv_idx): """Simple score modification that doubles the score.""" return score * 2 def alibi_bias(score, b, h, q_idx, kv_idx): """ALiBi (Attention with Linear Biases) - used in some modern models.""" # Different slopes for different heads slope = 2 ** (-8 * (h + 1) / 8) # Simplified version return score - slope * torch.abs(q_idx - kv_idx) def sliding_window(score, b, h, q_idx, kv_idx, window_size=256): """Sliding window attention - only attend to nearby tokens.""" return torch.where( torch.abs(q_idx - kv_idx) <= window_size, score, float("-inf") ) def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64): """Block diagonal attention pattern.""" q_block = q_idx // block_size kv_block = kv_idx // block_size return torch.where(q_block == kv_block, score, float("-inf")) def additive_bias(score, b, h, q_idx, kv_idx): """Test simple addition with position-based bias.""" return score + (q_idx + kv_idx) * 0.01 def multiplicative_decay(score, b, h, q_idx, kv_idx): """Test multiplication with distance-based decay.""" distance = torch.abs(q_idx - kv_idx) return score * torch.exp(-0.1 * distance) def sine_wave_bias(score, b, h, q_idx, kv_idx): """Test trigonometric functions.""" return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64) def log_distance_penalty(score, b, h, q_idx, kv_idx): """Test logarithmic operations.""" distance = torch.abs(q_idx - kv_idx).float() return score - torch.log(1 + distance) def alternating_mask(score, b, h, q_idx, kv_idx): """Test with alternating pattern - good for branch prediction.""" return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf")) def head_specific_pattern(score, b, h, q_idx, kv_idx): """Different behavior per attention head.""" even_head = h % 2 == 0 causal = q_idx >= kv_idx return torch.where(even_head & causal, score, float("-inf")) def sparse_strided(score, b, h, q_idx, kv_idx, stride=4): """Sparse attention with strided pattern.""" return torch.where( (kv_idx % stride == 0) \| (q_idx == kv_idx), score, float("-inf") ) def causal_with_global(score, b, h, q_idx, kv_idx): """Causal mask but first few tokens are globally attended.""" is_causal = q_idx >= kv_idx is_global = kv_idx < 4 return torch.where(is_causal \| is_global, score, float("-inf")) def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2): """Dilated attention pattern - exponentially increasing gaps.""" distance = torch.abs(q_idx - kv_idx) is_attended = (distance == 0) \| ((distance > 0) & ((distance & (distance - 1)) == 0)) return torch.where(is_attended, score, float("-inf")) ``` Example outputs: ``` [Test Suite] Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128 [Test 1: none] [No score_mod, flash='enabled'] Found flash_attncute: True [No score_mod, flash='disabled'] Found flash_attncute: False ✓ Outputs match between flash enabled/disabled ✓ Output matches eager SDPA (rtol=0.001, atol=0.001) [Test 2: causal] [With score_mod, flash='enabled'] Found flash_attncute: True [With score_mod, flash='disabled'] Found flash_attncute: False ✗ Outputs differ between flash modes: Tensor-likes are not close! Mismatched elements: 17879 / 134217728 (0.0%) Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed) Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed) [Test 3: rel_bias] [With score_mod, flash='enabled'] Found flash_attncute: True [With score_mod, flash='disabled'] Found flash_attncute: False ✗ Outputs differ between flash modes: Tensor-likes are not close! Mismatched elements: 12836 / 134217728 (0.0%) Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed) Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed) [Test 4: rel_bias_v2] ``` This is bfloat16 and there are no major differences. The list of pointwise ops here isn't exhaustive but it is fairly covering Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117 Approved by: https://github.com/mlazos	2025-08-27 23:01:31 +00:00
PyTorch MergeBot	014b98dd09	Revert "Add inductor backend to device interface; make minifier_tests more device agnostic (#151314 )" This reverts commit `77bc959fe1`. Reverted https://github.com/pytorch/pytorch/pull/151314 on behalf of https://github.com/atalman due to sorry change is faling internally ([comment](https://github.com/pytorch/pytorch/pull/151314#issuecomment-3229774015))	2025-08-27 21:21:19 +00:00
PyTorch MergeBot	38ed57d446	Revert "Updates to CuTe DSL template renderer (#161117 )" This reverts commit `1750cc8037`. Reverted https://github.com/pytorch/pytorch/pull/161117 on behalf of https://github.com/atalman due to will need to revert to unblock revert of https://github.com/pytorch/pytorch/pull/151314 ([comment](https://github.com/pytorch/pytorch/pull/161117#issuecomment-3229754295))	2025-08-27 21:17:25 +00:00
drisspg	1750cc8037	Updates to CuTe DSL template renderer (#161117 ) # Summary This adds a few more render functions available to template writers, specifically get_output and modification. The reasons why are more clear in the next PR in this stack. <img width="1645" height="364" alt="Screenshot 2025-08-21 at 1 48 50 PM" src="https://github.com/user-attachments/assets/2d508fda-4273-43ef-9edf-086e592e9249" /> Majority of the new cod is around the OpOverrides for CuTe DSL. It is alot to test and most of the actual testing I have been doing is via score_mods to the flash_attention at the next layer of this stack. A bunch of score mods that me and Claude came up with , that exercise the actual ops. ``` Py def causal_mask(score, b, h, q_idx, kv_idx): """Causal attention mask.""" return torch.where(q_idx >= kv_idx, score, float("-inf")) def relative_bias(score, b, h, token_q, token_kv): """Relative position bias.""" return score + torch.abs(token_q - token_kv) def relative_bias_v2(score, b, h, token_q, token_kv): """Relative position bias with factor of 2.""" return score + 2 * torch.abs(token_q - token_kv) def times_two(score, b, h, q_idx, kv_idx): """Simple score modification that doubles the score.""" return score * 2 def alibi_bias(score, b, h, q_idx, kv_idx): """ALiBi (Attention with Linear Biases) - used in some modern models.""" # Different slopes for different heads slope = 2 ** (-8 * (h + 1) / 8) # Simplified version return score - slope * torch.abs(q_idx - kv_idx) def sliding_window(score, b, h, q_idx, kv_idx, window_size=256): """Sliding window attention - only attend to nearby tokens.""" return torch.where( torch.abs(q_idx - kv_idx) <= window_size, score, float("-inf") ) def block_diagonal(score, b, h, q_idx, kv_idx, block_size=64): """Block diagonal attention pattern.""" q_block = q_idx // block_size kv_block = kv_idx // block_size return torch.where(q_block == kv_block, score, float("-inf")) def additive_bias(score, b, h, q_idx, kv_idx): """Test simple addition with position-based bias.""" return score + (q_idx + kv_idx) * 0.01 def multiplicative_decay(score, b, h, q_idx, kv_idx): """Test multiplication with distance-based decay.""" distance = torch.abs(q_idx - kv_idx) return score * torch.exp(-0.1 * distance) def sine_wave_bias(score, b, h, q_idx, kv_idx): """Test trigonometric functions.""" return score + 0.1 * torch.sin(2 * math.pi * (q_idx - kv_idx) / 64) def log_distance_penalty(score, b, h, q_idx, kv_idx): """Test logarithmic operations.""" distance = torch.abs(q_idx - kv_idx).float() return score - torch.log(1 + distance) def alternating_mask(score, b, h, q_idx, kv_idx): """Test with alternating pattern - good for branch prediction.""" return torch.where((q_idx + kv_idx) % 2 == 0, score, float("-inf")) def head_specific_pattern(score, b, h, q_idx, kv_idx): """Different behavior per attention head.""" even_head = h % 2 == 0 causal = q_idx >= kv_idx return torch.where(even_head & causal, score, float("-inf")) def sparse_strided(score, b, h, q_idx, kv_idx, stride=4): """Sparse attention with strided pattern.""" return torch.where( (kv_idx % stride == 0) \| (q_idx == kv_idx), score, float("-inf") ) def causal_with_global(score, b, h, q_idx, kv_idx): """Causal mask but first few tokens are globally attended.""" is_causal = q_idx >= kv_idx is_global = kv_idx < 4 return torch.where(is_causal \| is_global, score, float("-inf")) def dilated_attention(score, b, h, q_idx, kv_idx, dilation_rate=2): """Dilated attention pattern - exponentially increasing gaps.""" distance = torch.abs(q_idx - kv_idx) is_attended = (distance == 0) \| ((distance > 0) & ((distance & (distance - 1)) == 0)) return torch.where(is_attended, score, float("-inf")) ``` Example outputs: ``` [Test Suite] Config: batch=4, heads=32, seq_q=8192, seq_kv=8192, dim=128 [Test 1: none] [No score_mod, flash='enabled'] Found flash_attncute: True [No score_mod, flash='disabled'] Found flash_attncute: False ✓ Outputs match between flash enabled/disabled ✓ Output matches eager SDPA (rtol=0.001, atol=0.001) [Test 2: causal] [With score_mod, flash='enabled'] Found flash_attncute: True [With score_mod, flash='disabled'] Found flash_attncute: False ✗ Outputs differ between flash modes: Tensor-likes are not close! Mismatched elements: 17879 / 134217728 (0.0%) Greatest absolute difference: 0.0078125 at index (0, 15, 15, 60) (up to 0.001 allowed) Greatest relative difference: 2.5 at index (3, 22, 153, 126) (up to 0.001 allowed) [Test 3: rel_bias] [With score_mod, flash='enabled'] Found flash_attncute: True [With score_mod, flash='disabled'] Found flash_attncute: False ✗ Outputs differ between flash modes: Tensor-likes are not close! Mismatched elements: 12836 / 134217728 (0.0%) Greatest absolute difference: 0.015625 at index (0, 3, 2775, 84) (up to 0.001 allowed) Greatest relative difference: 11.8125 at index (3, 28, 4095, 76) (up to 0.001 allowed) [Test 4: rel_bias_v2] ``` This is bfloat16 and there are no major differences. The list of pointwise ops here isn't exhaustive but it is fairly covering Pull Request resolved: https://github.com/pytorch/pytorch/pull/161117 Approved by: https://github.com/mlazos	2025-08-27 18:39:09 +00:00
PyTorch MergeBot	9f6e1b8730	Revert "[ROCm] SDPA fix mem fault when dropout is enabled (#154864 )" This reverts commit `3caddd4daa`. Reverted https://github.com/pytorch/pytorch/pull/154864 on behalf of https://github.com/atalman due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/154864#issuecomment-3225554119))	2025-08-26 20:03:59 +00:00
Charlie West-Taylor	77bc959fe1	Add inductor backend to device interface; make minifier_tests more device agnostic (#151314 ) Tried to decouple the always cpu <=> c++, cuda <=> triton assumption. Tried to keep it relatively simple by just guarding things more specifically, at the moment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151314 Approved by: https://github.com/eellison	2025-08-26 19:40:37 +00:00
Lucas Kabela	f0e0a6897e	type misc init and tools for dynamo (#161293 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/161293 Approved by: https://github.com/anijain2305	2025-08-26 17:38:49 +00:00
David Berard	8c506e6310	[easy][test] Add repeat_interleave opinfo that exercises binary search fusion (#161445 ) This adds a configuration that would have caught the need for https://github.com/pytorch/pytorch/pull/159961 when https://github.com/pytorch/pytorch/pull/158462 was landed. Notably: * the test has output_size kwarg specified * the input is 1D plus a size-1 dimension (otherwise, if there are non-size-1 dimensions, then the fusion won't occur) Differential Revision: [D80981715](https://our.internmc.facebook.com/intern/diff/D80981715) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161445 Approved by: https://github.com/eellison, https://github.com/v0i0	2025-08-26 12:32:24 +00:00
Chuanhao Zhuge	e9d42b3880	[small][muon] Use addmm for Newton–Schulz orthogonalization (#161379 ) A performance optimization. Using `torch.addmm`, which fuses `matrix multiply + scale + add` into one op. Benchmark In a QWEN-like 0.5B model training we observed average `optimizer.step()` latency speedup: matmul ~44.5 ms -> addmm ~27.4 ms: a 1.62× speedup. matmul <img width="1403" height="600" alt="Screenshot 2025-08-24 at 3 15 37 PM" src="https://github.com/user-attachments/assets/a77a68d4-da3c-473a-97f0-e6ef0a3b46d9" /> addmm <img width="1426" height="602" alt="Screenshot 2025-08-24 at 3 13 42 PM" src="https://github.com/user-attachments/assets/e493af36-44d3-4026-9f7c-fd0f9cdbc7e5" /> Testing End-to-end training: We used a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curves show consistency between normal matmul and addmm. <img width="1035" height="434" alt="Screenshot 2025-08-24 at 2 56 21 PM" src="https://github.com/user-attachments/assets/b96b13e3-0a01-4908-853c-d917b41f3d75" /> Unit test: ```python # dummy model and data model0 = Linear(10, 10, bias=False) model1 = copy.deepcopy(model0) inputs = torch.randn(8, 10) targets = torch.randn(8, 10) loss = MSELoss() lr = 1e-3 wd = 0.1 momentum = 0.95 opt_ref_muon = Muon( params=model0.parameters(), lr=lr, weight_decay=wd, momentum=momentum, nesterov=nesterov, adjust_lr_fn="original", ) opt_exp_muon = Muon( params=model1.parameters(), lr=lr, weight_decay=wd, momentum=momentum, nesterov=nesterov, adjust_lr_fn="original", use_addmm=True, ) out_ref = model0(inputs) loss_ref = loss(out_ref, targets) opt_ref_muon.zero_grad() loss_ref.backward() opt_ref_muon.step() out_exp = model1(inputs) loss_exp = loss(out_exp, targets) opt_exp_muon.zero_grad() loss_exp.backward() opt_exp_muon.step() for p_ref, p_exp in zip(model0.parameters(), model1.parameters()): torch.testing.assert_close(p_ref, p_exp) ``` shows numeric difference, but this is expected on bf16 precision: ``` Mismatched elements: 96 / 100 (96.0%) Greatest absolute difference: 8.985400199890137e-05 at index (1, 9) (up to 1e-06 allowed) Greatest relative difference: 0.007370449136942625 at index (0, 6) (up to 1e-05 allowed) ``` ~~Introduced a flag that allows users to opt in, as there are numerical differences relative to the original implementation.~~ Update: since `addmm` fuses the math ops, there are fewer intermediate roundings and is therefore more numerically accurate compared to the original form. Based on this, we opt to make `addmm` the default and only option. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161379 Approved by: https://github.com/janeyx99	2025-08-26 09:17:28 +00:00
PyTorch MergeBot	40c0e700a4	Revert "[AMD] Fix AMD User Defined Kernel Autotune (#160671 )" This reverts commit `431846a632`. Reverted https://github.com/pytorch/pytorch/pull/160671 on behalf of https://github.com/atalman due to new test is failing: inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_rocm_triton_autotuning_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/17172795679/job/48725235301) [HUD commit link](`431846a632`) ([comment](https://github.com/pytorch/pytorch/pull/160671#issuecomment-3220442141))	2025-08-25 14:07:48 +00:00
Chuanhao Zhuge	74280d0913	[muon] Introduce Muon optimizer to PyTorch (#160213 ) A single-device version of Muon. Algorithm refers Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy. This implementation maintains a minimalist API and is consistent with other optimizer conventions. PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory. Usage ```python model = MyModelForCausalLM # filter out your params manually muon_params = [...] adamw_params = [...] muon = Muon( params = muon_params lr=lr, wd=wd, ) adamw = AdamW( params = adamw_params lr=lr, wd=wd, ) # in training loop loss = model(input) loss.backward() muon.step() adamw.step() muon.zero_grad() adamw.zero_grad() ``` ~~Additional usage~~ ~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~ ```python ~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~ ~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~ ``` As discussed with team and in comment, we prefer to make the interface simpler and cleaner, thus we removed the callback interface, and canonicalize the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs. By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW. Testing ~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~ As discussed, in order not to complicate the codebase, we prefer not to include reference implementation into PyTorch. We also updated the interface so we don't need to test the FQN based filtering. Muon is covered by the existing `test_optim.py` unit test. 2. End-to-end test: we added a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency. <img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" /> Numerics We evaluate our implementation with existing implementation to confirm numerical consistency. As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap. As expected, the numerics difference mainly comes from `adjust_lr`, a max of ~5% relative diff in an example unit test setup below. ```python # dummy model and data model0 = Linear(10, 10, bias=False) model1 = copy.deepcopy(model0) inputs = torch.randn(8, 10) targets = torch.randn(8, 10) loss = MSELoss() lr = 1e-3 wd = 0.1 momentum = 0.95 opt_ref_muon = KellySingleDeviceMuon( params=model0.parameters(), lr=lr, weight_decay=wd, momentum=momentum, ) opt_exp_muon = Muon( params=model1.parameters(), lr=lr, weight_decay=wd, momentum=momentum, ) out_ref = model0(inputs) loss_ref = loss(out_ref, targets) opt_ref_muon.zero_grad() loss_ref.backward() opt_ref_muon.step() out_exp = model1(inputs) loss_exp = loss(out_exp, targets) opt_exp_muon.zero_grad() loss_exp.backward() opt_exp_muon.step() for p_ref, p_exp in zip(model0.parameters(), model1.parameters()): torch.testing.assert_close(p_ref, p_exp) ``` As explained above, including this `adjust_lr` is preferable. This is validated by an e2e training runs on training a qwen-2-like 0.5b model, where the curves show that training with `adjust_lr` converges more effectively than without. <img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" /> Performance Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP: - adamw_ddp finishes in 13.12 min - pytorch_muon_ddp finishes in 13.45 min Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is 2.5% slower than AdamW. AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms <img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" /> Muon: Optimizer.step() takes ~54 ms, step time ~960 ms <img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" /> Note We restrict the implementation to accept only 2D parameters. An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful. Since Muon is designed to operate orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound. Next Steps 1. Add `MuP` 2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices. 3. Open-source unsharded Muon co-designed with FSDP2. **** Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213 Approved by: https://github.com/janeyx99	2025-08-24 08:03:04 +00:00
Chong Gu	431846a632	[AMD] Fix AMD User Defined Kernel Autotune (#160671 ) Summary: AMD specific kwargs need to be removed from the guard, otherwise a keyerror will be raised when executing the kernel. Test Plan: ``` buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR ``` can succeed after this change. Rollback Plan: Differential Revision: D80285441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671 Approved by: https://github.com/muchulee8	2025-08-23 07:23:09 +00:00
Andy Lugo	3caddd4daa	[ROCm] SDPA fix mem fault when dropout is enabled (#154864 ) Fixes issue that exhibited a device side memory access fault due to incorrect tensor life management Pull Request resolved: https://github.com/pytorch/pytorch/pull/154864 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-08-21 14:23:13 +00:00
Will Constable	1ea918caf9	[C10D] Make MultiProcContinuousTest less spammy (#160821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160821 Approved by: https://github.com/fduwjj ghstack dependencies: #160892	2025-08-19 20:17:19 +00:00
Will Constable	779fc29c04	[C10D] Fix spelling of MultiProcContinuousTest (#160892 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160892 Approved by: https://github.com/fduwjj	2025-08-19 20:17:19 +00:00
Nikita Shulga	a44a0d3671	[MPS] Fix index_add for complex + int64 (#160926 ) By re-using deterministic algorithm from `bbc7c03e93/aten/src/ATen/native/cuda/Indexing.cu (L1106-L1113)` Fixes https://github.com/pytorch/pytorch/issues/160845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160926 Approved by: https://github.com/manuelcandales ghstack dependencies: #160850, #160889	2025-08-19 17:43:06 +00:00
Wei Wang	16ada80c61	[BE][CUDA][Distributed] Add require_exact_world_size() and a few distributed unit test fixes (#160803 ) 1. Add require_exact_world_size() 2. Decorate the test `test_new_subgroups_with_group_param` with this require_exact_world_size(4) as the test would fail with world_size of 8 when testing with 8xB200 runner. 3. Modify `test_new_subgroups_world_size_not_divisible_by_group_size` so that it will not fail due to 4 vs. 8 mismatch. Doing so makes the test pass with both 4-GPU runner and 8-GPU runner. Separating these changes out from B200 distributed runner PR #159323 Fixes https://github.com/pytorch/pytorch/issues/159987 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160803 Approved by: https://github.com/fduwjj	2025-08-18 21:15:33 +00:00
Peter Y. Yeh	e389a08dcd	AMD/ROCm OCP Micro-scaling Format (mx-fp8/mx-fp4) Support (#151360 ) - This pull request introduces support for the [OCP Micro-scaling (MX) format](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), with a focus on compatibility with AMD ROCm 7.0 and the gfx950 architecture. This PR also establishes the foundation for enabling MX-FPX features in [TorchAO](https://github.com/pytorch/ao/issues/2229) on the AMD platform. - Validation (ROCm 7.0 + gfx950 required): `111 relevant tests passing.` > PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v Co-author: @jagadish-amd — Thank you for the efforts leading validation on gfx950 with ROCm 7.0. ----------------------------------- This pull request introduces support for new scalar types and scaling methods, particularly for ROCm 7.0 and gfx950, and refines testing for these features. Key changes include adding constraints for matrix dimensions, enabling block-wise scaling, and updating tests to accommodate new data types. ### Support for new scalar types and scaling methods: * [`aten/src/ATen/cuda/CUDABlas.cpp`](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885): Added constraints for matrix dimensions when using `Float8_e8m0fnu` with block-wise scaling, ensuring dimensions are multiples of 32. Updated compatibility checks to support ROCm 7.0 for `Float8_e8m0fnu` and `Float8_e4m3fn`. [[1]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeR1876-R1885) [[2]](diffhunk://#diff-74fcb26047c1df4024105d36ce22a36b77cf8cc93c28631d743e639b3d6066aeL1913-R1934) * [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290): Introduced block-wise scaling for `Float8_e8m0fnu`, with checks for ROCm 7.0 and GPU architecture `gfx950`. Added validation for supported scalar types and matrix dimensions. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1276-R1290) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abR1349-R1364) ### Updates to scalar type mappings: * [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L93-R93): Extended scalar type mappings to support `Float4_e2m1fn_x2` for ROCm 7.0. * [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fR88-R96): Added a constexpr mapping for `Float4_e2m1fn_x2` based on ROCm version. ### Enhancements to testing(@jagadish-amd): * [`test/test_matmul_cuda.py`](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766): Updated tests to include new scalar types (`Float4_e2m1fn_x2`) and recipes (`mxfp4`). Added logic to handle different scaling recipes and validate compatibility with ROCm and CUDA versions. [[1]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23R765-R766) [[2]](diffhunk://#diff-3f31c52b48cfddf8f4617d809f7695b2e4a1c78656f8c4b5143a4b45d01fcf23L1331-R1356) F592e669L1353R1472) These changes improve compatibility with newer hardware and software versions, enhance functionality for matrix operations, and ensure robust testing for the added features. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151360 Approved by: https://github.com/drisspg, https://github.com/malfet	2025-08-18 16:43:09 +00:00
Kurt Mohler	6382302990	[MPS] Add `grid_sampler_3d` for MPS (#160541 ) This PR adds support for `grid_sampler_3d` for MPS with "bilinear" interpolation. NOTE: "nearest" interpolation is not yet supported Fixes #159882 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160541 Approved by: https://github.com/malfet	2025-08-15 16:19:25 +00:00
Nikita Shulga	db0b7f1cc9	[BE][CI] Adjust `error_inputs` for cat and complex (#160378 ) MPS backend does not support double, so errors should be different Pull Request resolved: https://github.com/pytorch/pytorch/pull/160378 Approved by: https://github.com/dcci	2025-08-13 18:35:06 +00:00
PyTorch MergeBot	641ee74781	Revert "Add `label_smoothing` param in `nn.BCELoss` and `nn.BCEWithLogitsLoss` (#150282 )" This reverts commit `f990490a23`. Reverted https://github.com/pytorch/pytorch/pull/150282 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/150282#issuecomment-3182844949))	2025-08-13 09:01:52 +00:00
Deng, Daisy	6e8865fbc1	port 3 distributed test to Intel GPU and unified some common functions (#158533 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. We could enable Intel GPU with following methods and try the best to keep the original code styles: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - enabled XPU for some test path - Unify some common code under torch/testing/_internal for multiple backend, for example: - requires_nccl_version - _dynamo_dist_per_rank_init - DynamoDistributedSingleProcTestCase - DistTestCases - FSDPTestMultiThread Pull Request resolved: https://github.com/pytorch/pytorch/pull/158533 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-08-13 08:13:23 +00:00
Zain Rizvi	0d71ca2c46	[EZ] Replace `pytorch-labs` with `meta-pytorch` (#160459 ) This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository now that the 'pytorch-labs' org has been renamed to 'meta-pytorch' ## Changes Made - Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch' - Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml - Skipped binary files and files larger than 1MB due to GitHub api payload limits in the script to cover all repos in this org. Will do a more manual second pass later to cover any larger files ## Files Modified This PR updates files that contained the target text. Generated by automated script on 2025-08-12T20:41:29.888681+00:00Z Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459 Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet	2025-08-12 22:44:25 +00:00
zeshengzong	f990490a23	Add `label_smoothing` param in `nn.BCELoss` and `nn.BCEWithLogitsLoss` (#150282 ) Fixes #91545 ## Changes - Add `label_smoothing` param and docs - Add test case for `label_smoothing` - Remove duplicate description in `nn.BCELoss` and `nn.BCEWithLogitsLoss` ## Test Result ```bash pytest -s test/test_nn.py -k test_bce ``` ![image](https://github.com/user-attachments/assets/30c0b7fe-fe49-4aa0-9b05-4d70403a7b05) ![image](https://github.com/user-attachments/assets/4fe3fd1c-54b8-4012-afd9-133ce9fb4964) ![image](https://github.com/user-attachments/assets/5cad019a-3a4c-475a-9fde-9c1acad5792d) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150282 Approved by: https://github.com/cyyever, https://github.com/mikaylagawarecki	2025-08-12 09:37:03 +00:00
PyTorch MergeBot	b149c7204c	Revert "port distributed pipeline test files for Intel GPU (#159033 )" This reverts commit `76a0609b6b`. Reverted https://github.com/pytorch/pytorch/pull/159033 on behalf of https://github.com/clee2000 due to broke test_cpp_extensions_stream_and_event.py::TestCppExtensionStreamAndEvent::test_stream_event [GH job link](https://github.com/pytorch/pytorch/actions/runs/16890370216/job/47849586456) [HUD commit link](`76a0609b6b`) note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/159033#issuecomment-3176833314))	2025-08-11 20:44:45 +00:00
Liao, Wei	76a0609b6b	port distributed pipeline test files for Intel GPU (#159033 ) In this PR we will port all distributed pipeline test files. We could enable Intel GPU with following methods and try the best to keep the original code styles: 1. instantiate_device_type_tests() 2. use "torch.accelerator.current_accelerator()" to determine the accelerator backend 3. use "requires_accelerator_dist_backend()" to replace requires_nccl() 4. use "get_default_backend_for_device()" to get backend 5. enabled XPU for some test path 6. add TEST_MULTIACCELERATOR in common_utils for all backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159033 Approved by: https://github.com/guangyey, https://github.com/d4l3k Co-authored-by: Daisy Deng <daisy.deng@intel.com>	2025-08-11 19:43:15 +00:00
Nikita Shulga	842cc77ab9	[MPS] Extend addmm to integral types (#160270 ) By adding `addmm` kernel, which is a logical continuation of `mm` one. The only tricking part are how alpha and beta constants are handled, which are passed as `optmath_t`, i.e. that it could be, int64, int32 or float Unified all MM flavors instantiations thru `INSTANTIATE_MM_OPS` and tested that `addmm` metal kernel works as expected for floating types as well by testing it via ``` PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_ ``` Fixes https://github.com/pytorch/pytorch/issues/154901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #160228, #160234	2025-08-11 00:54:17 +00:00
ghostspiders	af10f1f86c	Fix requires_cuda to requires_cuda_and_triton (#160222 ) Fixes ##159399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222 Approved by: https://github.com/janeyx99	2025-08-10 07:05:52 +00:00
Nikita Shulga	df55ec7d4b	[OpInfo][BE] Better inputs for addmm (#160234 ) Right now alpha and betha are both less than zero, which makes them useless for all addmm samples for interal types Pull Request resolved: https://github.com/pytorch/pytorch/pull/160234 Approved by: https://github.com/Skylion007 ghstack dependencies: #160228	2025-08-10 01:26:48 +00:00
Nikita Shulga	8c41cb800a	[MPS][BE] Combine all pre-MacOS14 xfail lists (#160228 ) It does not matter whether it started to fail after 13.1 or 13.3, fact that it still fails on latest MacOS Pull Request resolved: https://github.com/pytorch/pytorch/pull/160228 Approved by: https://github.com/dcci	2025-08-09 00:00:46 +00:00
Nikita Shulga	28ccc9e724	[MPS] Extend `index_put` to complex types (#160159 ) And delete confusing supported types check. Move all pseudo atomic (but eventually consistent) ops to `c10/metal/atomic.h` header Fixes https://github.com/pytorch/pytorch/issues/160034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160159 Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/Skylion007	2025-08-08 21:54:30 +00:00
gaoyvfeng	50f23ff6f8	rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883 ) Fixes #159399 "Modified torch.testing._internal.inductor_utils and test/inductor" Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883 Approved by: https://github.com/janeyx99	2025-08-08 15:44:52 +00:00
Aaron Gokaslan	beb4d7816d	[BE]: ruff PLC0207 - use maxsplit kwarg (#160107 ) Automatically replaces split with rsplit when relevant and only performs the split up to the first ( or last value). This allows early return of the split function and improve efficiency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107 Approved by: https://github.com/albanD	2025-08-08 03:14:59 +00:00
Dev Sashidhar	8cb91e20bc	Renaming HAS_XPU to HAS_XPU_AND_TRITON (#159908 ) This PR follows up on the discussion in #159399 where @Akabbaj and @janeyx99 mentioned renaming HAS_XPU to HAS_XPU_AND_TRITON for consistency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159908 Approved by: https://github.com/janeyx99, https://github.com/guangyey	2025-08-07 11:24:44 +00:00
Xuehai Pan	5cedc5a0ff	[BE][PYFMT] migrate PYFMT for `torch/[p-z]*/` to `ruff format` (#144552 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144552 Approved by: https://github.com/ezyang	2025-08-07 00:09:56 +00:00
Mwiza Kunda	0afaeb7c4e	Improve `extract_test_fn` (#158637 ) The current implementation assumes test functions are resolved as test_module.TestClass.test_fn, however this would not work for modules nested in directories e.g. inductor.test_torchinductor.TestClass.test_fn Pull Request resolved: https://github.com/pytorch/pytorch/pull/158637 Approved by: https://github.com/jbschlosser	2025-08-06 20:45:21 +00:00
Will Constable	dad2a05bec	[DTensor] Set up DTensorContinuousTestBase (#159885 ) Also migrate `test_common_rules.py` since it was a short file `python test/distributed/tensor/test_common_rules.py` Before: Ran 10 tests in 91.516s After: Ran 10 tests in 5.604s Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885 Approved by: https://github.com/ezyang	2025-08-06 07:40:31 +00:00
Nikita Shulga	e06b110f73	[Testing] Add MPS to NATIVE_DEVICES (#153835 ) This would allow me to enable more opinfo tests against MPS device eventually and supposed to be a very simple test, but actually required minor adjustments to lots of test files, namely: - Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64` - Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())` - Skip `test_from_dlpack_noncontinguous` as it currently crashes (need to be fixed) - Add lots of `expectedFailureIfMPS` - Delete all `@onlyNativeDeviceTypesAnd("mps")` <sarcasm> I love how well documented this variable are </sarcasm> Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835 Approved by: https://github.com/Skylion007	2025-08-05 18:57:35 +00:00
Luca Wehrstedt	aeb5321b63	Allow controlling PG backend and options via init_device_mesh (#159371 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159371 Approved by: https://github.com/wconstab, https://github.com/fduwjj, https://github.com/wanchaol	2025-08-05 12:44:14 +00:00
Kurt Mohler	b59b61a099	Add `avg_pool3d` backward pass for MPS (#159089 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159089 Approved by: https://github.com/malfet	2025-08-05 01:55:38 +00:00
PyTorch MergeBot	356ac3103a	Revert "Stop parsing command line arguments every time common_utils is imported. (#156703 )" This reverts commit `310f901a71`. Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/izaitsevfb due to breaking tests internally with `assert common_utils.SEED is not None` ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3152337518))	2025-08-04 20:37:39 +00:00
Kurt Mohler	d4109a0f99	[MPS] Add max_unpool1d/2d/3d (#159789 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159789 Approved by: https://github.com/malfet	2025-08-04 20:00:59 +00:00
Anthony Barbier	310f901a71	Stop parsing command line arguments every time common_utils is imported. (#156703 ) Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs: https://github.com/pytorch/pytorch/pull/154612 https://github.com/pytorch/pytorch/pull/154628 https://github.com/pytorch/pytorch/pull/154715 https://github.com/pytorch/pytorch/pull/154716 https://github.com/pytorch/pytorch/pull/154725 https://github.com/pytorch/pytorch/pull/154728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703 Approved by: https://github.com/clee2000	2025-08-02 16:38:54 +00:00
Chris Thi	c400c8e2e0	[ROCm] Add FP8 rowwise support to _scaled_grouped_mm + Submodule update (#159075 ) Summary: In this PR we integrate the [FBGEMM AMD FP8 rowwise scaling grouped GEMM kernel](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped) to add support for the `_scaled_grouped_mm` API on AMD. `_scaled_grouped_mm` is [currently supported on Nvidia](`9faef3d17c/aten/src/ATen/native/cuda/Blas.cpp (L1614)`), this PR aims to bring parity to AMD. Related: [[RFC]: PyTorch Low-Precision GEMMs Public API](https://github.com/pytorch/pytorch/issues/157950#top) #157950. The kernel is developed using the Composable Kernel framework. Only MI300X is currently supported. In the near future we plan to add support for MI350X as well. For data types we support FP8 e3m4. The kernel support will be gated with the `USE_FBGEMM_GENAI` flag. We hope to enable this by default for relevant AMD builds. Note we also update submodule `third_party/fbgemm` to 0adf62831 for the required updates from fbgemm. Test Plan: Hipify & build ``` python tools/amd_build/build_amd.py USE_FBGEMM_GENAI=1 python setup.py develop ``` Unit tests ``` python test/test_matmul_cuda.py -- TestFP8MatmulCUDA Ran 488 tests in 32.969s OK (skipped=454) ``` Performance Sample \| G \| M \| N \| K \| Runtime Ms \| GB/S \| TFLOPS \| \| -- \| -- \| -- \| -- \| -- \| -- \| -- \| \| 128 \| 1 \| 2048 \| 5120 \| 0.37\| 3590 \| 7.17 \| \| 128 \| 64 \| 2048 \| 5120 \| 0.51\| 2792 \| 338.34 \| \| 128 \| 128 \| 2048 \| 5120 \| 0.66\| 2272 \| 522.72 \| \| 128 \| 1 \| 5120 \| 1024 \| 0.21\| 3224 \| 6.43 \| \| 128 \| 64 \| 5120 \| 1024 \| 0.29\| 2590 \| 291.40 \| \| 128 \| 128 \| 5120 \| 1024 \| 0.40\| 2165 \| 434.76 \| \| 128 \| 1 \| 4096 \| 4096 \| 0.69\| 3126 \| 6.25 \| \| 128 \| 64 \| 4096 \| 4096 \| 0.85\| 2655 \| 324.66 \| \| 128 \| 128 \| 4096 \| 4096 \| 1.10\| 2142 \| 501.40 \| \| 128 \| 1 \| 8192 \| 8192 \| 2.45\| 3508 \| 7.01 \| \| 128 \| 64 \| 8192 \| 8192 \| 3.27\| 2692 \| 336.74 \| \| 128 \| 128 \| 8192 \| 8192 \| 4.04\| 2224 \| 543.76 \| \| 16 \| 1 \| 2048 \| 5120 \| 0.04\| 3928 \| 7.85 \| \| 16 \| 64 \| 2048 \| 5120 \| 0.05\| 3295 \| 399.29 \| \| 16 \| 128 \| 2048 \| 5120 \| 0.07\| 2558 \| 588.69 \| \| 16 \| 1 \| 5120 \| 1024 \| 0.03\| 3119 \| 6.23 \| \| 16 \| 64 \| 5120 \| 1024 \| 0.03\| 2849 \| 320.62 \| \| 16 \| 128 \| 5120 \| 1024 \| 0.05\| 2013 \| 404.11 \| \| 16 \| 1 \| 4096 \| 4096 \| 0.06\| 4512 \| 9.02 \| \| 16 \| 64 \| 4096 \| 4096 \| 0.09\| 3124 \| 381.95 \| \| 16 \| 128 \| 4096 \| 4096 \| 0.13\| 2340 \| 547.67 \| \| 16 \| 1 \| 8192 \| 8192 \| 0.32\| 3374 \| 6.75 \| \| 16 \| 64 \| 8192 \| 8192 \| 0.42\| 2593 \| 324.28 \| \| 16 \| 128 \| 8192 \| 8192 \| 0.53\| 2120 \| 518.36 \| - Using ROCm 6.4.1 - Collected through `triton.testing.do_bench_cudagraph` Binary size with gfx942 arch Before: 116103856 Jul 23 14:12 build/lib/libtorch_hip.so After: 118860960 Jul 23 14:29 build/lib/libtorch_hip.so The difference is 2757104 bytes (~2.6 MiB). Reviewers: @drisspg @ngimel @jwfromm @jeffdaily Pull Request resolved: https://github.com/pytorch/pytorch/pull/159075 Approved by: https://github.com/drisspg	2025-07-30 23:53:58 +00:00
Denghui Dong	a775c8e73e	[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 ) Hi team, Please help review this patch. This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable. I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by `257c413cd1` on 3.12.5. So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it. There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain. ~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446 Approved by: https://github.com/sraikund16	2025-07-30 16:35:51 +00:00
Nikita Shulga	15bb81ea4f	[2/N][CI] Remove MacOS-13 workarounds from tests (#159304 ) Part of https://github.com/pytorch/pytorch/issues/159275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159304 Approved by: https://github.com/dcci, https://github.com/cyyever ghstack dependencies: #159277, #159278	2025-07-29 23:12:13 +00:00
Kurt Mohler	52b9af163c	Add `avg_pool3d` for MPS (#158877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158877 Approved by: https://github.com/malfet	2025-07-29 15:22:22 +00:00
PyTorch MergeBot	1cffb217ef	Revert "[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 )" This reverts commit `e88f804a2e`. Reverted https://github.com/pytorch/pytorch/pull/155446 on behalf of https://github.com/XuehaiPan due to Breaks Windows wheels ([comment](https://github.com/pytorch/pytorch/pull/155446#issuecomment-3125566269))	2025-07-28 05:29:37 +00:00
Denghui Dong	e88f804a2e	[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 ) Hi team, Please help review this patch. This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable. I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by `257c413cd1` on 3.12.5. So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it. There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain. ~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446 Approved by: https://github.com/sraikund16	2025-07-25 21:44:57 +00:00
Yidi Wu	4aa69ae336	[torchbind] support register_autocast for torchbind custom op (#158583 ) Fix https://github.com/pytorch/pytorch/issues/158414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158583 Approved by: https://github.com/zou3519	2025-07-25 20:55:41 +00:00
Xuehai Pan	f5e2de928b	[BE] fix remaining flake8 v7 warnings (#159044 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044 Approved by: https://github.com/Skylion007 ghstack dependencies: #159043	2025-07-25 02:56:34 +00:00
Xuehai Pan	f903bc475c	[BE] add noqa for flake8 rule B036: found `except BaseException` without re-raising (#159043 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159043 Approved by: https://github.com/Skylion007	2025-07-25 02:56:34 +00:00
PyTorch MergeBot	b533f12120	Revert "[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 )" This reverts commit `da94023b02`. Reverted https://github.com/pytorch/pytorch/pull/155446 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @sraikund16 can you please help validate the fix? (See D78845227 for details). You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/155446#issuecomment-3115072504))	2025-07-24 21:46:00 +00:00
Luca Wehrstedt	5ab0eb28f7	Support DeepSeek-style blockwise scaling scaled-mm for fp8 on Hopper+ (#158037 ) cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack. The scaling format is still detected from the sizes of the scale tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037 Approved by: https://github.com/eqy, https://github.com/drisspg	2025-07-24 20:10:51 +00:00
Bin Bao	0118931e27	[Inductor] Fix a user-defined Triton kernel bool param codegen issue (#158845 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/158778. When handling a boolean type parameter to a user-defined Triton kernel, we need to treat it differently from integer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158845 Approved by: https://github.com/davidberard98, https://github.com/eellison	2025-07-24 00:19:27 +00:00
Mikayla Gawarecki	7f649ed4f8	Add basic torch.hash_tensor op (#154149 ) Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor. - The hash is always uint64. - Integers will be casted to uint64 before performing the xor_sum reduction - Floats will be upcasted to double and then bitcasted to uint64 before performing the xor_sum reduction Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149 Approved by: https://github.com/albanD	2025-07-23 22:28:03 +00:00
Denghui Dong	da94023b02	[Profiler] Fix lost C call events problem in Python 3.12.0-3.12.4 (#155446 ) Hi team, Please help review this patch. This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable. I found the root cause is not that we cannot get C function frames by `PyFrame_GetBack` when PythonTracer is filling start frames, but the c call event loss problem bug on Python 3.12.0-3.12.4. And that problem was fixed by `257c413cd1` on 3.12.5. So I think the https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, this patch reverts the change of it. There are solutions to fix the problem correctly, such as we can add a new monitoring callback to compensate call events of methods with C function or we can override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain. ~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446 Approved by: https://github.com/sraikund16, https://github.com/cyyever	2025-07-23 20:03:52 +00:00
PyTorch MergeBot	30b0ad5c68	Revert "Fix decorators skipping NCCL tests (#158846 )" This reverts commit `57024913c4`. Reverted https://github.com/pytorch/pytorch/pull/158846 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking trunk. See distributed/_composable/fsdp/test_fully_shard_logging.py::LoggingTests::test_fsdp_logging [GH job link](https://github.com/pytorch/pytorch/actions/runs/16472103496/job/46564570609) [HUD commit link](`57024913c4`) ([comment](https://github.com/pytorch/pytorch/pull/158846#issuecomment-3109553414))	2025-07-23 17:47:35 +00:00
Mwiza Kunda	d3d9bc1c31	[inductor] Allow backends to register their own custom config object (#158254 ) An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254 Approved by: https://github.com/eellison	2025-07-23 15:56:06 +00:00
Alexander Grund	57024913c4	Fix decorators skipping NCCL tests (#158846 ) Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip` In particular it will not try to start the test (causing forks into subprocess) just to stop them (killing the subprocess) which is done in the test setup Using `unittest.skip` decorators avoids the starting of the test in the first place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846 Approved by: https://github.com/Skylion007	2025-07-23 13:31:21 +00:00
Electron4444	832ab990c9	Use init_device_mesh API for select tests where possible (#158675 ) This addresses reviews made for: #158538 #108749 It interchanged all the specific DevideMesh constructor calls with the API provided by the test cases, to improve abstraction Pull Request resolved: https://github.com/pytorch/pytorch/pull/158675 Approved by: https://github.com/wconstab	2025-07-22 19:28:42 +00:00
Nikita Shulga	2cdafab0bd	[BE] Raise ValueError from `torch.cat` meta func (#158249 ) Followup after https://github.com/pytorch/pytorch/pull/155460 From [Python documentation](https://docs.python.org/3/library/exceptions.html#ValueError): > Raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError. Raise [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) when input-output types are incompatible with each other > Raised when an operation or function is applied to an object of inappropriate type. The associated value is a string giving details about the type mismatch. > This exception may be raised by user code to indicate that an attempted operation on an object is not supported, and is not meant to be. If an object is meant to support a given operation but has not yet provided an implementation, [NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) is the proper exception to raise. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158249 Approved by: https://github.com/jbschlosser, https://github.com/Skylion007, https://github.com/albanD	2025-07-20 23:49:18 +00:00
Simon Fan	80ac73c057	[ca] reset between tests (#158418 ) CA reset is much faster than dynamo reset, so it's probably okay to run it every time. I'm not sure if this will fix the flaky autograd tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158418 Approved by: https://github.com/jansel	2025-07-17 20:14:29 +00:00
Jiang, Yanbing	f4d8bc46c7	Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520 ) ### Description This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`. ``` torch.backends.mkldnn.matmul.fp32_precision = 'tf32' torch.backends.mkldnn.conv.fp32_precision = "tf32" ``` Related kernel update and UTs update are done. And the wrapper `bf32_on_and _off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520 Approved by: https://github.com/mingfeima, https://github.com/jansel	2025-07-17 08:57:34 +00:00
Hari Krishna Sai Kodali	9d184bda2f	add device generalization support for distributed tests (#156796 ) MOTIVATION To generalize Distributed test cases for non-CUDA devices CHANGES - test/distributed/checkpoint/test_fsspec.py - test/distributed/checkpoint/test_state_dict.py - test/distributed/test_multi_threaded_pg.py Replaced hard coded device names with torch.accelerator.current_accelerator - torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py support for hccl backend Pull Request resolved: https://github.com/pytorch/pytorch/pull/156796 Approved by: https://github.com/guangyey, https://github.com/ezyang	2025-07-16 09:37:03 +00:00
Xilun Wu	add0b450bd	[DTensor][BE] improve DTensor ops correctness check utils (#158112 ) Summary Implemented the test pattern described in https://github.com/pytorch/pytorch/pull/157991#discussion_r2196363170 as a util method in `DTensorTestBase`. The difference to `DTensorTestBase._test_op` is: 1. allowing users to specify the `Partial` placement. 2. supporting tree-like output structure. Test so far only adopt `DTensorTestBase._test_op_on_dtensor` in `DistTensorOpsTest.test_split_on_partial`. `pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial` Pull Request resolved: https://github.com/pytorch/pytorch/pull/158112 Approved by: https://github.com/Skylion007, https://github.com/zpcore ghstack dependencies: #158051	2025-07-15 04:50:34 +00:00
Nikita Shulga	9ca080db87	[MPS] Extend atomic operations to all int types (#158179 ) That fixes `index_put(..., accumulate=True)` for all dtypes int64 operation is not really atomic, but eventually consistent from the `index_put_accumulate` kernel point of view: i.e. by the end of the operation results in the global memory are indeed accumulation of the operands at given indices Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179 Approved by: https://github.com/dcci, https://github.com/Skylion007 ghstack dependencies: #158064, #158178	2025-07-14 04:25:05 +00:00
Howard Huang	f4406689b8	fix MPCT destroy_pg call (#157952 ) I was seeing hangs / exceptions not raising in some cases. Only call `c10d.destroy_process_group()` for `MultiProcessContinuousTest` in the clean exit case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157952 Approved by: https://github.com/fduwjj ghstack dependencies: #157589	2025-07-12 00:46:19 +00:00
Nikita Shulga	beed033b6e	[MPS] Fix `index_kernel` for large tensors (#158064 ) Move `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract `iter_tensor_offset` method, that returns an offset from the start of the storage associated with given tensor inside the iterator Migrated `index`, `index_put[_accumulate][_serial]` to the new paradigm that does not require additional tensor for indices nor special handling for 32 vs 64-bit offset, which resulted in almost 2x perf gain for 2000x2000 tensor, see results below before ``` [------------------------------------------------------------ -----------------------------------------------------------] \| 11x50x50 \| 11x100x100 \| 11x500x500 \| 11x1000x1000 \| 11x2000x2000 1 threads: ---------------------------------------------------------------------------------------------------------------- __getitem__ (torch.int8, torch.int64) \| 383.5 \| 379.8 \| 470.9 \| 1232.9 \| 4410.3 __getitem__ (torch.float16, torch.int64) \| 379.6 \| 354.5 \| 533.2 \| 1290.3 \| 4442.2 __getitem__ (torch.float32, torch.int64) \| 360.8 \| 338.6 \| 478.6 \| 1348.9 \| 4870.4 Times are in microseconds (us). ``` and after ``` [------------------------------------------------------------ -----------------------------------------------------------] \| 11x50x50 \| 11x100x100 \| 11x500x500 \| 11x1000x1000 \| 11x2000x2000 1 threads: ---------------------------------------------------------------------------------------------------------------- __getitem__ (torch.int8, torch.int64) \| 349.8 \| 330.5 \| 432.6 \| 764.5 \| 1961.2 __getitem__ (torch.float16, torch.int64) \| 342.5 \| 330.7 \| 434.7 \| 741.0 \| 1969.4 __getitem__ (torch.float32, torch.int64) \| 332.2 \| 326.1 \| 445.4 \| 751.3 \| 1972.6 Times are in microseconds (us). ``` While migrating also fixed index_put_accumulate for boolean types, by using compare_and_exchange trick over uint Fixes https://github.com/pytorch/pytorch/issues/153560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064 Approved by: https://github.com/dcci	2025-07-11 22:35:44 +00:00
Daisy Deng	8088958793	port 4 dynamo test files to Intel GPU (#157779 ) For https://github.com/pytorch/pytorch/issues/114850, we will port test cases to Intel GPU. Six dynamo test files were ported in PR [#156056](https://github.com/pytorch/pytorch/pull/156056) and [#156575](https://github.com/pytorch/pytorch/pull/156575.) In this PR we will port 4 more dynamo test files. We could enable Intel GPU with following methods and try the best to keep the original code styles: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - added XPU support in decorators like @requires_gpu - enabled XPU for some test path - added xfailIfXPU to skip xpu test when there is a bug. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157779 Approved by: https://github.com/guangyey, https://github.com/jansel	2025-07-11 10:11:49 +00:00
Dmitry Rogozhkin	cd80f9a4c3	xpu: support custom ops with torch.library on xpu backend (#152879 ) Fixes: https://github.com/intel/torch-xpu-ops/issues/1626 This PR started enabling of tests for `torch.library`, but more work is needed. Tests are using `torch._custom_ops` deprecated API planned for removal at pytorch 2.6 (not done). I think cleanup of pytorch would be nice before enabling more tests for xpu. `a2ccda3c60/torch/_custom_op/impl.py (L47)` CC: @EikanWang Pull Request resolved: https://github.com/pytorch/pytorch/pull/152879 Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/guangyey, https://github.com/albanD	2025-07-11 07:36:04 +00:00
Howard Huang	0f31445139	Add stack trace of exception to MultiProcContinousTest (#157589 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157589 Approved by: https://github.com/Skylion007	2025-07-08 17:54:35 +00:00
Kurt Mohler	510c398a4f	Add `max_pool3d` backward pass for MPS (#157498 ) Note on backward precision over fp16: A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024) \exp2(e)$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format)) Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.175 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024) \exp2(-3) \approx 0.002$. The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024) \exp2(-2) \approx 0.002$. Same for $e_b=-1$. So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9 * 0.002$. That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498 Approved by: https://github.com/malfet	2025-07-07 19:46:44 +00:00
Nikita Shulga	a952956d05	Add isnan exit condition to special ops (#157464 ) They might have been slow on CUDA-11.3, but this version of CUDA is long gone. More fundamental underlying issue were linear complexity of the recursive polynomial definitions for higher order polynomials, for example see this loop from implementation of Chebyshev polynomial of the first kind `7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)` which were tested by `test_compare_cpu` using following values (as sample index 16) `7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)` Luckily chebyshev polynomials for absolute values higher than 1 pretty quickly reach infinity, see below ``` python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))" tensor(nan) ``` Which is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7 Before ``` $ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_ ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.) return torch._C._get_cublas_allow_tf32() ....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss ---------------------------------------------------------------------- Ran 432 tests in 8.575s OK (skipped=344) ``` After ``` $ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_ ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.) return torch._C._get_cublas_allow_tf32() ........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss ---------------------------------------------------------------------- Ran 432 tests in 45.580s OK (skipped=72, expected failures=8) ``` Fixes https://github.com/pytorch/pytorch/issues/79528 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #157488	2025-07-05 04:19:50 +00:00
Teja	dd3e7170c2	Add async checkpointing impl to experimental checkpointer and add a builder API (#156927 ) 1. Adds an AsyncCheckpointer with out-of-process checkpointing and state_dict_stager with shared memory, pinned memory and Zero Overhead Support. 2. Adds two conveinient functions to create sync/async checkpointers Differential Revision: [D77336833](https://our.internmc.facebook.com/intern/diff/D77336833/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156927 Approved by: https://github.com/pradeepfn	2025-07-03 22:49:20 +00:00
Manuel Candales	d56f11a1f2	[MPS] Implement logcumsumexp metal kernel (#156858 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156858 Approved by: https://github.com/malfet ghstack dependencies: #157512	2025-07-03 18:16:25 +00:00
Manuel Candales	794b95d54b	Enable Half dtype for logcumsumexp_backward (#157512 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157512 Approved by: https://github.com/malfet	2025-07-03 18:13:38 +00:00
PyTorch MergeBot	c9174a20f7	Revert "[BE] Unskip special ops (#157464 )" This reverts commit `e124a0d88c`. Reverted https://github.com/pytorch/pytorch/pull/157464 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](`e124a0d88c`) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))	2025-07-03 15:24:15 +00:00
PyTorch MergeBot	b6276a425f	Revert "[MPS] Add `shifted_chebyshev_polynomial_[tuvw]` (#157488 )" This reverts commit `9620994067`. Reverted https://github.com/pytorch/pytorch/pull/157488 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](`e124a0d88c`) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))	2025-07-03 15:24:15 +00:00
Nikita Shulga	9620994067	[MPS] Add `shifted_chebyshev_polynomial_[tuvw]` (#157488 ) For eager and inductor As for all other chebyshev ops, logic is simply compiled from `94716db222/aten/src/ATen/native/cuda/Math.cuh (L2821)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/157488 Approved by: https://github.com/dcci ghstack dependencies: #157464	2025-07-02 23:29:35 +00:00
Nikita Shulga	e124a0d88c	[BE] Unskip special ops (#157464 ) They were slow on CUDA-11.3, which has long been gone, let's see if they work now Before ``` $ python test_ops.py -k chebyshev_polynomial_ ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.) return torch._C._get_cublas_allow_tf32() ....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss ---------------------------------------------------------------------- Ran 432 tests in 8.575s OK (skipped=344) ``` After ``` $ python test_ops.py -k chebyshev_polynomial_ ssssssss........................ssssssssssssssss......../home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.) return torch._C._get_cublas_allow_tf32() ........................................................................................ssssssss................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss ---------------------------------------------------------------------- Ran 432 tests in 42.379s OK (skipped=80) ``` Fixes https://github.com/pytorch/pytorch/issues/79528 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464 Approved by: https://github.com/Skylion007	2025-07-02 23:16:52 +00:00
Nikita Shulga	5e636d664a	[BE] `@serialTest` decorator must be called (#157388 ) Otherwise it turns test into a trivial one(that always succeeds), as following example demonstrates ```python import torch from torch.testing._internal.common_utils import serialTest, run_tests, TestCase class MegaTest(TestCase): @serialTest def test_foo(self): if hasattr(self.test_foo, "pytestmark"): print("foo has attr and it is", self.test_foo.pytestmark) print("foo") @serialTest() def test_bar(self): if hasattr(self.test_bar, "pytestmark"): print("bar has attr and it is", self.test_bar.pytestmark) print("bar") if __name__ == "__main__": run_tests() ``` That will print ``` test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})] bar ok test_foo (__main__.MegaTest.test_foo) ... ok ---------------------------------------------------------------------- Ran 2 tests in 0.013s ``` Added assert that arg is boolean in the decorator to prevent such silent skips in the future Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388 Approved by: https://github.com/clee2000	2025-07-02 19:15:19 +00:00
David Berard	82eefaedd9	[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322 ) Fixes #155006 Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322 Approved by: https://github.com/zou3519, https://github.com/drisspg	2025-07-02 14:02:01 +00:00
PyTorch MergeBot	c553c55be7	Revert "Fix full_like decomposition to preserve strides (#144765 )" This reverts commit `01b0f09931`. Reverted https://github.com/pytorch/pytorch/pull/144765 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal tests see [D77652778](https://www.internalfb.com/diff/D77652778), @jansel may you help get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/144765#issuecomment-3027975098))	2025-07-02 13:56:03 +00:00
haozhe.zhu	5a2db5152d	allow to use bf16 as fp32 internal precision for mkldnn conv (#126050 ) Allow to use `BF16` as the internal computation data types by `torch.backends.mkldnn.conv.fp32_precision="bf16"` ### TestPlan python test/test_mkldnn.py -k conv ### Benchmarking FP32 conv2d vs. BF16 internal computation conv2d on SPR Single core: Input \| fp32 ms \| bf16 internal ms \| Speed up -- \| -- \| -- \| -- IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 \| 185.5071 \| 83.4749 \| 2.22 IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 \| 194.7558 \| 79.1683\| 2.46 IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 \| 1.9213 \| 1.3690 \| 1.40 56 cores: Input \| fp32 ms \| bf16 internal ms \| Speed up -- \| -- \| -- \| -- IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 \| 6.5804 \| 7.4349 \| 0.89 IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 \| 4.9940 \| 3.8093 \| 1.31 IC: 256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 \| 8.8359 \| 5.5802 \| 1.58 IC: 1024, OC: 256, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 \| 16.5800 \| 9.2367 \| 1.80 IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 \| 79.5436 \| 38.3861 \| 2.07 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126050 Approved by: https://github.com/jgong5, https://github.com/jansel Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>	2025-07-02 01:31:23 +00:00
PyTorch MergeBot	ab6cb34480	Revert "[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322 )" This reverts commit `563fd95563`. Reverted https://github.com/pytorch/pytorch/pull/157322 on behalf of https://github.com/davidberard98 due to fails on rocm ([comment](https://github.com/pytorch/pytorch/pull/157322#issuecomment-3025826951))	2025-07-01 23:21:37 +00:00
David Berard	563fd95563	[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322 ) Fixes #155006 Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322 Approved by: https://github.com/zou3519, https://github.com/drisspg	2025-07-01 22:51:11 +00:00
Isuru Fernando	01b0f09931	Fix full_like decomposition to preserve strides (#144765 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144765 Approved by: https://github.com/amjames, https://github.com/jansel	2025-07-01 19:13:22 +00:00
Patryk Ozga	e959dd017d	[TSAN][live speech translation] Fix A data race in caffe2 (#156378 ) Summary: noticed that context quantized_engine is accessed and written from multiple threads Test Plan: ➜ fbsource buck test --flagfile fbcode/mode/dev-tsan //xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests -- --exact 'fbsource//xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests - Translate/LiveSpeechTranslationTests.LiveSpeechTranslationEnFr/silence___fr_en' Rollback Plan: Differential Revision: D76921416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156378 Approved by: https://github.com/jerryzh168, https://github.com/cyyever	2025-06-29 07:23:20 +00:00
Nikita Shulga	a1e4f1f98a	[MPS] Reimplement `tri[ul]` as Metal shaders (#157179 ) And add in-place flavor, as it is currently broken for non-contig tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/157179 Approved by: https://github.com/dcci	2025-06-28 01:33:18 +00:00
Ryan Guo	a4b59498c5	Fix fake kernel for the `out=...` variant of `unbind_copy` (#156643 ) `unbind_copy(..., out=...)` returns None rather than the `out` argument (see https://github.com/pytorch/pytorch/issues/130829#issuecomment-2283936222), but the old fake kernel didn't account for that and caused an assertion failure in `pushPyOutToStack`. This patch fixes that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156643 Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/bdhirsh ghstack dependencies: #156642	2025-06-27 01:34:07 +00:00
Kurt Mohler	e0447bb5f8	Add `max_pool3d` for MPS (#156467 ) Fixes #100674 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156467 Approved by: https://github.com/malfet	2025-06-26 23:33:50 +00:00
haozhe.zhu	53e0b9c393	refine fp32 precision api (#125888 ) Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop the "highest, high, medium" to represent fp32 internal computation data types . Instead, we will directly use the algorithm to represent it. ### Design Choice: Directly use algorithms name like "TF32", "BF16". #### Pros - The names are more informative. 'tf32' is more informative than a simple "high". - Easier to extend new algorithm like `tf32x3` #### Cons - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them. ### We provide a layered structure for backends/operators. ('f32' is short for 'fp32_precision') ![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067) ### We provide 3 fp32 compute precision can be set: - "ieee": Not allowed to use any other internal computation data types . - "tf32": Allowed to use tf32 as internal computation data types. - "bf16": Allowed to use bf16 as internal computation data types. - "none": Precision's are not set. Can be override by its father node. ### Overriding Precision Settings Child node can be override by its father node if it is set to default. For current default settings: ``` backend = generic, op = all, precision setting = none backend = cuda, op = all, precision setting = none backend = cuda, op = conv, precision setting = tf32 backend = cuda, op = rnn, precision setting = tf32 backend = cuda, op = matmul, precision setting = none backend = matmul, op = all, precision setting = none backend = matmul, op = conv, precision setting = none backend = matmul, op = rnn, precision setting = none backend = matmul, op = matmul, precision setting = none ``` - If the user set `torch.backends.mkldnn.fp32_precision="bf16"`, his child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be override to "bf16". - If the user set `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and his child nodes will also we override to "bf16". ### Backward Compatible Since new API allow user to have more fine-grained control. There will be some conflict. For example, previous `torch.backends.cudnn.allow_tf32` are not enough to represent the status for `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goal for backward compatible is - If the user only uses previous APIs, it will work as previous expectations. - If the user use new API to change the status to an un-representable status for old API, and try to access the status by old API. We will raise Runtime Error and point the document for user. ### Test Plan ``` python test/test_cuda.py -k test_fp32_precision_with_tf32 python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision python test/test_cuda.py -k test_invalid_status_for_legacy_api python test/test_mkldnn.py -k test_mlkdnn_get_set python test/test_mkldnn.py -k test_generic_precision python test/test_mkldnn.py -k test_invalid python test/test_mkldnn.py -k test_default_use_parent ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888 Approved by: https://github.com/jgong5, https://github.com/albanD Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>	2025-06-26 10:32:20 +00:00
fduwjj	4585c33e74	[symm_mem] Fix nccl test for symm mem (#156752 ) Try not to call set_device to Fixes #156569 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156752 Approved by: https://github.com/kwen2501	2025-06-26 02:59:38 +00:00
Xinya Zhang	d9577df312	[ROCm] Bump AOTriton to 0.10b (#156499 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b: * Official support of gfx950/gfx1201 * Experimental support of gfx1101/gfx1151/gfx1150/gfx1200 * Reduce libaotriton.so binary size by over 80%. + Without this optimization the binary size of `libaotriton.so` could be over 100MiB due to 2x more supported architectures compared with 0.9b. Now it is only about 11MiB. * Support sliding window attention (SWA) in `_flash_attention_forward/backward`. Should fix #154582 See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details, including Known Problems. Notable changes to SDPA backend: * `std::optional<int64_t>` `window_size_left/right` are directly passed to ROCM's SDPA backend, because the default value `-1` is meaningful to AOTriton's backend and bottom-right aligned causal mask is implemented with negative `window_size_left/right` * Some code clean up around `USE_CK_FLASH_ATTENTION` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd	2025-06-25 07:09:03 +00:00
Manuel Candales	2d7e6c6241	[MPS] Revert cumsum/cumprod to MPSGraph implementation (#156708 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156708 Approved by: https://github.com/malfet	2025-06-24 18:12:18 +00:00
Dmitry Nikolaev	6e17315cd3	Skip FSDP tests if device count is less then requested world_size value (#155836 ) Usually `world_size=torch.cuda.device_count()` for FSDPTest-based tests But distributed test class `TestFullyShardAllGatherExtensionsMultiProcess` [forces to use `world_size=2`](`0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L170)`) even for 1 GPU. Then NCCL fails with errors: ``` HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity ... ncclInvalidUsage: This usually reflects invalid usage of NCCL library. Duplicate GPU detected : rank 1 and rank 0 both on CUDA device c000 Duplicate GPU detected : rank 0 and rank 1 both on CUDA device c000 ``` The test method [has `@skip_if_lt_x_gpu(2)` decorator](`0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L209)`), but test fails during test class initialization before decorator activation This PR will skip FSDPtest-based tests if `world_size > torch.cuda.device_count()` ``` HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity ... dist init r=0, world=2 dist init r=1, world=2 SKIPPED [15.5507s] (Need at least 2 CUDA devices) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/155836 Approved by: https://github.com/jeffdaily	2025-06-24 16:38:23 +00:00
Aby Mathew C	6835ba1b34	Register hpu device to fake backend (#156076 ) ## MOTIVATION This PR intends to add hpu ( Intel Gaudi) also to the list of devices that will be supported by the "fake" distributed backend and the process group that will be created. ## CHANGES - Add "hpu" to the list of devices @ankurneog, @EikanWang Pull Request resolved: https://github.com/pytorch/pytorch/pull/156076 Approved by: https://github.com/d4l3k, https://github.com/EikanWang, https://github.com/albanD	2025-06-23 16:08:08 +00:00
Xuehai Pan	cec2977ed2	[BE][6/16] fix typos in torch/ (#156316 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316 Approved by: https://github.com/albanD ghstack dependencies: #156313, #156314, #156315	2025-06-23 02:57:34 +00:00
PyTorch MergeBot	3f44fdc03d	Revert "[BE][6/16] fix typos in torch/ (#156316 )" This reverts commit `b210cf1ea5`. Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Sidharth	aeaf6b59e2	[dynamo] Weblink generation when unimplemented_v2() is called (#156033 ) This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033 Approved by: https://github.com/williamwen42	2025-06-22 11:39:31 +00:00
Xuehai Pan	b210cf1ea5	[BE][6/16] fix typos in torch/ (#156316 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316 Approved by: https://github.com/albanD ghstack dependencies: #156313, #156314, #156315	2025-06-22 08:43:33 +00:00
Simon Fan	f1968a5e76	[ca] skip on some PYTORCH_TEST_WITH_DYNAMO=1 autograd tests (#156374 ) These aren't supported. Not sure how they passed CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/156374 Approved by: https://github.com/jansel	2025-06-21 18:33:38 +00:00
atalman	a47ca4fc74	Revert "[dynamo] Weblink generation when unimplemented_v2() is called (#156033 )" (#156546 ) Broke multiple CI jobs: dynamo/test_reorder_logs.py::ReorderLogsTests::test_constant_mutation [GH job link](https://github.com/pytorch/pytorch/actions/runs/15792695433/job/44521220864) [HUD commit link](`9de23d0c29`) This reverts commit `9de23d0c29`. PyTorch bot revert failed: https://github.com/pytorch/pytorch/pull/156033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156546 Approved by: https://github.com/jansel	2025-06-21 14:10:12 +00:00
Sidharth	9de23d0c29	[dynamo] Weblink generation when unimplemented_v2() is called (#156033 ) This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033 Approved by: https://github.com/williamwen42	2025-06-21 05:47:54 +00:00
PyTorch MergeBot	1036f6d114	Revert "[ROCm] Bump AOTriton to 0.10b (#156290 )" This reverts commit `34d8e64ef6`. Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))	2025-06-20 15:35:25 +00:00
Hari Krishna Sai Kodali	e1f28fe17b	add device generalisation support for distributed tests (#152471 ) ### MOTIVATION To generalize Distributed test cases for non-CUDA devices ### CHANGES - test/distributed/optim/test_zero_redundancy_optimizer.py - test/distributed/test_c10d_logger.py - test/distributed/test_compute_comm_reordering.py Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp. DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions. - torch/testing/_internal/common_distributed.py extended common utility functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471 Approved by: https://github.com/d4l3k	2025-06-20 07:35:42 +00:00
Nikita Shulga	4cbbc8b458	[MPS] Implement backward pass for interpolate_trilinear (#156373 ) Backwards pass simply iterates over all 8 points current point contributed to, and back propagates them with the respective weights TODO: Benchmark the performance of similar loop for the forward pas (i.e. compiler should be able to do loop unrolling, so no point of unrolling it by hand) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373 Approved by: https://github.com/dcci ghstack dependencies: #156375	2025-06-20 05:41:24 +00:00
Xinya Zhang	34d8e64ef6	[ROCm] Bump AOTriton to 0.10b (#156290 ) Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b: * Official support of gfx950/gfx1201 * Experimental support of gfx1101/gfx1151/gfx1150/gfx1200 * Reduce libaotriton.so binary size by over 80%. + Without this optimization the binary size of `libaotriton.so` could be over 100MiB due to 2x more supported architectures compared with 0.9b. Now it is only about 11MiB. * Support sliding window attention (SWA) in `_flash_attention_forward/backward`. Should fix #154582 See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details, including Known Problems. Notable changes to SDPA backend: * `std::optional<int64_t>` `window_size_left/right` are directly passed to ROCM's SDPA backend, because the default value `-1` is meaningful to AOTriton's backend and bottom-right aligned causal mask is implemented with negative `window_size_left/right` * Some code clean up around `USE_CK_FLASH_ATTENTION` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290 Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily	2025-06-19 21:13:58 +00:00
Nikita Shulga	c2f4cc59a7	[MPS] Fix bug in 3d coords calculation (#156375 ) Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate` Though it's indirectly tested by `test_upsample_nearest3d` inductor test Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375 Approved by: https://github.com/atalman	2025-06-19 19:56:15 +00:00
Nikita Shulga	36f7a027b5	[MPS] Implement upsample_trilinear as Metal shader (#156263 ) But only forward for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/156263 Approved by: https://github.com/dcci ghstack dependencies: #156256, #156090	2025-06-18 16:10:02 +00:00
Nikita Shulga	c28e74e457	[MPS] Add nearest_3d forward and backward (#156090 ) Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS Delete `upsample_nearest3d` MPS fallback and replace it with proper shader Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090 Approved by: https://github.com/kulinseth, https://github.com/dcci ghstack dependencies: #156256	2025-06-18 04:48:15 +00:00
Daniel Galvez	9ed0060225	Provide access to the cudaGraph_t underlying a CUDAGraph. (#155164 ) There are a few considerations here: 1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404 2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine. 3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications. 4. There are two foot guns: a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()). b. The following seuquence won't work as intended: ``` g = torch.cuda.CUDAGraph() with torch.cuda.graph(g): foo() g.replay() raw_graph = g.raw_cuda_graph() modify(raw_graph) g.replay() ``` This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive. I think these two foot guns are probably okay given that this a bit of an experts' API. Fixes #155106 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164 Approved by: https://github.com/ngimel	2025-06-18 03:39:28 +00:00
Manuel Candales	a4ea242edc	[MPS] Implement scan metal kernels (#156100 ) Implements metal kernels for scan operations: - Migrates cumsum and cumprod from MPSGraph implementation to Metal. - Fixes #154881 - Adds MPS backend support for cummin and cummax Pull Request resolved: https://github.com/pytorch/pytorch/pull/156100 Approved by: https://github.com/malfet	2025-06-17 17:44:22 +00:00
Nikita Shulga	b1713c6655	[MPS][Testing][BE] Fix samples for full_like (#156026 ) Now that device is known, one can avoid creating tensors of `torch.double` type Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026 Approved by: https://github.com/dcci ghstack dependencies: #156121	2025-06-17 04:46:26 +00:00
PyTorch MergeBot	03488d820c	Revert "[MPS][Testing][BE] Fix samples for full_like (#156026 )" This reverts commit `2d832c9587`. Reverted https://github.com/pytorch/pytorch/pull/156026 on behalf of https://github.com/atalman due to Sorry breaks MPS tests: test_ops.py::TestMathBitsCPU::test_neg_view_full_like_cpu_float64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683608879/job/44182730620) [HUD commit link](`2d832c9587`) ([comment](https://github.com/pytorch/pytorch/pull/156026#issuecomment-2977903074))	2025-06-16 19:50:26 +00:00
Simon Fan	907d0931cc	[ca] default on in CI, with fallback for tests in test/compiled_autograd_skips/ (#155480 ) For every test that is ran with PYTORCH_TEST_WITH_DYNAMO=1, turn on compiled autograd via config if it is not skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/155480 Approved by: https://github.com/jansel ghstack dependencies: #155521	2025-06-16 18:45:03 +00:00
Nikita Shulga	2d832c9587	[MPS][Testing][BE] Fix samples for full_like (#156026 ) Now that device is known, one can avoid creating tensors of `torch.double` type Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026 Approved by: https://github.com/dcci	2025-06-16 14:27:42 +00:00
Nikita Shulga	831c9010c7	[BE] Remove non-existing operator from unimplemented list (#156025 ) Never heard of torch.login :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156025 Approved by: https://github.com/dcci	2025-06-16 14:14:58 +00:00
David Berard	c83041cac2	[test][triton pin] add device-side TMA tests (AOTI + test_triton_kernels) (#155827 ) Tests added: ``` python test/inductor/test_triton_kernels.py -k test_on_device_tma python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma ``` These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there's two triton kernels - one for old api and one for new api - and a given version of the test will only run if that version of the API is available). Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827 Approved by: https://github.com/FindHao ghstack dependencies: #155777, #155814	2025-06-15 20:24:19 +00:00
Howard Huang	8e1471bdc9	Allow MultiProcContinuousTest to set world_size (#155920 ) `MultiProcContinuousTest` will automatically set world_size to number of devices. This change allows this attribute to be modified by the derived test class Pull Request resolved: https://github.com/pytorch/pytorch/pull/155920 Approved by: https://github.com/fduwjj	2025-06-15 00:24:17 +00:00
Aaron Orenstein	e95e8eed0a	mypy 1.16.0 (#155821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155821 Approved by: https://github.com/ezyang, https://github.com/zou3519	2025-06-14 18:18:43 +00:00
Marcin Pioch	ce79056471	Custom FX pass for inductor's backend registration (#154841 ) This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-06-14 17:29:54 +00:00
Sean McGovern	297805fd8f	Typo fixes for "overridden" in comments and function names (#155944 ) This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944 Approved by: https://github.com/Skylion007	2025-06-14 03:37:38 +00:00
Nikita Shulga	fec571cfd4	[BE][CI] Remove hardshrink integer exclusions (#155965 ) As they are not called anyway Pull Request resolved: https://github.com/pytorch/pytorch/pull/155965 Approved by: https://github.com/dcci	2025-06-14 00:32:57 +00:00
Prachi Gupta	f2b44424a1	[ROCm] Skip _stress_cuda and test_ddp_apply_optim_in_backward (#155724 ) These tests are flaky on ROCm and have been skipped via Github issues, but the bot keeps closing the issues after not observing the failures for these tests in the rerun_disabled_tests runs (not sure why they don't fail there), and we have to keep reopening them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155724 Approved by: https://github.com/jeffdaily Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>	2025-06-12 21:18:04 +00:00
Manuel Candales	f8d93b3783	[MPS] Migrate hardswish (forward and backward) to Metal kernel (#155479 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155479 Approved by: https://github.com/kulinseth, https://github.com/malfet ghstack dependencies: #155304, #155316, #155462	2025-06-11 20:58:46 +00:00
David Berard	9328a7fb58	[triton pin][tests] refactor test_triton_kernel.py tests to test new & old API (#155510 ) This splits out the tests so we can independently test both the new and old API. Note: the new API doesn't work yet - we still need to fix those tests. Differential Revision: [D76318840](https://our.internmc.facebook.com/intern/diff/D76318840) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155510 Approved by: https://github.com/oulgen	2025-06-11 13:52:15 +00:00
Kurt Mohler	013cf1e330	[MPS] Move expm1 op to Metal (#155611 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155611 Approved by: https://github.com/malfet	2025-06-11 13:06:14 +00:00
Oguz Ulgen	0e2013a12d	Add helion x pt2 test (#155513 ) This kinda just worked out of the box, shocking. PT2 traced into helion and emitted it as a user defined triton kernel: P1836496774 In the long run, we do not actually want this, but rather to create a helion HOP so we can do fusions etc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155513 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-06-11 07:08:06 +00:00
Siddharth Kotapati	2161be8497	Move unary trig ops to metal kernels (#154465 ) Move inverse trig unary ops, sinh, & cosh to metal kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/154465 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-06-10 22:56:59 +00:00
atalman	7a03b0d2ca	[BE] Remove CUDA 11 artifacts. Fix Check Binary workflow (#155555 ) Please see: https://github.com/pytorch/pytorch/issues/147383 1. Remove CUDA 11 build and test artifacts. One place CUDA 12.4 2. Fix Check Binary Workflow to use Stable Cuda version variable rather then hardcoded one Pull Request resolved: https://github.com/pytorch/pytorch/pull/155555 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-06-10 21:32:08 +00:00
redwrasse	8a22551300	Fixes OpInfo gradient checks for ctc_loss (#154590 ) Fixes #67462 Re-enables `OpInfo` gradient checks for the restricted scenarios where the current `ctc_loss` implementation is accurate and consistent. The desired `ctc_loss` gradient behavior appears to be an ongoing discussion, see https://github.com/pytorch/pytorch/issues/52241. The `OpInfo` gradient checks can be updated if/as the underlying implementation advances. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154590 Approved by: https://github.com/soulitzer	2025-06-10 19:56:39 +00:00
Manuel Candales	0f47e76937	[MPS] Implement hardshrink metal kernel (#155304 ) Implements the forward and backward hardshrink operators as Metal kernels. In order to support the lambda parameter, we extend the `exec_unary_kernel` and `exec_binary_kernel` methods. Now they take an optional Scalar and an optional ScalarType argument. When the optional ScalarType is provided, it overrides the type of the Scalar. We add a new `REGISTER_UNARY_ALPHA_OP` macro, and modify the existing `REGISTER_BINARY_ALPHA_OP` to support the new feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155304 Approved by: https://github.com/malfet	2025-06-10 18:20:27 +00:00
Kurt Mohler	e7698ff5cf	[MPS] Move abs op to Metal (#155474 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155474 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-06-10 00:23:59 +00:00
Roy Hvaara	b95dadd717	[MPS] Enable RProp test for non-contiguous (#155439 ) I believe this issue has already been fixed, but I don't know the hero PR. I'm relying on ci signals to verify it's fixed across macOS versions. Fixes #118117 xref #115350 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155439 Approved by: https://github.com/Skylion007	2025-06-09 21:29:09 +00:00
Roy Hvaara	3490a4f906	[MPS] Enable optimizer tests affected by addcdiv (#155437 ) Tracked in #118115. Fixed in #124442. This PR unskips the tests. xref #115350 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155437 Approved by: https://github.com/Skylion007	2025-06-09 21:27:37 +00:00
Nikita Shulga	abbdf9f363	[BE][Testing] Unskip `ones_like`/`zeros_like` testing on MPS (#155476 ) But skip `double` dtype form OpInfo variants for this test Pull Request resolved: https://github.com/pytorch/pytorch/pull/155476 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-06-09 20:37:44 +00:00
PyTorch MergeBot	79bdafe5b6	Revert "Custom FX pass for inductor's backend registration (#154841 )" This reverts commit `e694280d12`. Reverted https://github.com/pytorch/pytorch/pull/154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](https://github.com/pytorch/pytorch/pull/154841#issuecomment-2956357711))	2025-06-09 16:56:45 +00:00
Nikita Shulga	f140fac8dc	[MPS] Implement erfc (#155382 ) And migrate `erf` to Metal kernel Use `erf` approximations from https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/erf.h as previous approximation did not match the CPU implementation After that, `erfc(x) := 1.0 - erf(x)` Fixes https://github.com/pytorch/pytorch/issues/155337 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155382 Approved by: https://github.com/manuelcandales, https://github.com/dcci	2025-06-07 02:35:12 +00:00
Marcin Pioch	e694280d12	Custom FX pass for inductor's backend registration (#154841 ) This PR is related to RFC #153532. It is an extension to Inductor's backend registration interface to allow to register custom FX passes by the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154841 Approved by: https://github.com/jansel Co-authored-by: Jason Ansel <jansel@jansel.net>	2025-06-06 06:49:44 +00:00
Narek Malkhasyan	7999735d23	[CUDA][MPS] Fix torch.arange bound validation for large float inputs (#154320 ) Fixes #153133 Fixes an inconsistency in torch.arange on CUDA and MPS backends when using float32 and large input values. Previously, invalid ranges (e.g., start > end with a positive step) could silently return empty tensors due to precision loss in validation logic. The fix introduces double precision validation for checking whether the step sign is consistent with the range direction. This ensures torch.arange behaves consistently with CPU for large float32 inputs, and raises an appropriate error when the range is invalid. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154320 Approved by: https://github.com/malfet	2025-06-05 14:51:25 +00:00
PyTorch MergeBot	20912673a6	Revert "Add __main__ guards to jit tests (#154725 )" This reverts commit `1a55fb0ee8`. Reverted https://github.com/pytorch/pytorch/pull/154725 on behalf of https://github.com/malfet due to This added 2nd copy of raise_on_run to common_utils.py which caused lint failures, see https://github.com/pytorch/pytorch/actions/runs/15445374980/job/43473457466 ([comment](https://github.com/pytorch/pytorch/pull/154725#issuecomment-2940503905))	2025-06-04 15:42:52 +00:00
Anthony Barbier	1a55fb0ee8	Add __main__ guards to jit tests (#154725 ) This PR is part of a series attempting to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs. In jit tests: - Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run. - Raise a RuntimeError on tests which have been disabled (not run) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154725 Approved by: https://github.com/Skylion007	2025-06-04 14:44:08 +00:00
Anthony Barbier	c8d44a2296	Add __main__ guards to fx tests (#154715 ) This PR is part of a series attempting to re-submit #134592 as smaller PRs. In fx tests: - Add and use a common raise_on_run_directly method for when a user runs a test file directly which should not be run this way. Print the file which the user should have run. - Raise a RuntimeError on tests which have been disabled (not run) - Remove any remaining uses of "unittest.main()"" Pull Request resolved: https://github.com/pytorch/pytorch/pull/154715 Approved by: https://github.com/Skylion007	2025-06-04 14:38:50 +00:00
Nikita Shulga	9f39028629	[MPS][BE] Move sigmoid op to Metal (#155080 ) Fixes https://github.com/pytorch/pytorch/issues/154895 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155080 Approved by: https://github.com/dcci, https://github.com/cyyever ghstack dependencies: #154936, #155002, #155081	2025-06-04 03:28:11 +00:00
Natalia Gimelshein	34e3930401	fix numpy compatibility for 2d small list indices (#154806 ) Will fix #119548 and linked issues once we switch from warning to the new behavior, but for now, given how much this syntax was used in our test suite, we suspect a silent change will be disruptive. We will change the behavior after 2.8 branch is cut. Numpy behavior was changed at least in numpy 1.24 (more than 2 years ago) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154806 Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD	2025-06-04 01:58:52 +00:00
Nikita Shulga	f714599c57	[MPS][BE] Extend torch.special. to integer dtypes (#155002 ) By changing the functor to looks as follows ```metal struct xlog1py_functor { template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true> inline T operator()(const T a, const T b) { return static_cast<T>(c10:🤘:xlog1py(a, b)); } template <typename T, enable_if_t<is_integral_v<T>, bool> = true> inline float operator()(const T a, const T b) { return c10:🤘:xlog1py(float(a), float(b)); } }; ``` Repeat the same for `zeta`, `chebyshev_polynomial_[tuvw]_functor` and `hermite_polynomial_h[e]_functor` Pull Request resolved: https://github.com/pytorch/pytorch/pull/155002 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: #154936	2025-06-03 17:52:41 +00:00
Nikita Shulga	e9266f807a	[BE] Use vendored packaging for testing (#154946 ) As the rest of the torch uses it, test should rely on it as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/154946 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2025-06-03 14:22:53 +00:00
Nikita Shulga	9cdce682a1	[MPS][BE] Reimplement log1p as Metal shader (#154936 ) That should make it faster than MPSGraph implementation, but also improves accuracy for small inputs, by using the algorithm described in [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1202), i.e. $log(1+x) = \frac{x * log(1+x)}{(1 + x) - 1}$ if $1 +x \neq 1$ else just $x$ Also tried using first 3 elements of Taylor series in Horner's form which also seems to work fine, i.e. $log(1+x) \approx x * (1 -x (\frac{1}{2} - \frac{x}{3}))$ Replaced less accurate log1p implementation in `c10/metal/special_math.h` with generic one. Parametrize and modify regression test to check for accuracy of small values TODOs: - Do proper implementation for complex values as well, perhaps using `0408ba0a76/mlx/backend/metal/kernels/utils.h (L339)` - May be implement it using Remez-like algorithm documented here `207f3b2b25/lib/msun/src/s_log1pf.c (L37)` - Or use llvm's implementation from `f393986b53/libclc/clc/lib/generic/math/clc_log1p.inc (L22)` - Benchmark which algorithm is faster and delivers better accuracy Pull Request resolved: https://github.com/pytorch/pytorch/pull/154936 Approved by: https://github.com/dcci, https://github.com/Skylion007	2025-06-03 14:10:13 +00:00
Guilherme Leobas	731e635c95	Add CPython math/cmath tests (#150794 ) Tests: * test_math.py * test_cmath.py Minor changes were made to each test to run them inside Dynamo One can reproduce the changes by downloading the tests from CPython and applying the diff: ```bash for f in "test_math" "test_cmath"; do wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py" git apply "test/dynamo/cpython/3_13/${f}.diff" done ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150794 Approved by: https://github.com/zou3519	2025-06-02 20:49:44 +00:00
Joona Havukainen	981bdb39ca	Enable ConvTranspose3D for FP32 and Complex64 (#154696 ) Fixes #154615 Enables using ConvTranspose3D since it seems support exists both on MacOS 14 and 15. For the half dtypes the discrepancy of CPU and GPU implementations is too large to conclude whether there is a bug in the implementation or not without a more rigorous study on what bounds are there to the expected error. So they are left unsupported for now and an assert is added to notify the user if the op is called with fp16 or bf16 inputs. Tests for ConvTranspose3D were enabled for the supported data types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154696 Approved by: https://github.com/malfet	2025-06-02 16:24:03 +00:00
Chen Lai	9371491529	[Reland][pytorch] Patch the _is_conv_node function (#154473 ) Summary: Add the conv padding ops in pytorch, the corresponding pr in torch ao is https://github.com/pytorch/ao/pull/2257 Test Plan: ``` buck test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_conv_padding_bn_relu (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)' ``` Differential Revision: D75494468 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154473 Approved by: https://github.com/Skylion007	2025-05-30 00:41:03 +00:00
Nikita Shulga	d6cb0fe576	[MPS] Extend index_copy support to complex dtypes (#154671 ) Should have noticed it during the review Pull Request resolved: https://github.com/pytorch/pytorch/pull/154671 Approved by: https://github.com/dcci ghstack dependencies: #154670	2025-05-30 00:28:13 +00:00
Isalia20	41092cb86c	[MPS] index copy impl (#154326 ) Second most requested op according to #154052 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154326 Approved by: https://github.com/malfet	2025-05-29 16:57:43 +00:00
Xuehai Pan	7ae204c3b6	[BE][CI][Easy] Run `lintrunner` on generated `.pyi` stub files (#150732 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150732 Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/aorenste	2025-05-27 14:58:02 +00:00
Yuanhao Ji	0a7eef140b	Add `torch.Tensor._make_wrapper_subclass` to `torch/_C/__init__.pyi` (#154022 ) Fixes #153790 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154022 Approved by: https://github.com/Skylion007	2025-05-27 14:10:00 +00:00
Pat Vignola	70d12ccc3f	[Torch] Fix error message formatting in fp8 comparison logic (#153647 ) Summary: Using `\` includes all the tabs from the next line in the error message. Test Plan: Nothing, simply error message fixing Reviewed By: exclamaforte Differential Revision: D74539234 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153647 Approved by: https://github.com/exclamaforte	2025-05-27 02:51:05 +00:00
Ke Wen	8c16d0e404	[c10d] Add support for testing SIGABRT return (#153167 ) `SIGABRT` is a common return by negative distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc. These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`. Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167 Approved by: https://github.com/fduwjj	2025-05-26 00:56:05 +00:00
PyTorch MergeBot	54932d865e	Revert "[c10d] Add support for testing SIGABRT return (#153167 )" This reverts commit `03e102dbe8`. Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to It broke lint ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2907820789))	2025-05-25 13:17:27 +00:00
Ke Wen	9d922b55ef	[Distributed][CI] Rework continuous TestCase (#153653 ) 1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file). 2. The child processes are now an infinite loop, monitoring test IDs passed from main process via a task queue. Reciprocally, the child processes inform the main process completion of a test via a completion queue. 3. Added a test template. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653 Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj	2025-05-25 03:49:29 +00:00
Ke Wen	03e102dbe8	[c10d] Add support for testing SIGABRT return (#153167 ) `SIGABRT` is a common return by negative distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc. These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`. Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167 Approved by: https://github.com/fduwjj	2025-05-25 03:48:34 +00:00
Aaron Orenstein	6503b4a96e	Update to using mypy 1.15 (#154054 ) The BC break isn't real - mypy decided to start complaining about the way we were typing that function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154054 Approved by: https://github.com/Skylion007	2025-05-24 04:30:57 +00:00
Nikita Shulga	975bbc63db	[MPS][BE] Move fmod/remainder to Metal ops (#154280 ) This accomplishes following: - Fixes correctness problem with large integer types (though probably makes it slower, but this could not be avoided if one wants to compute accurate answer) - Makes op faster for floating point types (as Metal kernel invocation is faster than creating MPSGraph) - Eliminates need for several correctness workarounds Fixes https://github.com/pytorch/pytorch/issues/154171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280 Approved by: https://github.com/dcci ghstack dependencies: #154275, #154290	2025-05-24 01:45:33 +00:00
PyTorch MergeBot	28af44285b	Revert "[c10d] Add support for testing SIGABRT return (#153167 )" This reverts commit `499a76b844`. Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to Broke lint, see `fe784c5a2c/1` ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2905623868))	2025-05-23 19:44:08 +00:00
Ke Wen	499a76b844	[c10d] Add support for testing SIGABRT return (#153167 ) `SIGABRT` is a common return by negative distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc. These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`. Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167 Approved by: https://github.com/fduwjj	2025-05-23 19:04:28 +00:00
PyTorch MergeBot	561a11aa68	Revert "Patch the _is_conv_node function (#153749 )" This reverts commit `c985cec5b2`. Reverted https://github.com/pytorch/pytorch/pull/153749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153749#issuecomment-2905504697))	2025-05-23 19:04:20 +00:00
Tsung-Hsien Lee	cae25ef4e5	[c10d] Enhance Error Logging in `new_subgroups()` for Non-Divisible World Sizes (#154124 ) Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness. Test Plan: contbuild & OSS CI Differential Revision: D75226925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124 Approved by: https://github.com/wz337	2025-05-23 17:12:43 +00:00
PyTorch MergeBot	59c5fff2aa	Revert "[DDP] rebuilt bucket order when find_unused_parameters=true (#153404 )" This reverts commit `a79e621c1c`. Reverted https://github.com/pytorch/pytorch/pull/153404 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/153404#issuecomment-2902741300))	2025-05-22 22:26:59 +00:00
Chen Lai	c985cec5b2	Patch the _is_conv_node function (#153749 ) Summary: torch.ops.aten.conv2d.padding is also conv2d node Differential Revision: D74898941 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153749 Approved by: https://github.com/andrewor14, https://github.com/Skylion007	2025-05-22 22:17:02 +00:00
Michael Lazos	423fc671e9	[Cutlass] Support float8_e4m3fn GEMM (#153890 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153890 Approved by: https://github.com/drisspg, https://github.com/eellison	2025-05-22 08:37:33 +00:00
soulitzer	f2af30fee5	Add a HOP to bypass tracing of a wrapper function while tracing the wrapped function (#153487 ) Usage: ```python from torch._higher_order_ops.wrap import dynamo_bypassing_wrapper # Your ordinary function wrapper def my_hop_fn_impl(fn, args, k=1, kwargs): def wrapper(args, *kwargs): out = fn(args, *kwargs) if isinstance(out, tuple): return (out[0] + k,) return out + k return wrapper # Calling `my_hop_fn` instead of the impl directly captures a HOP into the dynamo graph def my_hop_fn(fn, args, k=1, *kwargs): return dynamo_bypassing_wrapper( functools.partial(my_hop_fn_impl, k=k), fn, args, **kwargs ) ``` Notes: - The dynamo captured graph now stashes arbitrary callable objects (the wrapper_fn) - this is equivalent to what SAC does today with policy_fn. - The `wrapper_fn` passed to `dynamo_bypassing_wrapper ` should have signature `Callable -> Callable` Pull Request resolved: https://github.com/pytorch/pytorch/pull/153487 Approved by: https://github.com/ydwu4	2025-05-22 04:24:38 +00:00
Slawomir Siwek	3742b7fb3a	Treat dim=[] same as dim=None (#153570 ) Fixes https://github.com/pytorch/pytorch/issues/153568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153570 Approved by: https://github.com/ngimel	2025-05-20 22:44:29 +00:00
Joel Schlosser	3ecd444004	Support independent builds for cpp extension tests + apply to libtorch_agnostic tests (#153264 ) Related: #148920 This PR: * Provides a helper `install_cpp_extension(extension_root)` for building C++ extensions. This is intended to be used in `TestMyCppExtension.setUpClass()` * Updates libtorch_agnostic tests to use this * Deletes preexisting libtorch_agnostic tests from `test/test_cpp_extensions_aot.py` * Fixes `run_test.py` to actually run tests in `test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py` to avoid losing coverage. This wasn't being run due to logic excluding tests that start with "cpp"; this is fixed now After this PR, it is now possible to run: ``` python test/cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py ``` and the test will build the `libtorch_agnostic` extension before running the tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153264 Approved by: https://github.com/janeyx99	2025-05-20 19:18:09 +00:00
Yanli Zhao	a79e621c1c	[DDP] rebuilt bucket order when find_unused_parameters=true (#153404 ) Differential Revision: D72437251 Enable to rebuild bucket order when find_unused_parameters=true. It should be always better than not rebuilding bucket order when find_unused_parameters=True: 1. for cases where bucket order in the first iteration is the same as the parameter order, rebuilding bucket order will not change anything 2. for cases where bucket order in the first iteration is not the same as the parameter order, there could be two cases: a. bucket order will not change after 1st iteration even the graph is dynamic and there is unused parameter, in this case, rebuilding bucket order will have performance gain b. bucket order change after 1st iteration due to dynamic graph, in this case, both parameter order and bucket order in 1st iteration are not ideal, so rebuilding bucket order or not does not matter it can help case 2.a if enabling to rebuild bucket order when find_unused_parameters=true. meanwhile it will not hurt other cases in 1 and 2.b. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153404 Approved by: https://github.com/rohan-varma, https://github.com/fegin	2025-05-20 02:45:01 +00:00
PyTorch MergeBot	674a85cf26	Revert "[Distributed][CI] Rework continuous TestCase (#153653 )" This reverts commit `0d5c628a6e`. Reverted https://github.com/pytorch/pytorch/pull/153653 on behalf of https://github.com/kwen2501 due to More fixes needed ([comment](https://github.com/pytorch/pytorch/pull/153653#issuecomment-2891931028))	2025-05-19 18:29:27 +00:00
Ke Wen	0d5c628a6e	[Distributed][CI] Rework continuous TestCase (#153653 ) 1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file). 2. The child processes are now an infinite loop, monitoring test IDs passed from main process via a task queue. Reciprocally, the child processes inform the main process completion of a test via a completion queue. 3. Added a test template. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653 Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj	2025-05-19 18:20:42 +00:00
Tsung-Hsien Lee	6487ea30b3	[c10d] Fix `new_subgroups(group=)` bug (#153798 ) Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to the `get_rank()` function, which caused the function to return the rank of the entire group instead of the rank of the current process. The fix involves removing the `group` parameter from the `get_rank()` function call. Test Plan: contbuild & OSS CI Differential Revision: D74964213 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798 Approved by: https://github.com/Skylion007	2025-05-19 17:01:10 +00:00
Nikita Shulga	6fe5d9215f	[EZ][MPS] Enable rsub op (#153786 ) Nothing really to enable, just add it to native functions, TensorIterator abstraction takes care of the rest Pull Request resolved: https://github.com/pytorch/pytorch/pull/153786 Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/dcci	2025-05-19 02:01:48 +00:00
Nikita Shulga	62d8e3cb40	[BE][MPS] Cleanup log ops migration (#153727 ) Introduced by https://github.com/pytorch/pytorch/pull/153398 Workaround internal compiler error on MacOS-13 by providing boolean specialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/153727 Approved by: https://github.com/Skylion007	2025-05-16 19:32:17 +00:00
PyTorch MergeBot	3443627e07	Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473 )" This reverts commit `4f4ecc583e`. Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))	2025-05-16 08:29:26 +00:00
Aaron Gokaslan	4f4ecc583e	[BE]: Enable RUFF TRY400 rule - log.exception (#153473 ) Change logging.error to logging.exception to log additional information when relevant. A few places have slipped in logging.errors in try except since I last did a clean up here and the rule is stabilized so I am enabling it codebase wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed and tried to a fix a few errors based on whether we immediately reraised it or if we didn't print any exception handling where it could be useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473 Approved by: https://github.com/albanD, https://github.com/cyyever	2025-05-15 13:36:59 +00:00
Pat Vignola	6e107899da	[Torch] Fix crash when comparing fp8 tensors that have more than 1 dimension (#153508 ) Summary: `torch.nonzero` returns as many items as the number of dimensions, so we shouldn't expect a single element for the indices. Test Plan: CI Differential Revision: D74539233 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153508 Approved by: https://github.com/exclamaforte	2025-05-15 08:41:46 +00:00
Aby Mathew C	6a28cc826f	Add TEST_HPU flag to set device type (#153461 ) MOTIVATION This PR includes a minor change to check for TEST_HPU flag as well before falling back to CPU. Without this flag, some tests were falling back to CPU causing them to fail. Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66 CHANGES add TEST_HPU flag to some of the conditions checking the environment use DEVICE_COUNT variable instead of torch.accelerator.device_count() API since the later is not supported on out-of-tree devices like Intel Gaudi. @ankurneog , @EikanWang , @cyyever , @guangyey Pull Request resolved: https://github.com/pytorch/pytorch/pull/153461 Approved by: https://github.com/EikanWang, https://github.com/cyyever, https://github.com/albanD	2025-05-14 19:31:40 +00:00

... 2 3 4 5 6 ...

5938 Commits