Commit Graph

5708 Commits

Author SHA1 Message Date
Hari Krishna Sai Kodali
9d184bda2f add device generalization support for distributed tests (#156796)
MOTIVATION
To generalize Distributed test cases for non-CUDA devices

CHANGES

- test/distributed/checkpoint/test_fsspec.py
- test/distributed/checkpoint/test_state_dict.py
- test/distributed/test_multi_threaded_pg.py

Replaced hard coded device names with torch.accelerator.current_accelerator

- torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py

support for hccl backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156796
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-07-16 09:37:03 +00:00
Xilun Wu
add0b450bd [DTensor][BE] improve DTensor ops correctness check utils (#158112)
**Summary**
Implemented the test pattern described in https://github.com/pytorch/pytorch/pull/157991#discussion_r2196363170 as a util method in `DTensorTestBase`. The difference to `DTensorTestBase._test_op` is:
1. allowing users to specify the `Partial` placement.
2. supporting tree-like output structure.

**Test**
so far only adopt `DTensorTestBase._test_op_on_dtensor` in `DistTensorOpsTest.test_split_on_partial`.
`pytest test/distributed/tensor/test_tensor_ops.py -s -k test_split_on_partial`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158112
Approved by: https://github.com/Skylion007, https://github.com/zpcore
ghstack dependencies: #158051
2025-07-15 04:50:34 +00:00
Nikita Shulga
9ca080db87 [MPS] Extend atomic operations to all int types (#158179)
That fixes `index_put(..., accumulate=True)` for all dtypes

int64 operation is not really atomic, but eventually consistent from the `index_put_accumulate` kernel point of view: i.e. by the end of the operation results in the global memory are indeed accumulation of the operands at given indices
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179
Approved by: https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #158064, #158178
2025-07-14 04:25:05 +00:00
Howard Huang
f4406689b8 fix MPCT destroy_pg call (#157952)
I was seeing hangs / exceptions not raising in some cases. Only call `c10d.destroy_process_group()` for `MultiProcessContinuousTest` in the clean exit case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157952
Approved by: https://github.com/fduwjj
ghstack dependencies: #157589
2025-07-12 00:46:19 +00:00
Nikita Shulga
beed033b6e [MPS] Fix index_kernel for large tensors (#158064)
Move `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract `iter_tensor_offset` method, that returns an offset from the start of the storage associated with given tensor inside the iterator

Migrated `index`, `index_put[_accumulate][_serial]` to the new paradigm that does not require additional tensor for indices nor special handling for 32 vs 64-bit offset, which resulted in almost 2x perf gain for 2000x2000 tensor, see results below before
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                                |  11x50x50  |  11x100x100  |  11x500x500  |  11x1000x1000  |  11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
      __getitem__ (torch.int8, torch.int64)     |   383.5    |    379.8     |    470.9     |     1232.9     |     4410.3
      __getitem__ (torch.float16, torch.int64)  |   379.6    |    354.5     |    533.2     |     1290.3     |     4442.2
      __getitem__ (torch.float32, torch.int64)  |   360.8    |    338.6     |    478.6     |     1348.9     |     4870.4

Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                                |  11x50x50  |  11x100x100  |  11x500x500  |  11x1000x1000  |  11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
      __getitem__ (torch.int8, torch.int64)     |   349.8    |    330.5     |    432.6     |     764.5      |     1961.2
      __getitem__ (torch.float16, torch.int64)  |   342.5    |    330.7     |    434.7     |     741.0      |     1969.4
      __getitem__ (torch.float32, torch.int64)  |   332.2    |    326.1     |    445.4     |     751.3      |     1972.6

Times are in microseconds (us).
```

While migrating also fixed index_put_accumulate for boolean types, by using compare_and_exchange trick over uint

Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
2025-07-11 22:35:44 +00:00
Daisy Deng
8088958793 port 4 dynamo test files to Intel GPU (#157779)
For https://github.com/pytorch/pytorch/issues/114850, we will port test cases to Intel GPU. Six dynamo test files were ported in PR [#156056](https://github.com/pytorch/pytorch/pull/156056) and [#156575](https://github.com/pytorch/pytorch/pull/156575.) In this PR we will port 4 more dynamo test files.
We could enable Intel GPU with following methods and try the best to keep the original code styles:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- added XPU support in decorators like @requires_gpu
- enabled XPU for some test path
- added xfailIfXPU to skip xpu test when there is a bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157779
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-07-11 10:11:49 +00:00
Dmitry Rogozhkin
cd80f9a4c3 xpu: support custom ops with torch.library on xpu backend (#152879)
Fixes: https://github.com/intel/torch-xpu-ops/issues/1626

This PR started enabling of tests for `torch.library`, but more work is needed. Tests are using `torch._custom_ops` deprecated API planned for removal at pytorch 2.6 (not done). I think cleanup of pytorch would be nice before enabling more tests for xpu.
a2ccda3c60/torch/_custom_op/impl.py (L47)

CC: @EikanWang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152879
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/guangyey, https://github.com/albanD
2025-07-11 07:36:04 +00:00
Howard Huang
0f31445139 Add stack trace of exception to MultiProcContinousTest (#157589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157589
Approved by: https://github.com/Skylion007
2025-07-08 17:54:35 +00:00
Kurt Mohler
510c398a4f Add max_pool3d backward pass for MPS (#157498)
Note on backward precision over fp16:

A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024)  \exp2(e)$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))

Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.175 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024)  \exp2(-3) \approx 0.002$.

The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024)  \exp2(-2) \approx 0.002$. Same for $e_b=-1$.

So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9  * 0.002$.

That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
2025-07-07 19:46:44 +00:00
Nikita Shulga
a952956d05 Add isnan exit condition to special ops (#157464)
They might have been slow on CUDA-11.3, but this version of CUDA is long gone. More fundamental underlying issue were linear complexity of the recursive polynomial definitions for higher order polynomials, for example see this loop from implementation of Chebyshev polynomial of the first kind
7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)
which were tested by `test_compare_cpu` using following values (as sample index 16)
7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)

Luckily chebyshev polynomials for absolute values higher than 1 pretty quickly reach infinity, see below
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
Which is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7

Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s

OK (skipped=72, expected failures=8)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157488
2025-07-05 04:19:50 +00:00
Teja
dd3e7170c2 Add async checkpointing impl to experimental checkpointer and add a builder API (#156927)
1. Adds an AsyncCheckpointer with out-of-process checkpointing and state_dict_stager with shared memory, pinned memory and Zero Overhead Support.

2. Adds two conveinient functions to create sync/async checkpointers

Differential Revision: [D77336833](https://our.internmc.facebook.com/intern/diff/D77336833/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156927
Approved by: https://github.com/pradeepfn
2025-07-03 22:49:20 +00:00
Manuel Candales
d56f11a1f2 [MPS] Implement logcumsumexp metal kernel (#156858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156858
Approved by: https://github.com/malfet
ghstack dependencies: #157512
2025-07-03 18:16:25 +00:00
Manuel Candales
794b95d54b Enable Half dtype for logcumsumexp_backward (#157512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157512
Approved by: https://github.com/malfet
2025-07-03 18:13:38 +00:00
PyTorch MergeBot
c9174a20f7 Revert "[BE] Unskip special ops (#157464)"
This reverts commit e124a0d88c.

Reverted https://github.com/pytorch/pytorch/pull/157464 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](e124a0d88c) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))
2025-07-03 15:24:15 +00:00
PyTorch MergeBot
b6276a425f Revert "[MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)"
This reverts commit 9620994067.

Reverted https://github.com/pytorch/pytorch/pull/157488 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](e124a0d88c) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))
2025-07-03 15:24:15 +00:00
Nikita Shulga
9620994067 [MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)
For eager and inductor

As for all other chebyshev ops, logic is simply compiled from 94716db222/aten/src/ATen/native/cuda/Math.cuh (L2821)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157488
Approved by: https://github.com/dcci
ghstack dependencies: #157464
2025-07-02 23:29:35 +00:00
Nikita Shulga
e124a0d88c [BE] Unskip special ops (#157464)
They were slow on CUDA-11.3, which has long been gone, let's see if they work now

Before
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................ssssssss................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 42.379s

OK (skipped=80)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007
2025-07-02 23:16:52 +00:00
Nikita Shulga
5e636d664a [BE] @serialTest decorator must be called (#157388)
Otherwise it turns test into a trivial one(that always succeeds), as following example demonstrates
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase

class MegaTest(TestCase):
    @serialTest
    def test_foo(self):
        if hasattr(self.test_foo, "pytestmark"):
            print("foo has attr and it is", self.test_foo.pytestmark)
        print("foo")

    @serialTest()
    def test_bar(self):
        if hasattr(self.test_bar, "pytestmark"):
            print("bar has attr and it is", self.test_bar.pytestmark)
        print("bar")

if __name__ == "__main__":
    run_tests()
```

That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.013s

```

Added assert that arg is boolean in the decorator to prevent such silent skips in the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
2025-07-02 19:15:19 +00:00
David Berard
82eefaedd9 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-02 14:02:01 +00:00
PyTorch MergeBot
c553c55be7 Revert "Fix full_like decomposition to preserve strides (#144765)"
This reverts commit 01b0f09931.

Reverted https://github.com/pytorch/pytorch/pull/144765 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal tests see [D77652778](https://www.internalfb.com/diff/D77652778), @jansel may you help get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/144765#issuecomment-3027975098))
2025-07-02 13:56:03 +00:00
haozhe.zhu
5a2db5152d allow to use bf16 as fp32 internal precision for mkldnn conv (#126050)
Allow to use `BF16` as the internal computation data types by `torch.backends.mkldnn.conv.fp32_precision="bf16"`

### TestPlan
python test/test_mkldnn.py -k conv

### Benchmarking

FP32 conv2d vs. BF16 internal computation conv2d on SPR

Single core:

Input | fp32 ms | bf16 internal  ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 185.5071 | 83.4749 | 2.22
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 194.7558 | 79.1683| 2.46
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 1.9213 | 1.3690 | 1.40

56 cores:
Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 6.5804  | 7.4349 | 0.89
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 4.9940  | 3.8093 | 1.31
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 8.8359 | 5.5802 | 1.58
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 16.5800 | 9.2367 | 1.80
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 79.5436 | 38.3861  | 2.07

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126050
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:31:23 +00:00
PyTorch MergeBot
ab6cb34480 Revert "[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)"
This reverts commit 563fd95563.

Reverted https://github.com/pytorch/pytorch/pull/157322 on behalf of https://github.com/davidberard98 due to fails on rocm ([comment](https://github.com/pytorch/pytorch/pull/157322#issuecomment-3025826951))
2025-07-01 23:21:37 +00:00
David Berard
563fd95563 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-01 22:51:11 +00:00
Isuru Fernando
01b0f09931 Fix full_like decomposition to preserve strides (#144765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144765
Approved by: https://github.com/amjames, https://github.com/jansel
2025-07-01 19:13:22 +00:00
Patryk Ozga
e959dd017d [TSAN][live speech translation] Fix A data race in caffe2 (#156378)
Summary: noticed that context quantized_engine is accessed and written from multiple threads

Test Plan:
➜  fbsource buck test --flagfile fbcode/mode/dev-tsan //xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests -- --exact 'fbsource//xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests - Translate/LiveSpeechTranslationTests.LiveSpeechTranslationEnFr/silence___fr_en'

Rollback Plan:

Differential Revision: D76921416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156378
Approved by: https://github.com/jerryzh168, https://github.com/cyyever
2025-06-29 07:23:20 +00:00
Nikita Shulga
a1e4f1f98a [MPS] Reimplement tri[ul] as Metal shaders (#157179)
And add in-place flavor, as it is currently broken for non-contig tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157179
Approved by: https://github.com/dcci
2025-06-28 01:33:18 +00:00
Ryan Guo
a4b59498c5 Fix fake kernel for the out=... variant of unbind_copy (#156643)
`unbind_copy(..., out=...)` returns None rather than the `out` argument
(see https://github.com/pytorch/pytorch/issues/130829#issuecomment-2283936222),
but the old fake kernel didn't account for that and caused an assertion
failure in `pushPyOutToStack`. This patch fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156643
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/bdhirsh
ghstack dependencies: #156642
2025-06-27 01:34:07 +00:00
Kurt Mohler
e0447bb5f8 Add max_pool3d for MPS (#156467)
Fixes #100674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156467
Approved by: https://github.com/malfet
2025-06-26 23:33:50 +00:00
haozhe.zhu
53e0b9c393 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop the "highest, high, medium" to represent fp32  internal computation data types . Instead, we will directly use the algorithm to represent it.

### Design Choice: Directly use algorithms name like "TF32", "BF16".
#### Pros
 - The names are more informative. 'tf32' is more informative than a simple "high".
 - Easier to extend new algorithm like `tf32x3`
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### We provide 3 fp32 compute precision can be set:
 - **"ieee"**: Not allowed to use any other internal computation data types .
 - **"tf32"**: Allowed to use tf32 as internal computation data types.
 - **"bf16"**: Allowed to use bf16 as internal computation data types.
 - **"none"**:  Precision's are not set. Can be override by its father node.

### Overriding Precision Settings
Child node can be override by its father node if it is set to default.
For current default settings:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = matmul, op = all, precision setting = none
        backend = matmul, op = conv, precision setting = none
        backend = matmul, op = rnn, precision setting = none
        backend = matmul, op = matmul, precision setting = none
```
 - If the user set `torch.backends.mkldnn.fp32_precision="bf16"`, his child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be override to "bf16".
 - If the user set `torch.backends.fp32_precision="bf16"`,  `torch.backends.mkldnn.fp32_precision` and his child nodes will also we override to "bf16".

### Backward Compatible
Since new API allow user to have more fine-grained control. There will be some conflict. For example, previous `torch.backends.cudnn.allow_tf32` are not enough to represent the status for `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goal for backward compatible is
 - If the user only uses previous APIs, it will work as previous expectations.
 - If the user use **new** API to change the status to an **un-representable** status for old API, and try to access the status by **old** API. We will raise Runtime Error and point the document for user.

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-06-26 10:32:20 +00:00
fduwjj
4585c33e74 [symm_mem] Fix nccl test for symm mem (#156752)
Try not to call set_device to Fixes #156569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156752
Approved by: https://github.com/kwen2501
2025-06-26 02:59:38 +00:00
Xinya Zhang
d9577df312 [ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-25 07:09:03 +00:00
Manuel Candales
2d7e6c6241 [MPS] Revert cumsum/cumprod to MPSGraph implementation (#156708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156708
Approved by: https://github.com/malfet
2025-06-24 18:12:18 +00:00
Dmitry Nikolaev
6e17315cd3 Skip FSDP tests if device count is less then requested world_size value (#155836)
Usually `world_size=torch.cuda.device_count()` for FSDPTest-based tests
But distributed test class `TestFullyShardAllGatherExtensionsMultiProcess` [forces to use `world_size=2`](0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L170)) even for 1 GPU.

Then NCCL fails with errors:
```
HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity
...
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device c000
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device c000
```
The test method [has `@skip_if_lt_x_gpu(2)` decorator](0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L209)), but test fails during test class initialization before decorator activation

This PR will skip FSDPtest-based tests if `world_size > torch.cuda.device_count()`
```
HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity
...
dist init r=0, world=2
dist init r=1, world=2
SKIPPED [15.5507s] (Need at least 2 CUDA devices)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155836
Approved by: https://github.com/jeffdaily
2025-06-24 16:38:23 +00:00
Aby Mathew C
6835ba1b34 Register hpu device to fake backend (#156076)
## MOTIVATION

This PR intends to add hpu ( Intel Gaudi) also to the list of devices that will be supported by the "fake" distributed backend and the process group that will be created.

## CHANGES
- Add "hpu" to the list of devices

@ankurneog, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156076
Approved by: https://github.com/d4l3k, https://github.com/EikanWang, https://github.com/albanD
2025-06-23 16:08:08 +00:00
Xuehai Pan
cec2977ed2 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-23 02:57:34 +00:00
PyTorch MergeBot
3f44fdc03d Revert "[BE][6/16] fix typos in torch/ (#156316)"
This reverts commit b210cf1ea5.

Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
Sidharth
aeaf6b59e2 [dynamo] Weblink generation when unimplemented_v2() is called (#156033)
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
2025-06-22 11:39:31 +00:00
Xuehai Pan
b210cf1ea5 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-22 08:43:33 +00:00
Simon Fan
f1968a5e76 [ca] skip on some PYTORCH_TEST_WITH_DYNAMO=1 autograd tests (#156374)
These aren't supported. Not sure how they passed CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156374
Approved by: https://github.com/jansel
2025-06-21 18:33:38 +00:00
atalman
a47ca4fc74 Revert "[dynamo] Weblink generation when unimplemented_v2() is called (#156033)" (#156546)
Broke multiple CI jobs: dynamo/test_reorder_logs.py::ReorderLogsTests::test_constant_mutation [GH job link](https://github.com/pytorch/pytorch/actions/runs/15792695433/job/44521220864) [HUD commit link](9de23d0c29)

This reverts commit 9de23d0c29.

PyTorch bot revert failed: https://github.com/pytorch/pytorch/pull/156033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156546
Approved by: https://github.com/jansel
2025-06-21 14:10:12 +00:00
Sidharth
9de23d0c29 [dynamo] Weblink generation when unimplemented_v2() is called (#156033)
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
2025-06-21 05:47:54 +00:00
PyTorch MergeBot
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef6.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
Hari Krishna Sai Kodali
e1f28fe17b add device generalisation support for distributed tests (#152471)
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices

### CHANGES

- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py

Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.

- torch/testing/_internal/common_distributed.py

extended common utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
2025-06-20 07:35:42 +00:00
Nikita Shulga
4cbbc8b458 [MPS] Implement backward pass for interpolate_trilinear (#156373)
Backwards pass simply iterates over all 8 points current point contributed to, and back propagates them with the respective weights

TODO: Benchmark the performance of similar loop for the forward pas (i.e. compiler should be able to do loop unrolling, so no point of unrolling it by hand)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
2025-06-20 05:41:24 +00:00
Xinya Zhang
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
Nikita Shulga
c2f4cc59a7 [MPS] Fix bug in 3d coords calculation (#156375)
Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`

Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
2025-06-19 19:56:15 +00:00
Nikita Shulga
36f7a027b5 [MPS] Implement upsample_trilinear as Metal shader (#156263)
But only forward for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156263
Approved by: https://github.com/dcci
ghstack dependencies: #156256, #156090
2025-06-18 16:10:02 +00:00
Nikita Shulga
c28e74e457 [MPS] Add nearest_3d forward and backward (#156090)
Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS
Delete `upsample_nearest3d` MPS fallback and replace it with proper shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090
Approved by: https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #156256
2025-06-18 04:48:15 +00:00
Daniel Galvez
9ed0060225 Provide access to the cudaGraph_t underlying a CUDAGraph. (#155164)
There are a few considerations here:

1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404

2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.

3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.

4. There are two foot guns:

   a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).

   b. The following seuquence won't work as intended:

```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```

This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.

I think these two foot guns are probably okay given that this a bit of an experts' API.

Fixes #155106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
2025-06-18 03:39:28 +00:00
Manuel Candales
a4ea242edc [MPS] Implement scan metal kernels (#156100)
Implements metal kernels for scan operations:
- Migrates cumsum and cumprod from MPSGraph implementation to Metal.
- Fixes #154881
- Adds MPS backend support for cummin and cummax

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156100
Approved by: https://github.com/malfet
2025-06-17 17:44:22 +00:00