Commit Graph

680 Commits

Xuehai Pan
7ae204c3b6 [BE][CI][Easy] Run lintrunner on generated .pyi stub files (#150732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150732
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/aorenste
2025-05-27 14:58:02 +00:00
Nikita Shulga
975bbc63db [MPS][BE] Move fmod/remainder to Metal ops (#154280)
This accomplishes the following:
 - Fixes a correctness problem with large integer types (this probably makes the op slower, but that cannot be avoided if one wants an accurate answer; see the sketch below)
 - Makes the op faster for floating-point types (as a Metal kernel invocation is faster than creating an MPSGraph)
 - Eliminates the need for several correctness workarounds
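
A minimal sanity check for large-int64 `fmod` against the CPU reference (values here are illustrative, not the exact repro from #154171):
```python
import torch

# Large int64 values lose precision if the op is computed via float32,
# so compare the MPS result against the CPU reference.
x = torch.tensor([9_223_372_036_854_775, 123_456_789_012_345], dtype=torch.int64)
y = torch.tensor([1_000_003, 999_983], dtype=torch.int64)
print(torch.fmod(x.to("mps"), y.to("mps")).cpu())
print(torch.fmod(x, y))  # CPU reference
```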

Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
2025-05-24 01:45:33 +00:00
Nikita Shulga
633ed01145 [MPS] Add support for two more isin variants (#154010)
`isin_Tensor_Scalar_out` is just a redispatch to eq/neq.
`isin_Scalar_Tensor_out` redispatches back to the generic `isin` op, but needs a small tweak to handle float scalars.
Make sure that `out` is resized to the expected size in `isin_Tensor_Tensor_out_mps`.

Add unit tests to validate that, but skip them on MacOS-13, where the MPS op just returns garbage.

Before this change, both of these calls failed:
```python
>>> import torch
>>> t = torch.tensor([0, 1, 2], device='mps')
>>> torch.isin(t, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Tensor_Scalar_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
>>> torch.isin(1, t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Scalar_Tensor_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
```
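
For reference, once implemented both variants follow standard `isin` semantics (output below is illustrative):
```python
>>> torch.isin(t, 1)   # Tensor_Scalar: equivalent to t == 1
tensor([False,  True, False], device='mps:0')
>>> torch.isin(1, t)   # Scalar_Tensor: redispatches to the generic isin
tensor(True, device='mps:0')
```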

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154010
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/manuelcandales
ghstack dependencies: #153970, #153971, #153997
2025-05-22 17:59:35 +00:00
Nikita Shulga
d5ddc5ab20 [MPS] Fix float64 scalar tensor handling (#153582)
The current implementation causes a silent correctness problem with torch.compile when someone tries to `torch.compile` a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor.
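
A minimal sketch of the failure mode (hypothetical repro; the function and argument names are illustrative, and the actual regression test may differ):
```python
import numpy as np
import torch

@torch.compile
def f(x, s):
    return x * s  # s arrives as a float64 scalar tensor

x = torch.rand(4, device="mps")
print(f(x, np.exp(0.3)))  # np.exp(.3) is a numpy float64 scalar
```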

Add a regression test for this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582
Approved by: https://github.com/dcci
2025-05-15 05:15:14 +00:00
Nikita Shulga
8749fe8439 [CI][MPS] Speedup test_large_bmm (#153562)
By computing the matmul of only one random non-zero batch on CPU

This reduces the test runtime from 11 minutes to 14 sec per test:
```
 % python3 test/test_mps.py -v -k test_large_bmm_
test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok
test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok

----------------------------------------------------------------------
Ran 2 tests in 27.495s

```

TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed
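
A sketch of the single-batch verification idea (shapes here are illustrative, far smaller than in the actual test):
```python
import torch

B, M, K, N = 16, 256, 256, 256
a = torch.rand(B, M, K, device="mps")
b = torch.rand(B, K, N, device="mps")
out = torch.bmm(a, b)
# Verify a single randomly chosen batch against a CPU reference
# instead of comparing all B matmuls.
i = torch.randint(0, B, (1,)).item()
ref = a[i].cpu() @ b[i].cpu()
torch.testing.assert_close(out[i].cpu(), ref, rtol=1e-4, atol=1e-4)
```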
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562
Approved by: https://github.com/Skylion007, https://github.com/clee2000
2025-05-14 18:49:42 +00:00
Isalia20
56492bfcb9 [MPS] SDPA specialized kernels (#152781)
Partially fixes #139668 and #152550

Still a work in progress. The following needs to be addressed:
- [x] Some tests are failing; need to check why and bugfix
- [x] Benchmark the new kernels and add results to this PR for varying sequence lengths and head dimensions (the ones that get dispatched to the kernels)
- [x] Add tests to cover the specialized paths (if applicable)
- [x] Code cleanup

**Tested on a MacBook M1 Pro**
### Vector Fast Path (q_len=1, k_len=256)
- Old: 0.378 ms
- New: 0.260 ms
- **31.2% speed improvement**

### Vector 2-pass (q_len=1, k_len=4096)
- Old: 0.627 ms
- New: 0.370 ms
- **41.0% speed improvement**

### Vector Fast Path (q_len=8, k_len=256)
- Old: 0.545 ms
- New: 0.322 ms
- **40.9% speed improvement**

### Vector 2-pass (q_len=8, k_len=4096)
- Old: 1.318 ms
- New: 1.057 ms
- **19.8% speed improvement**

Script to get perf:
```python
import torch
import time

def benchmark_sdpa(config, iterations=100):
    device = config.get("device", "cpu")
    batch = config["batch"]
    heads = config["heads"]
    q_len = config["q_len"]
    k_len = config["k_len"]
    head_dim = config["head_dim"]

    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=torch.float32)
    k = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)
    v = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)

    for _ in range(5):
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()

    total_time = 0.0
    for i in range(iterations):
        start = time.perf_counter()
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()
        end = time.perf_counter()
        total_time += end - start

    avg_time = total_time / iterations
    print(f"[{config['name']}] Avg time per run: {avg_time * 1000:.3f} ms over {iterations} iterations")
    return avg_time

def main():
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Running benchmarks on device: {device}")

    benchmarks = [
        {
            "name": "Vector Fast - Small q_len & moderate k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length triggers vector fast path
            "k_len": 256,    # moderate key length
            "head_dim": 64,
            "device": device,
        },
        {
            "name": "Vector 2-pass - Small q_len & long k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length
            "k_len": 4096,   # long key length triggers the 2-pass variant
            "head_dim": 64,
            "device": device,
        },
        # {
        #     "name": "Full Attention - Moderate q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # longer query sequence length
        #     "k_len": 8192,    # matching key length for full attention paths
        #     "head_dim": 64,
        #     "device": device,
        # },
        # {
        #     "name": "Full Attention - Longer q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # very long sequence length
        #     "k_len": 8192,
        #     "head_dim": 64,
        #     "device": device,
        # },
    ]

    iterations = 100
    for config in benchmarks:
        benchmark_sdpa(config, iterations=iterations)

if __name__ == "__main__":
    main()

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152781
Approved by: https://github.com/malfet
2025-05-07 00:40:11 +00:00
Nikita Shulga
0ffd31dc8a [MPS] Migrate div rounding modes (#152758)
By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align the following output with CPU (if the division were performed in fp32, the result would be truncated to 25):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24.,  3.]])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
2025-05-05 03:02:29 +00:00
Isalia20
99c42722f6 [MPS] fix memory leak in sdpa float32 (#152371)
Fixes #152344

The leak seems to be on the MPSGraph side: even though there is an identity tensor, it is no longer enough to bypass the SDPA sequence, which seems to leak memory.

Even adding 0.0f seems to be optimized away and still takes the SDPA sequence (that's the reason for adding 1e-20).
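
One way to observe the leak (a sketch; after the fix, allocated memory should plateau after warmup rather than grow monotonically):
```python
import torch

q = torch.rand(1, 8, 128, 64, device="mps")
for i in range(100):
    torch.nn.functional.scaled_dot_product_attention(q, q, q)
    torch.mps.synchronize()
    if i % 20 == 0:
        print(torch.mps.current_allocated_memory())
```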

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152371
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-29 04:51:10 +00:00
Isalia20
899eec665c [MPS] col2im kernel implementation (#152282)
Fixes #151820
Also requested in #141287

Mainly based on the CUDA kernel implementations.
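
From Python, col2im is reachable via `F.fold` (the inverse of `F.unfold`/im2col); a quick smoke test:
```python
import torch
import torch.nn.functional as F

# (N, C * prod(kernel_size), L) columns scattered back into a (4, 4) image;
# L = 9 sliding 2x2 blocks at stride 1.
cols = torch.rand(1, 3 * 2 * 2, 9, device="mps")
img = F.fold(cols, output_size=(4, 4), kernel_size=(2, 2))
print(img.shape)  # torch.Size([1, 3, 4, 4])
```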

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152282
Approved by: https://github.com/malfet
2025-04-28 03:48:41 +00:00
Nikita Shulga
3ef6d6924a [BE] Switch TestConsistency to MPS device (#147893)
Which will eventually allow moving more decorators away from `common_mps.py`.

Adjust tolerances accordingly. XFAIL a bunch of tests on MacOS-13, which is going to be deprecated anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147893
Approved by: https://github.com/atalman
ghstack dependencies: #152204
2025-04-26 01:19:21 +00:00
Isalia20
5e9bdc9b86 [MPS] layernorm forward kernel (#152010)
Implements the layernorm forward pass as a Metal kernel instead of MPSGraph ops. Speedups are indicated on the chart below:
![Figure_1](https://github.com/user-attachments/assets/27a4d2ef-b3e4-4650-9ce3-b939c080321e)

Script for generating timings; build torch with the old/new codebase, then run this with a different output file name (indicated at the end of the script):
```python
import csv
import time

import numpy as np

import torch
import torch.nn.functional as F

matrix_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
batch_sizes = [1]
elementwise_affine = [False, True]
num_runs = 50
warmup_runs = 3

def create_input_tensor(n, batch_size):
    torch.manual_seed(42)
    return torch.randn(batch_size, n, dtype=torch.float32)

def run_layer_norm(A, normalized_shape, weight=None, bias=None):
    torch.mps.synchronize()
    start = time.perf_counter()
    # Pass weight/bias so the elementwise_affine configurations actually differ
    out = F.layer_norm(A, normalized_shape, weight=weight, bias=bias)
    torch.mps.synchronize()
    end = time.perf_counter()
    return out, end - start

results = {"N": [], "elementwise_affine": [], "batch_size": [], "mean_time": [], "std_time": []}

for el_aff in elementwise_affine:
    for n in matrix_sizes:
        for batch_size in batch_sizes:
            print(f"\nBenchmarking LayerNorm for input size N={n}, batch_size={batch_size}, elementwise_affine={el_aff}")

            try:
                A_cpu = create_input_tensor(n, batch_size)
                A_mps = A_cpu.to("mps")

                normalized_shape = (n,)
                # Create affine params on MPS only when elementwise_affine is on
                weight = torch.ones(n, device="mps") if el_aff else None
                bias = torch.zeros(n, device="mps") if el_aff else None

                for _ in range(warmup_runs):
                    _, _ = run_layer_norm(A_mps, normalized_shape, weight, bias)

                times = []
                for _ in range(num_runs):
                    _, t = run_layer_norm(A_mps, normalized_shape, weight, bias)
                    times.append(t)

                mean_time = np.mean(times)
                std_time = np.std(times)

                results["N"].append(n)
                results["elementwise_affine"].append(el_aff)
                results["batch_size"].append(batch_size)
                results["mean_time"].append(mean_time)
                results["std_time"].append(std_time)

                print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

            except RuntimeError as e:
                print(f"Error for N={n}, batch_size={batch_size}: {e}")
                continue

with open("layernorm_benchmark_times_new.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["N", "elementwise_affine", "batch_size", "mean_time", "std_time"])
    for i in range(len(results["N"])):
        writer.writerow(
            [
                results["N"][i],
                results["elementwise_affine"][i],
                results["batch_size"][i],
                results["mean_time"][i],
                results["std_time"][i],
            ]
        )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152010
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 05:07:46 +00:00
Nikita Shulga
3aecf2dc52 [MPS] Extend index_put to half precision floats (#151869)
By reusing `c10/metal/atomic.h`.
This also fixes `GPUTests.test_index_put_fallback[12]_mps`, which is unrolled by inductor, so no dedicated atomic_add support is needed.
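
A quick example of what this enables (sketch):
```python
import torch

# accumulate=True requires atomic adds, now available for half floats
x = torch.zeros(8, device="mps", dtype=torch.float16)
idx = torch.tensor([0, 0, 3], device="mps")
vals = torch.ones(3, device="mps", dtype=torch.float16)
x.index_put_((idx,), vals, accumulate=True)
print(x)  # index 0 accumulates twice -> 2.0, index 3 -> 1.0
```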

TODOs:
 - Get rid of the indexing kernel and compute indices directly when the kernel is run
 - Simulate atomic_add for int64 types as a series of int32 atomic-add-and-fetch operations
 - Set up tolerances correctly to pass float16/bfloat16 tests (as CPU always takes the sequential strategy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-04-22 22:00:08 +00:00
Li-Huai (Allan) Lin
fbd29527d8 [MPS] Move ops modifiers to testing utils so other tests can reuse (#151781)
Test collection check:
```
python -m pytest test/test_mps.py --collect-only
```
Before:
```
6390 tests collected in 8.34s
```

After:
```
6390 tests collected in 7.71s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151781
Approved by: https://github.com/malfet
2025-04-22 19:19:52 +00:00
Nikita Shulga
f37e138bc4 [MPS] Enable log1p and sigmoid for int64 (#151791)
It works on MacOS-15, but likely will need a skip for MacOS-13
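
For example (assuming MacOS-15, per the note above):
```python
import torch

x = torch.arange(4, device="mps")  # int64
print(torch.log1p(x))              # integer input, floating-point output
print(torch.sigmoid(x))
```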

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151791
Approved by: https://github.com/Skylion007
ghstack dependencies: #151790
2025-04-21 18:30:04 +00:00
Davide Italiano
470132c6a1 [MPS] Add support for hermite_polynomial_he (inductor/eager). (#151754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151754
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-20 17:44:40 +00:00
Nikita Shulga
14293c2377 [MPS] Allow isin for mixed types (#151600)
To follow the pattern set by the CPU and CUDA impls: define a common dtype and optionally cast `elements` and `test_elements` to it if needed.

- Add a regression test, though skip it on MacOS-13, where `isin` seems to produce garbage even for matching dtypes:
```
>>> import torch
>>> x=torch.arange(4.0, device='mps')
>>> y=torch.arange(1.0, 3.0, device='mps')
>>> x, y, torch.isin(x, y), torch.isin(y, x)
(tensor([0., 1., 2., 3.], device='mps:0'), tensor([1., 2.], device='mps:0'), tensor([False,  True, False, False], device='mps:0'), tensor([False, False], device='mps:0'))
>>> torch.__version__
'2.6.0'
```
- Clean up the code a bit

Fixes https://github.com/pytorch/pytorch/issues/151443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151600
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/kulinseth
2025-04-18 12:30:32 +00:00
Nikita Shulga
1ffaa00ad7 [MPS] Migrate bitwise_not to unary operator (#151460)
This kills two birds with one stone:
 - Makes the implementations more standardized (and faster for strided inputs/outputs)
 - Fixes a bug in strided in-place bitwise_not

I.e. before this change
```python
import torch
x=torch.arange(32, device="mps")
x[::2].bitwise_not_()
print(x)
```
produced
```
tensor([ -1,  -2,  -3,  -4,  -5,  -6,  -7,  -8,  -9, -10, -11, -12, -13, -14,
        -15, -16,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
         28,  29,  30,  31], device='mps:0')
```
after this change, it generates the correct output
```
tensor([ -1,   1,  -3,   3,  -5,   5,  -7,   7,  -9,   9, -11,  11, -13,  13,
        -15,  15, -17,  17, -19,  19, -21,  21, -23,  23, -25,  25, -27,  27,
        -29,  29, -31,  31], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151460
Approved by: https://github.com/dcci, https://github.com/qqaatw, https://github.com/Skylion007
2025-04-16 21:34:45 +00:00
Nikita Shulga
b8a2824755 [MPS] Fix logit output for half/bfloat (#151282)
Which also fixes the MPSInductor pointwise test.
TODO (as follow-up PRs): get rid of the special native_functions.yaml dispatches and use a stub.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151282
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272
2025-04-15 06:25:00 +00:00
Li-Huai (Allan) Lin
ddfc14b3ae [MPS] Fix where (#151176)
Fixes #150967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151176
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-04-13 20:44:50 +00:00
Nikita Shulga
bc47d539fc [MPS] Support ArgumentBuffer bindings from C++/Python (#150780)
To work around the limitation of 32 arguments per kernel and to eventually be able to compile something like
```python
import torch

def foo(*args):
  rc = torch.zeros_like(args[0])  # zeros (not empty) so the accumulation below is well-defined
  for arg in args:
      rc += arg
  return rc

tensors = torch.rand(100, 32, device='mps').unbind(0)
print(torch.compile(foo)(*tensors))
```

For now, introduce `at::native::metal::get_tensor_gpu_address` and use it from both the C++ test and compile_shader to convert a list of tensors into a list of pointers valid on the GPU.

Initially this binding was done via `id<MTLArgumentEncoder>`, but according to the [Improving CPU Performance by Using Argument Buffers](https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers?language=objc#Encode-Resources-into-Argument-Buffers) article, this is not necessary when targeting Tier 2-only devices (which is true of all devices on MacOS-13 or newer):
> To directly encode the argument buffer resources on these Tier 2 devices, write the [MTLBuffer](https://developer.apple.com/documentation/metal/mtlbuffer?language=objc).[gpuAddress](https://developer.apple.com/documentation/metal/mtlbuffer/gpuaddress?language=objc) property — and for other resource types (samplers, textures, and acceleration structures), the [gpuResourceID](https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/gpuresourceid?language=objc) property — into the corresponding structure member. To encode offsets, treat these property values as uint64 types and add the offset to them.

Add both C++ and Python unit tests that validate that this works.
Please note that using either the ArgumentEncoder or directly encoding the data does not guarantee that a buffer will not be freed before shader execution is complete. On the other hand, this should already be guaranteed by the MPSCachingAllocator, which only frees memory after all streams have completed execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150780
Approved by: https://github.com/dcci
2025-04-09 04:24:37 +00:00
Isalia20
49f6cce736 [MPS] grad scaler (#150255)
Fixes #142397

Basic implementation is done. What's left:
- [x] Different dtype/device tensors in the TensorList
- [x] fast path for grouping the foreach kernel
- [x] Tests
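
For context, a sketch of typical GradScaler usage on MPS (model, optimizer, and dtype choices here are illustrative):
```python
import torch

model = torch.nn.Linear(8, 8).to("mps")
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler("mps")
x = torch.rand(4, 8, device="mps")
with torch.autocast("mps", dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()  # scale loss to avoid fp16 grad underflow
scaler.step(opt)               # unscale grads, skip step if inf/nan
scaler.update()
```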

Regarding tests, I found some tests in `test/test_torch.py` for GradScaler, but I couldn't figure out the best way to enable them for the MPS device.

By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put:
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`

This enables lots of tests in that class for MPS which were not(?) being tested before? This part needs some clarification

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150255
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-06 17:06:55 +00:00
Isalia20
cfea55dbec [MPS] fix inverse bug for N>1024 (#146754)
Fixes #138200
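
A quick check for the fixed size range (a sketch; the diagonally dominant matrix keeps the inverse well-conditioned):
```python
import torch

n = 2048  # N > 1024 previously produced wrong results
a = torch.rand(n, n, device="mps") + n * torch.eye(n, device="mps")
inv = torch.linalg.inv(a)
print(torch.dist(inv @ a, torch.eye(n, device="mps")))  # should be ~0
```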

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146754
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-05 21:49:21 +00:00
Nikita Shulga
7ac8186851 [MPSInductor] Speedup sum/prod reductions (#150566)
By using cooperative `simd_sum`/`simd_product` instead of a C-style for loop for threadgroup reductions. This also allows significantly reducing the amount of shared memory needed to perform those reductions.

Using such reductions increases `torch.compile` performance for gpt-fast using `stories110M` from 29 tokens/sec to 630 tokens/sec on M4 and changes the perf of torch.rand as follows:
|size| before | after |
|------------------------|------------|-------------|
| 512x512         | 202.1       | 131.8       |
| 1024x1024   |   780.6    | 176.9       |
| 2048x2048    |   1423.4       | 339.9      |
| 4096x4097    |    2982.2 | 1047.2      |

Unfortunately, none of the SIMD-group operations are available for 64-bit integers, but one can simulate the behavior using `simd_shuffle_down` of 64-bit values represented as `int2` types, which yields a reduction in $\log_2(threadgroup\_size)$ steps. [`mlx/kernels/reduction/ops.h`](86389bf970/mlx/backend/metal/kernels/reduction/ops.h (L15-L18)) contains an implementation of such an algorithm, but alas it yields wrong results on M1/M2 (and maybe M3) machines if not all threads in the simdgroup are active, which can be observed by running
```python
import torch
lib=torch.mps.compile_shader("""
kernel void do_sum(device int* out, constant int* in, uint idx [[thread_position_in_grid]]) {
  out[idx] = metal::simd_shuffle_down(in[idx], 8);
}
""")
x=torch.arange(22, device='mps', dtype=torch.int32)
y=torch.empty_like(x)
lib.do_sum(y, x)
print(y)
```
which returns the following on M4
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,  0,  0,  0,  0, 0,  0,  0,  0], device='mps:0', dtype=torch.int32)
```
but the same kernel running on M1 returns
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 15, 16, 17, 18, 19, 20, 21], device='mps:0', dtype=torch.int32)
```
This discrepancy in behavior can be addressed by using `simd_shuffle_and_fill_down`, but any kernel using `simd_shuffle_and_fill_down` causes an internal compiler error on MacOS-13.2. Considering that this OS will be EOL soon, skip the offending tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150566
Approved by: https://github.com/manuelcandales
ghstack dependencies: #150452, #150457
2025-04-05 02:47:27 +00:00
Nikita Shulga
827b730f4e [CI] Skip test_copy_large_tensor on M2-15 runners (#150377)
They have more than 12GB of memory, but maybe running this test causes OOM in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150377
Approved by: https://github.com/atalman
2025-04-01 02:33:43 +00:00
Davide Italiano
b48505a8a1 [MPS] Add support for hermite_polynomial_h. (#150279)
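
Usage sketch ($H_3(x) = 8x^3 - 12x$):
```python
import torch

x = torch.linspace(-1.0, 1.0, 5, device="mps")
print(torch.special.hermite_polynomial_h(x, 3))
print(8 * x**3 - 12 * x)  # closed form for comparison
```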
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150279
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-03-31 23:30:19 +00:00
Nikita Shulga
7c65911b11 [MPS] Fix dot/mm for conj_tensors (#150157)
- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter  `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR)
- Preserve the conj property when gathering the views, which fixes the `cov` operator (see the example below)
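
For example, ops on lazily conjugated views like the following were affected (illustrative):
```python
import torch

x = torch.rand(4, dtype=torch.complex64, device="mps")
y = torch.rand(4, dtype=torch.complex64, device="mps")
# x.conj() is a lazy view with the conj bit set; it is now part of the op key
print(torch.dot(x.conj(), y))
print(torch.dot(x.cpu().conj(), y.cpu()))  # CPU reference
```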

Fixes https://github.com/pytorch/pytorch/issues/148156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci
2025-03-28 20:36:44 +00:00
Nikita Shulga
ef1cb6b646 [BE] Suppress user_warnings while running opinfo tests (#150115)
Some of the samples are constructed in a way that is expected to trigger those warnings, so there is no point in displaying them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150115
Approved by: https://github.com/dcci
ghstack dependencies: #150060
2025-03-27 22:36:27 +00:00
Nikita Shulga
6aca002d82 [MPS] Add chebyshev_polynomial_[uvw] (#150060)
For both eager and inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150060
Approved by: https://github.com/dcci, https://github.com/jansel
2025-03-26 23:35:05 +00:00
Nikita Shulga
de68ddc68e [MPS] Fix metal ops with different dtypes (#149974)
By implementing `_cast_` flavors of both dense and strided ops. Add regression tests that test `fmax`/`fmin` for mixed dtypes.

I've been dreading writing this PR for a while, as it ended up pretty bulky:
 - Adds `C10_METAL_ALL_TYPES_FUNCTOR` and `c10::metal::ScalarType` to `c10/metal/common.h` and tests that its values always match `c10::ScalarType`
 - Add `c10::metal::cast_to` to `c10/metal/utils.h`, which can be used to cast any scalar Metal dtype to any other one, including complex values
 - Implement `val_at_offs<T>(constant void *, long offs, ScalarType dtype)`, which is used to dynamically cast types
 - Add `binary_strided_cast` and `binary_dense_cast`, which are invoked based on the output dtype and cast both inputs to it before performing the op

Benchmarks collected on an M2 Pro running fmax on 1M-element tensors (times are in microseconds):

| op (input dtypes) | dense-dense | transp-transp | dense-transp | transp-dense | dense-scalar | dense-bcast |
|---|---|---|---|---|---|---|
| fmax (torch.float16, torch.float16) | 160.9 | 159.9 | 270.5 | 270.9 | 236.6 | 293.0 |
| fmax (torch.float32, torch.float32) | 176.9 | 171.0 | 273.7 | 293.5 | 242.6 | 294.2 |
| fmax (torch.float32, torch.float16) | 171.4 | 170.9 | 283.6 | 303.0 | 253.7 | 302.3 |
| add (torch.float16, torch.float16) | 218.0 | 223.6 | 221.0 | 222.0 | 214.9 | 218.3 |
| add (torch.float32, torch.float32) | 227.4 | 233.9 | 228.8 | 231.9 | 218.9 | 221.4 |
| add (torch.float32, torch.float16) | 226.1 | 227.5 | 227.5 | 226.9 | 177.0 | 190.8 |

TODOs:
 - Include input and output dtypes in the non-cast kernel name
 - Make TensorFactory.h use `C10_METAL_ALL_TYPES_FUNCTOR`
 - Extend mixed_dtypes testing via OpInfo

Fixes https://github.com/pytorch/pytorch/issues/149951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149974
Approved by: https://github.com/manuelcandales
2025-03-26 07:03:21 +00:00
Isalia20
ba46643df1 [MPS] tril op not handling infs correctly (#149866)
Fixes #149813
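
Minimal repro sketch:
```python
import torch

x = torch.full((3, 3), torch.inf, device="mps")
# The upper triangle must be zeroed exactly, even when inputs are inf
print(torch.tril(x))
```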

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149866
Approved by: https://github.com/malfet
2025-03-24 23:38:41 +00:00
Davide Italiano
9179178728 [MPS] Add support for chebyshev_polynomial_t in eager. (#149816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149816
Approved by: https://github.com/malfet
2025-03-24 19:19:55 +00:00
Isalia20
248487f455 [MPS] nanmedian with dims (#149680)
Third most voted op from #77764

The tests added in the previous PR were deleted because they are covered by the regular test_output_match tests; they were a stopgap until this dim version of nanmedian was implemented, and are now redundant.
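
Usage example (for an even count of non-NaN values, torch returns the lower of the two middle values):
```python
import torch

t = torch.tensor([[1.0, float("nan"), 3.0],
                  [2.0, 5.0, float("nan")]], device="mps")
values, indices = torch.nanmedian(t, dim=1)
print(values, indices)  # NaNs are ignored per row
```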

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149680
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-24 03:49:16 +00:00
Davide Italiano
b9a5e1d038 [MPS] Add support for scaled_modified_bessel_k1 to eager. (#149783)
Another day another op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149783
Approved by: https://github.com/malfet
2025-03-22 02:13:41 +00:00
Davide Italiano
bdc132d0e1 [MPS] Add support for scaled_modified_bessel_k0 for eager. (#149705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149705
Approved by: https://github.com/malfet
2025-03-21 16:14:29 +00:00
Davide Italiano
0ed34210b2 [MPS] Add support for modified_bessel_k1 to eager and inductor. (#149687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149687
Approved by: https://github.com/malfet
2025-03-21 04:59:06 +00:00
Isalia20
95e71765f2 [MPS] nanmedian implementation (#149407)
Implements nanmedian on MPS. This implementation only covers `torch.nanmedian(tensor)` without `keepdim` and `dim`.
nanmedian with `dim` and `keepdim` will be implemented in a follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149407
Approved by: https://github.com/malfet
2025-03-20 03:50:26 +00:00
Davide Italiano
88c2fe533f [MPS] Add modified_bessel_k0 support to eager. (#149563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149563
Approved by: https://github.com/malfet
2025-03-19 23:10:55 +00:00
Nikita Shulga
2e0c98ff05 [MPS] Add bicubic2d_aa (#149378)
Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287

Mostly done by refactoring `upsample_bilinear2d_aa` to accept a Functor as one of its template arguments, which closely follows ideas from eec43cfbc0/src/libImaging/Resample.c as well as
bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)

Populate unit tests by copying the upsample_bilinear2d_aa ones and reusing them for upsample_bicubic2d_aa.

At that point, the only difference between upsample_bilinear2d_aa and upsample_bicubic2d_aa is the convolution kernel function and its size: 3x3 for bilinear, 5x5 for bicubic.
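
The op is reachable from Python via `F.interpolate`'s antialias flag:
```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 64, 64, device="mps")
out = F.interpolate(x, size=(32, 32), mode="bicubic", antialias=True)
print(out.shape)  # torch.Size([1, 3, 32, 32])
```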
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378
Approved by: https://github.com/dcci
2025-03-18 05:35:41 +00:00
Davide Italiano
c43e35d6f7 [MPS] Implement support for modified_bessel_i1 in eager. (#149368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149368
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 03:29:10 +00:00
Davide Italiano
186cc7327c [MPS/BE] Remove decorator that skipped test on macOS 12. (#149365)
macOS 12 is not really supported anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149365
Approved by: https://github.com/malfet
2025-03-18 00:58:08 +00:00
Davide Italiano
9f33c6f0a0 [MPS] Add support for modified_bessel_i0 in eager. (#149264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149264
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-16 04:45:49 +00:00
Nikita Shulga
96795e9533 [BE] Parametrize TestMPS.test_binops_dtype_precedence (#149234)
No op changes; just splits a longer test into a series of smaller ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149234
Approved by: https://github.com/atalman, https://github.com/dcci
ghstack dependencies: #149216, #149233
2025-03-15 00:37:11 +00:00
Isalia20
dd6e9df3d0 [MPS] fix attention enable_gqa crash on mps (#149147)
Fixes #149132
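
Repro sketch (grouped-query attention: more query heads than key/value heads):
```python
import torch
import torch.nn.functional as F

q = torch.rand(1, 8, 16, 64, device="mps")  # 8 query heads
k = torch.rand(1, 2, 16, 64, device="mps")  # 2 kv heads (8 % 2 == 0)
v = torch.rand(1, 2, 16, 64, device="mps")
out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```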

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147
Approved by: https://github.com/malfet
2025-03-14 21:25:54 +00:00
Nikita Shulga
f2221b2fce [MPS] Add support for i1e (#149203)
Followup after https://github.com/pytorch/pytorch/pull/149174
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149203
Approved by: https://github.com/dcci
2025-03-14 17:33:52 +00:00
cyy
a9aae05a6b Remove test decorations on MacOS 12 (#148942)
MacOS 12 may have reached EOL, per https://endoflife.date/macos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148942
Approved by: https://github.com/malfet
2025-03-14 17:22:37 +00:00
Davide Italiano
706c22549c [MPS] Add support for i0e in eager. (#149174)
Add `special.i0e` to XFAIL_GRADLIST for now, as its backward op is not yet implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-14 14:43:46 +00:00
PyTorch MergeBot
be4e6c1c8e Revert "[MPS] Add support for i0e in eager. (#149174)"
This reverts commit b4745db904.

Reverted https://github.com/pytorch/pytorch/pull/149174 on behalf of https://github.com/malfet due to MPS are red on trunk ([comment](https://github.com/pytorch/pytorch/pull/149174#issuecomment-2723774600))
2025-03-14 06:35:01 +00:00
Nikita Shulga
db6d72213b [MPS] Add torch.special.bessel_[jy][01] implementations (#149123)
By copy-n-pasting functions from
f59064f2b7/aten/src/ATen/native/cuda/Math.cuh (L1463)

With an ugly workaround for `bessel_y[01]` to avoid an internal compiler exception on M1/M2 machines (see FB16863363 / https://gist.github.com/malfet/e7785e4b572e7740887a83a2386ef769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149123
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-03-14 05:13:55 +00:00
Davide Italiano
b4745db904 [MPS] Add support for i0e in eager. (#149174)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174
Approved by: https://github.com/malfet
2025-03-14 02:51:28 +00:00
Nikita Shulga
924a247fbb [MPS] Enable angle and atan2 for torch.long (#149017)
This check was added by https://github.com/pytorch/pytorch/pull/85817, which introduced no unit tests, and its content seems totally unrelated to the title/subject of that PR. Anyway, right now these ops seem to work fine on MacOS-13+.
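
For example (integer inputs promote to floating-point outputs):
```python
import torch

x = torch.arange(-2, 3, device="mps")      # int64
print(torch.atan2(x, torch.ones_like(x)))
print(torch.angle(x))                      # 0 for x >= 0, pi for x < 0
```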

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149017
Approved by: https://github.com/dcci
2025-03-12 04:48:52 +00:00