Commit Graph

680 Commits

Author SHA1 Message Date
Nikita Shulga
c18858d633 [MPS] Make torch.mps.compile_shader public (#148972)
It was a private method in 2.6, but nothing changed in its API for 2.7,
and it will likely remain the same in 2.8, so it's time to remove the
underscore from its name
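A minimal usage sketch of the now-public API (an editor example, assuming an MPS-capable machine; the kernel name `add_one` is hypothetical):
```python
import torch

# Compile a tiny Metal kernel and dispatch it over a tensor's elements.
lib = torch.mps.compile_shader("""
    kernel void add_one(device float* x, uint idx [[thread_position_in_grid]]) {
        x[idx] += 1.0;
    }
""")
x = torch.zeros(8, device="mps")
lib.add_one(x)  # each thread increments one element
```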


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148972
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/seemethere, https://github.com/albanD, https://github.com/dcci
2025-03-11 20:20:58 +00:00
Nikita Shulga
b95889042c [MPS] Introduce strides unary op (#148468)
By adding the following template
```metal
template <typename T, typename F>
kernel void unary_strided(
    device result_of<F, T>* output [[buffer(0)]],
    constant T* input [[buffer(1)]],
    constant long* sizes [[buffer(2)]],
    constant long* input_strides [[buffer(3)]],
    constant long* output_strides [[buffer(4)]],
    constant uint& ndim [[buffer(5)]],
    uint index [[thread_position_in_grid]]) {
  F f;
  int pos[max_ndim];
  pos_from_thread_index(int(index), pos, sizes, ndim);
  const auto input_offs = offset_from_coord(pos, input_strides, ndim);
  const auto output_offs = offset_from_coord(pos, output_strides, ndim);
  output[output_offs] = f(input[input_offs]);
}
```
and instantiating it for all existing unary shaders, which eliminates the need for any intermediate copies.
No extra testing is needed, as those cases are already covered by `test_output_grad_match_corrcoef_cpu_float32` as well as `test_unary_ops_storage_offset_strided`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148468
Approved by: https://github.com/dcci
2025-03-09 22:30:51 +00:00
Nikita Shulga
da923afdc7 [MPS][BE] Align bitshift behavior with CPU (#148719)
By casting the argument to output type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148719
Approved by: https://github.com/Skylion007
ghstack dependencies: #148685, #148686
2025-03-07 18:28:14 +00:00
Nikita Shulga
f84710aef4 [MPS] Fix scalar to tensors bitshifts (#148686)
By introducing the concept of a non-commutative binary op and renaming all op templates from `bitwise_foo_tensor` and `bitwise_foo_scalar` to `bitwise_foo_tensor_tensor` and `bitwise_foo_tensor_scalar`

Add regression tests

Please note that for some otherwise-undefined values, MPS and CPU behaviors differ; for example
```
>>> import torch
>>> 4095 >> torch.arange(12, device="mps", dtype=torch.uint8)
tensor([255, 255, 255, 255, 255, 127,  63,  31,  15,   7,   3,   1],
       device='mps:0', dtype=torch.uint8)
>>> 4095 >> torch.arange(12, device="cpu", dtype=torch.uint8)
tensor([255, 127,  63,  31,  15,   7,   3,   1,   0,   0,   0,   0],
       dtype=torch.uint8)
```
This is because on CPU the scalar is cast to the output dtype before the operation is performed, but on MPS this happens after the op is done

Fixes https://github.com/pytorch/pytorch/issues/147889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148686
Approved by: https://github.com/albanD
ghstack dependencies: #148685
2025-03-07 18:28:14 +00:00
Isalia20
02e1580e39 [MPS] fix crash for mse loss with 0 numel inputs (#148608)
Fixes #148589
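A hypothetical minimal repro (an editor sketch; assuming empty inputs are what triggered the crash in #148589):
```python
import torch
import torch.nn.functional as F

# Both inputs have zero elements; this used to crash on MPS.
x = torch.empty(0, device="mps")
y = torch.empty(0, device="mps")
print(F.mse_loss(x, y))  # mean reduction over zero elements yields nan
```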

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148608
Approved by: https://github.com/malfet
2025-03-06 03:32:34 +00:00
Nikita Shulga
864b75dd50 [MPS] Fix unary_kernel_strided logic (#148512)
Fixes bug introduced by https://github.com/pytorch/pytorch/pull/148350
Before this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[  0.0000,   1.4142,   2.0000,   2.4495],
        [ 80.0000,  82.0000,  84.0000,  86.0000],
        [ 96.0000,  98.0000, 100.0000, 102.0000],
        [112.0000, 114.0000, 116.0000, 118.0000]], device='mps:0')
```
After this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[0.0000, 1.4142, 2.0000, 2.4495],
        [4.0000, 4.2426, 4.4721, 4.6904],
        [5.6569, 5.8310, 6.0000, 6.1644],
        [6.9282, 7.0711, 7.2111, 7.3485]], device='mps:0')
```
One cannot avoid copies merely because both input and output tensors have the same strides; one also needs to make sure that they are dense in storage (a transposed tensor would be dense, but, say, one selecting every other column would not)

Add regression test to prevent those from happening again

Also, no need to check that sizes match; luckily, that is checked by the structured op (and `out` for unary ops does not support broadcasting, I just checked)

Revived the needs_copy logic, though it will become irrelevant once https://github.com/pytorch/pytorch/pull/148468 lands
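To illustrate the dense-in-storage distinction above, here is an editor sketch (the helper `is_dense_in_storage` is hypothetical, not part of the PR):
```python
import torch

def is_dense_in_storage(t: torch.Tensor) -> bool:
    # Dense in storage: the elements occupy one gap-free region of memory,
    # i.e. the strides sorted ascending match cumulative products of sizes.
    expected = 1
    for stride, size in sorted(zip(t.stride(), t.shape)):
        if size == 1:
            continue
        if stride != expected:
            return False
        expected *= size
    return True

x = torch.arange(64.0).reshape(8, 8)
print(is_dense_in_storage(x.t()))        # True: transposed, but no gaps
print(is_dense_in_storage(x[::2, ::2]))  # False: skipping rows/columns leaves gaps
```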

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148512
Approved by: https://github.com/janeyx99
2025-03-05 15:57:54 +00:00
Isalia20
0c0a4baddd [MPS] unary kernels - avoid copying tensors if they have same stride (#148350)
I was a bit concerned when I saw in #148272 that the Metal unary kernel had 0.02x the performance of what we had with MPS Graphs for sqrt on non-contiguous tensors. This change makes it so that copying is only done if the input and output tensors do not have the same strides. So if the out tensor is not provided, then we don't copy (don't call contiguous) at all and dispatch the kernel as is. After making this change, the script listed at the end of the above PR has the same execution time as the non-transposed one.

Times for reference (on a transposed NxN matrix):

| N     | time_old (s)       | time_new (s)       |
|-------|--------------------|--------------------|
| 100   | 0.0002241021       | 0.0001548659       |
| 1000  | 0.0005934822       | 0.0002150342       |
| 10000 | 0.3242016407       | 0.0045755033       |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148350
Approved by: https://github.com/janeyx99
2025-03-04 23:20:26 +00:00
Isalia20
439395c0ae [MPS] add slogdet and logdet implementations to mps (#148287)
Low-hanging fruit: all the ops these need are already implemented, so just adding them to native functions enables the functionality on MPS. The next op I add should probably be lu_solve, seeing how many ops need it for the grad calculation
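A minimal usage sketch:
```python
import torch

A = torch.rand(3, 3, device="mps") + 3 * torch.eye(3, device="mps")
sign, logabsdet = torch.linalg.slogdet(A)
print(sign, logabsdet)
print(torch.logdet(A))  # requires a positive determinant
```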

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148287
Approved by: https://github.com/malfet
2025-03-04 19:49:23 +00:00
Nikita Shulga
84502baaff [MPS] Fix sqrt and other for torch.chalf (#148285)
Those kernels, instead of being instantiated for half2 (which corresponds to ComplexHalf), were instantiated for short2, which resulted in the following test
```
% python3 -c "import torch; print(torch.rand(6, device='mps', dtype=torch.chalf).sqrt())"
```
failing with
```
RuntimeError: Failed to create function state object for: sqrt_complex_half_half
```

As sqrt for chalf is not implemented on CPU, add an explicit test to `test_sqrt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148285
Approved by: https://github.com/dcci
2025-03-03 16:03:54 +00:00
Isalia20
19de523de6 [MPS] metal unary kernel for sqrt (#148272)
Issue #148219 highlighted the high dispatch times of ops that ran via MPSGraph on smaller tensors. This PR rewrites sqrt as a Metal kernel to mitigate that issue

## Speedups:

Matrix size here means an NxN matrix.
![speedup_sqrt](https://github.com/user-attachments/assets/db0a705b-1a0e-42b4-bd42-4e7960415c81)

Code to generate the times (requires building torch before and after the change to collect the old and new timings):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [1, 100, 1000, 10_000]
num_runs = 1000
warmup_runs = 3

def run_sqrt(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = torch.sqrt(A)
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.rand((n, n), dtype=torch.float32, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_sqrt(A_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_sqrt(A_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('sqrt_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148272
Approved by: https://github.com/malfet
2025-03-02 00:45:45 +00:00
Nikita Shulga
3a0c9f7f9d [MPS] Fix SDPA crash (#148239)
If the operation is invoked with a mask twice, it will crash, as the mask expansion logic was implemented inside the cache creation block, which is executed only once for all shapes
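A hedged repro sketch (an editor example; assuming a second masked call is what re-triggered the expansion):
```python
import torch
import torch.nn.functional as F

def sdpa(seq_len):
    q = k = v = torch.rand(1, 4, seq_len, 16, device="mps")
    mask = torch.ones(seq_len, seq_len, device="mps").tril().bool()
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

sdpa(8)   # first invocation populates the graph cache
sdpa(16)  # second masked invocation previously crashed
```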

Fixes https://github.com/pytorch/pytorch/issues/148194 which is a regression introduced by https://github.com/pytorch/pytorch/pull/147545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148239
Approved by: https://github.com/dcci
2025-03-01 13:06:51 +00:00
Nikita Shulga
735d7b1af6 [EZ][BE] Increase tolerances for interpolate op (#148224)
Not sure why the tolerances were set like that; this logic was added in https://github.com/pytorch/pytorch/pull/104181 without much explanation
But if I were to make a guess, it's likely due to the inaccuracy of the bilinear op, which has since been replaced by a shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148224
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148154, #148187, #148211
2025-03-01 13:03:59 +00:00
Isalia20
08434df1f2 [MPS] fix empty place holder error for smooth l1 loss (#148133)
Fixes #123171

And parametrizes the tests for it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148133
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-01 02:32:45 +00:00
Nikita Shulga
e5e31050d3 [MPS] Implement linear1d as shader (#148154)
And get rid of the MPS call, as for some reason the implementation via the MPSGraph
API call is 100x+ slower than the Metal shader, at least according to
the following benchmark
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(3, 5, 65536, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="linear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS        |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16  ")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results after the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    2.5  |   2.1  |   2.2  | 161.4  | 115.0  | 161.1
```
And before the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  354.0  | 336.0  | 332.4  | 145.5  | 114.7  | 148.3
```

Fixes https://github.com/pytorch/pytorch/issues/144245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148154
Approved by: https://github.com/dcci
2025-02-28 16:47:42 +00:00
Davide Italiano
683e083e8d [MPS] Add support for entr() in eager. (#147948)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147948
Approved by: https://github.com/malfet
2025-02-26 19:55:02 +00:00
Nikita Shulga
00732c3f7e [MPS] Implemented masked_fill_scalar as shader (#147369)
- Move `pos_from_thread_index` and `offset_from_pos` from `UnfoldBackward.metal` into the `c10/metal/indexing.h` header
- The initial idea was to implement `StridedTensor` and `ConstStridedTensor` and use them to make the masked_fill kernel something as simple as the following loop
```metal
ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim);
if (mask[thread_index]) {
  StridedTensor<T> input(input_data, sizes, input_strides, ndim);
  input[thread_index] = val;
}
```
But though it looks elegant and works correctly, performance-wise it's much slower than the existing MPS shader (see table below), as int64 divisions on the M2 GPU are really slow

- Solved the performance issue by implementing 3 flavors of the same shader: `dense`, which is used when both input and mask are dense tensors of the same size; `broadcast`, which is used when the mask's leading dimensions are expandable into the input tensor; and `strided`, which is a general-purpose fallback, but still computes the position in the tensors only once. As a result, perf is even better than the existing MPS shader for dense and broadcastable tensors.

Performance measured on an M2 Pro across different iterations of the same shader

| dtype | MPS | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
| ------|------| -----|   ---- | --- | ---- |
| float32 | 2.8 msec  | 41.6 msec | 26.9 msec | 5 msec | 2.4 msec |
| float16 | 1.86 msec | 38.2 msec| 26.6 msec | 4.6 msec | 1.9 msec |
|bfloat16|1.86 msec |38.3 msec | 26.6 msec | 4.6 msec | 1.9 msec |

And benchmark script
```python
import torch

from timeit import default_timer
from torch.utils.benchmark import Measurement, Timer

def bench_mask_fill(
    n,
    dtype=torch.float32,
) -> Measurement:
    # Timer's default globals include torch, so setup can reference it directly.
    t = Timer(
        stmt="x.masked_fill(y, -17.0); torch.mps.synchronize()",
        setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_mask_fill(n, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.masked_fill_() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369
Approved by: https://github.com/dcci
ghstack dependencies: #147977
2025-02-26 18:39:15 +00:00
Nikita Shulga
9ed40af917 [BE][EZ] Delete MacOS-12.3 xfail list (#147905)
As PyTorch requires at least MacOS-13 (and Metal-3) to work, delete any pre-MacOS-13 checks from the test script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147905
Approved by: https://github.com/dcci
ghstack dependencies: #147892
2025-02-26 05:08:09 +00:00
Nikita Shulga
346bbefa63 [BE] Parameterize TestSDPA in test_mps.py (#147856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147856
Approved by: https://github.com/Skylion007
2025-02-25 16:07:24 +00:00
Isalia20
a695aae89b [MPS] fix attention for >4d tensors (#147545)
Fixes #147443

and adds tests for >4d tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147545
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:55:28 +00:00
Davide Italiano
4e934ee5a7 [MPS] Add eager support for xlog1py. (#147687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147687
Approved by: https://github.com/malfet
2025-02-24 01:23:59 +00:00
Nikita Shulga
f03e7f3801 [MPS] Workaround rng bug for 5D tensors (#147667)
For some reason MPSGraph returns repeated values if the tensor dimension is
larger than 4, which can be clearly seen by running the following
```swift
import Metal
import MetalPerformanceShadersGraph

func randMPS(device: MTLDevice, obuf: MTLBuffer, nelem: Int, ndim: Int = 5) {
  let graph = MPSGraph()
  var dims = Array(repeating: 1, count: ndim)
  dims[0] = nelem
  let shape = dims.map { NSNumber(value: $0) }
  let randNode = graph.randomUniformTensor(withShape: shape, seed: 42, name: nil)
  let mpsOutputBuffer = MPSGraphTensorData(obuf, shape: shape, dataType: .float32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [:], targetOperations: nil, resultsDictionary: [randNode: mpsOutputBuffer])
}

func printBuf(_ prefix: String, buf: MTLBuffer, nelem: Int) {
  let buf_data = buf.contents().assumingMemoryBound(to: Float.self)
  print(prefix)
  for i in 0..<nelem {
      print(buf_data[i], terminator: i != nelem - 1 ? " " : "\n")
  }
}

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
print("Using device \(device.name)")

let nelem = 2
guard let buf = device.makeBuffer(length:nelem * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 4)
printBuf("4D uniform", buf: buf, nelem: nelem)

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 5)
printBuf("5D uniform", buf: buf, nelem: nelem)
```

Work around by flattening the tensor if it's contiguous
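A minimal sketch of the 5d case (before the workaround, samples were repeated along the leading dim):
```python
import torch

x = torch.rand(4, 1, 1, 1, 1, device="mps")
print(x.flatten())  # should print four distinct uniform samples
```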

Fixes https://github.com/pytorch/pytorch/issues/147624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147667
Approved by: https://github.com/dcci
2025-02-23 16:52:01 +00:00
Nikita Shulga
198ffbdf11 [MPS] Implement and test round.decimals (#147266)
If inductor can do it, why not eager
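A minimal usage sketch:
```python
import torch

x = torch.tensor([1.2345, -0.987], device="mps")
print(torch.round(x, decimals=2))  # -> [1.23, -0.99]
```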
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147266
Approved by: https://github.com/Skylion007
ghstack dependencies: #147286
2025-02-16 23:17:13 +00:00
tim
b9a22b3f37 bug fix: ensure 4d input in _scaled_dot_product_attention_math_mps (#146623)
This PR addresses an issue in the MPS backend where `_scaled_dot_product_attention_math_mps` cannot automatically treat a 3d input like (num_heads, seq_len, query_dim) as (1, num_heads, seq_len, query_dim), as CPU and CUDA can; this is circumvented by adding a util function to ensure a 4d shape.

The issue was found in https://github.com/hiyouga/LLaMA-Factory/issues/6835, in [transformers qwen2_vl](1590c66430/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (L373C14-L373C93)), where 3d q/k/v were passed into the sdpa function, which led to an error.

Considering consistency, since this pattern might pop up elsewhere in the transformers codebase, I think it makes more sense to maintain the same intuition across all platforms.

---
reproduce code:
```python
import torch
import torch.nn.functional as F

head_num, seq_len, embed_dim = 16, 16, 80
bsz = 1

q = torch.randn(head_num, seq_len, embed_dim)
k = torch.randn(head_num, seq_len, embed_dim)
v = torch.randn(head_num, seq_len, embed_dim)
attention_mask = torch.ones(1, seq_len, seq_len)

oo_cpu = F.scaled_dot_product_attention(
    q.to("cpu"),
    k.to("cpu"),
    v.to("cpu"),
    attention_mask.to("cpu"),
    dropout_p=0.0
)

if torch.backends.mps.is_available():
    oo_mps = F.scaled_dot_product_attention(
        q.to("mps"),
        k.to("mps"),
        v.to("mps"),
        attention_mask.to("mps"),
        dropout_p=0.0
    )
    assert torch.allclose(oo_cpu, oo_mps.to("cpu"), atol=1e-5)
```

error outputs:
```
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/torch-dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-5169b8d2c5dd>", line 21, in <module>
    oo_mps = F.scaled_dot_product_attention(
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```

hardware and envs:
```
torch               2.6.0
apple m3 max
```

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146623
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-13 07:00:51 +00:00
Isalia20
17a808557c [MPS] cholesky ex version (#146799)
PR #145701 didn't include the `ex` version of cholesky. This PR adds that version
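A minimal usage sketch; like other `_ex` variants, it reports failures via an `info` tensor instead of raising:
```python
import torch

A = torch.rand(2, 3, 3, device="mps")
A = A @ A.mT + torch.eye(3, device="mps")  # make each batch element positive definite
L, info = torch.linalg.cholesky_ex(A)
assert (info == 0).all()  # nonzero info would flag a non-positive-definite input
```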

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146799
Approved by: https://github.com/malfet
2025-02-13 07:00:21 +00:00
Isalia20
d763093b49 [MPS] fix lu factor for large tensors with bs>1 (#146753)
Try this:
```python
import torch

batch_size = 2
A = torch.eye(256, device="mps")[None, :, :].expand(batch_size, -1, -1) + 0.1 * torch.randn((batch_size, 256, 256), device="mps")
A_cpu = A.cpu()
LU_cpu, pivots_cpu = torch.linalg.lu_factor(A_cpu)
LU, pivots = torch.linalg.lu_factor(A)
torch.testing.assert_close(LU.cpu(), LU_cpu)
```
You'll get a huge difference in the LU tensors
<img width="706" alt="Screenshot 2025-02-08 at 12 14 39" src="https://github.com/user-attachments/assets/b45f2b3c-e0a5-49c8-aa07-42792150b781" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146753
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-11 00:37:07 +00:00
Davide Italiano
dfe3b64282 [mps] Implement eager support for spherical_bessel_j0 (#146818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146818
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-10 16:58:05 +00:00
Nikita Shulga
611ca163fd [MPS] Add bilineard2d_aa implementation (#145526)
An interesting quirk of the algorithm, which is not very well documented, is that the value of align_corners is ignored in antialias mode; see the arguments of
e8304f08fe/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L747-L751)

Error out on the uint8 implementation (as it relies on very fragile integer arithmetic), since it's not implemented on any other accelerator devices at the moment.
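A minimal usage sketch of the antialiased path:
```python
import torch
import torch.nn.functional as F

# Note: per the quirk above, align_corners is ignored when antialias=True.
x = torch.rand(1, 3, 64, 64, device="mps")
y = F.interpolate(x, scale_factor=0.5, mode="bilinear", antialias=True)
print(y.shape)  # torch.Size([1, 3, 32, 32])
```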
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145526
Approved by: https://github.com/dcci
2025-02-10 15:03:14 +00:00
Isalia20
0ab67299c3 [MPS] lu unpack (#146681)
Implements the lu_unpack function on MPS. Haven't added new tests because they are covered by removing lu_unpack from the UNIMPLEMENTED_XFAILLIST in test_mps, via the `test_output_match` function
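A minimal usage sketch:
```python
import torch

A = torch.rand(3, 3, device="mps")
LU, pivots = torch.linalg.lu_factor(A)
P, L, U = torch.lu_unpack(LU, pivots)
torch.testing.assert_close(P @ L @ U, A)  # round-trips the factorization
```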
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146681
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-08 00:16:17 +00:00
Nikita Shulga
624d94bdb8 [MPS] Extend torch.special.sinc to complex (#146648)
And to integral data types as well

I was too lazy to deduce the formula myself (or write a sympy script), but ChatGPT did a decent job of it, though it forgot that the input must be multiplied by $$\pi$$:
```math
\text{Re}\left(\text{sinc}(x + i y)\right) = \frac{\sin(x)\cosh(y)\, x + \cos(x)\sinh(y)\, y}{x^2 + y^2}
```
```math
\text{Im}\left(\text{sinc}(x + i y)\right) = \frac{\cos(x)\sinh(y)\, x - \sin(x)\cosh(y)\, y}{x^2 + y^2}
```
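A quick numerical sanity check of the formulas above (an editor sketch in plain Python, using sinc(z) = sin(πz)/(πz)):
```python
import cmath
import math

def sinc_formula(x, y):
    # Apply the pi factor noted above, then the real/imaginary formulas.
    u, v = math.pi * x, math.pi * y
    d = u * u + v * v
    re = (math.sin(u) * math.cosh(v) * u + math.cos(u) * math.sinh(v) * v) / d
    im = (math.cos(u) * math.sinh(v) * u - math.sin(u) * math.cosh(v) * v) / d
    return complex(re, im)

z = complex(0.3, -0.7)
expected = cmath.sin(math.pi * z) / (math.pi * z)
assert abs(sinc_formula(z.real, z.imag) - expected) < 1e-12
```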
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146648
Approved by: https://github.com/dcci
2025-02-07 01:12:37 +00:00
Davide Italiano
46390e9a37 [mps] Implement support for sinc() operator (inductor and eager). (#146539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146539
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 16:37:27 +00:00
Isalia20
0dc03134d9 [MPS] linalg solve implementation (#146531)
Fixes #98222
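A minimal usage sketch:
```python
import torch

A = torch.rand(4, 4, device="mps") + 4 * torch.eye(4, device="mps")
b = torch.rand(4, device="mps")
x = torch.linalg.solve(A, b)
torch.testing.assert_close(A @ x, b)  # x solves the system
```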

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146531
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 00:57:49 +00:00
Davide Italiano
8a2000fd42 [MPS] Implement support for zeta (both eager and inductor). (#146465)
A test was failing in inductor (`test_pointwise_zeta`) -- and I realized the operation was also missing from eager.
Implemented it for both, leveraging the same kernel. Happy to split in two (one PR for eager, one for inductor) if folks prefer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146465
Approved by: https://github.com/malfet
2025-02-05 13:55:50 +00:00
Nikita Shulga
aafaf4016f [MPS] Add error checking when dispatching kernel (#146458)
Namely, that the thread-group size should not exceed the maximum thread-group size.
Add a regression test to validate that.
This makes failures like https://github.com/pytorch/pytorch/issues/146430 much easier to detect
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146458
Approved by: https://github.com/dcci
2025-02-05 02:56:40 +00:00
Isalia20
e3643e1e0e [MPS] Add linalg det and fix lu factor for non contiguous tensors (#146279)
Requested in #77764

This PR adds support for linalg.det on MPS and fixes lu_factor for non-contiguous tensors; the previous implementation crashed on any kind of non-contiguous tensor with the error:
```
-[AGXG13XFamilyCommandBuffer blitCommandEncoderCommon:]:833: failed assertion `A command encoder is already encoding to this command buffer'
zsh: abort      python det.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146279
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 06:06:43 +00:00
Isalia20
5d55a6585d [MPS] lu factor ex implementation (#144651)
Implements `torch.linalg.lu_factor_ex`
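A minimal usage sketch; the `_ex` variant returns an `info` tensor instead of raising on failure:
```python
import torch

A = torch.rand(2, 4, 4, device="mps")
LU, pivots, info = torch.linalg.lu_factor_ex(A)
assert (info == 0).all()  # nonzero info would flag a failed factorization
```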
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144651
Approved by: https://github.com/malfet
2025-02-02 15:09:49 +00:00
Nikita Shulga
99a0940991 [MPS] Fix regression in non-contig bitwise ops (#146085)
Caused by https://github.com/pytorch/pytorch/pull/128393, which changed the semantics of `needsGather` and resulted in silent correctness errors on MacOS-15+ if the output tensor is non-contiguous

Fixes https://github.com/pytorch/pytorch/issues/145203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146085
Approved by: https://github.com/dcci
2025-01-30 22:36:56 +00:00
Nikita Shulga
1fdb4d65c0 [MPS] Extend torch.mm/torch.bmm to integral types (#145809)
By using the `naive_mm` kernel, but making sure that accumulation is done over int32 for smaller int types (and float for half and bfloat), as well as adding `naive_bmm`, which follows the same pattern.
Remove the stale restriction on `torch.dot` (which works fine on MacOS-14/15)
This also enables integer op flavors for the following (see the sketch after this list):
- `addmv`
- `einsum`
- `inner`
- `linalg.multi_dot`
- `matmul`
- `mv`
- `tensordot`
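A small editor sketch of the newly enabled integer flavor (values kept small to stay within int8 range):
```python
import torch

a = torch.randint(0, 4, (4, 8), dtype=torch.int8, device="mps")
b = torch.randint(0, 4, (8, 4), dtype=torch.int8, device="mps")
c = torch.mm(a, b)  # accumulated over int32 per the note above
torch.testing.assert_close(c.cpu(), torch.mm(a.cpu(), b.cpu()))
```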
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145809
Approved by: https://github.com/dcci
2025-01-30 19:35:25 +00:00
Isalia20
b75afa2e2e [MPS] cholesky implementation (#145701)
Requested in #77764

Closed #144193  due to a lot of conflicts when rebasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145701
Approved by: https://github.com/malfet
2025-01-27 01:53:03 +00:00
Nikita Shulga
3cf7874ebe [MPS][BE] Implement bilineard2d as shader (#145581)
This significantly improves performance and addresses a correctness problem (to the extent permitted by reducing the precision of the scale-factor computation to float32). The uint8 scaling algorithm mimics the CPU/Pillow implementation
569b785371/src/libImaging/Resample.c (L306-L309)
i.e. using fixed-precision integral arithmetic and rounding the results of horizontal interpolation back to integers before performing the vertical one, which gives technically less accurate results.

But even with those changes, `atol`, `rtol` must be tweaked to `1, 0` when the scale factor is `1/3` or `2/3`, because of the difference in representation of those values as floats and doubles.

Changes in performance can be measured using the following script
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="bilinear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16, torch.uint8]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS                 |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results before
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device            :                MPS                 |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8
Outputs Match     :   True  |  True  |  True  |  False |  True  |  True  |  True  |  True
Average Time (us) :  277.3  | 197.2  | 188.0  | 163.5  | 302.8  | 248.1  | 308.7  | 650.9
```
After (almost **100x** perf gain):
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device            :                MPS                 |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    1.7  |   1.5  |   1.7  |   1.5  | 296.5  | 236.0  | 310.8  | 642.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145581
Approved by: https://github.com/Skylion007
ghstack dependencies: #145578
2025-01-25 21:09:46 +00:00
soulitzer
3a3e2cf90a Remove det_singular OpInfo (#145533)
Fixes https://github.com/pytorch/pytorch/issues/93045 https://github.com/pytorch/pytorch/issues/93044

From previous discussion https://github.com/pytorch/pytorch/issues/93045#issuecomment-1477674083 the resolution is that we're okay with removing this.

Some older attempts:
- https://github.com/pytorch/pytorch/pull/102581
- https://github.com/pytorch/pytorch/pull/109249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145533
Approved by: https://github.com/lezcano, https://github.com/malfet
ghstack dependencies: #145520, #145531
2025-01-25 00:58:03 +00:00
Nikita Shulga
dc9b77cc55 [MPS] Support includes in metal objects (#145087)
Useful for code reuse when building Metal shaders for both eager mode and MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses the shader and embeds the headers it includes, so it can be used in dynamic compilation.
Test using:
 - `TestMetalLibrary.test_metal_include`
 - Moving the `i0`/`i1` implementation to `c10/util/metal_special_math.h` and calling it from the `SpecialOps.metal` shader, which now looks much more compact:
 ```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
   device Tout* output,
   uint index [[thread_position_in_grid]]) {
  output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
2025-01-18 05:35:22 +00:00
Tom Ritchford
46fbd63405 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-17 18:21:22 +00:00
PyTorch MergeBot
46b92c025d Revert "Cholesky mps implementation (#144193)"
This reverts commit 727ae13318.

Reverted https://github.com/pytorch/pytorch/pull/144193 on behalf of https://github.com/malfet due to Alas, inductor changes broke inductor tests, see aa4a1ff027/1 ([comment](https://github.com/pytorch/pytorch/pull/144193#issuecomment-2596938163))
2025-01-16 21:37:32 +00:00
Isalia20
727ae13318 Cholesky mps implementation (#144193)
Requested in #77764

The PR is still in draft because it needs some cleanups and optimizations to at least reach CPU performance. Tasks:
- [x] Make `upper=True` work, only `upper=False` works now
- [x] Code cleanup
- [x] Optimizations (though I might need some help on this; tried my best, maybe there is still more to squeeze out)
- [x] Checks for positive definite input
- [x] Support for (*, N, N) input, currently only supports (B, N, N) input
- [x] Support other dtypes(float16, bfloat16)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144193
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-16 16:26:46 +00:00
Nikita Shulga
46eeef9130 [MPS][BE] Surface syntax errors shader compilation (#144648)
Before this change
```python
>>> import torch
>>> torch.mps._compile_shader('What')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/malfet/miniconda3/envs/py311/lib/python3.11/site-packages/torch/mps/__init__.py", line 157, in _compile_shader
    return torch._C._mps_compileShader(source)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Failed to create metal library, error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
" UserInfo={NSLocalizedDescription=program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
}
```
After this change
```python
>>> import torch
>>> torch.mps._compile_shader('What')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/malfet/git/pytorch/pytorch/torch/mps/__init__.py", line 157, in _compile_shader
    return torch._C._mps_compileShader(source)
SyntaxError: program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144648
Approved by: https://github.com/Skylion007
ghstack dependencies: #144647
2025-01-13 02:03:19 +00:00
Nikita Shulga
92ddb3d3d3 [MPS] Expose MPSProfiler::start/stopCapture to Python (#144561)
I.e., when the `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to wrap the code with `torch.mps.profiler.metal_capture` to generate a gputrace for shaders invoked inside the context manager.

For example, code below:
```python
import torch
import os

def foo(x):
   return x[:,::2].sin() + x[:, 1::2].cos()

if __name__ == "__main__":
    os.environ["MTL_CAPTURE_ENABLED"] = "1"
    x = torch.rand(32, 1024, device="mps")

    with torch.mps.profiler.metal_capture("compiled_shader"):
        torch.compile(foo)(x)
```
should capture the execution of a `torch.compile` generated shader
<img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561
Approved by: https://github.com/manuelcandales
ghstack dependencies: #144559, #144560
2025-01-11 02:05:36 +00:00
Nikita Shulga
e56768f030 [MPS] Fix bitwise shifts for uint8 (#144251)
Previously all bitwise operations were aliased to the same type, but this is wrong for shift ops

Rather than building overly complex logic, let's just instantiate using the shared `scalarToMetalTypeString` helper function

Fixes https://github.com/pytorch/pytorch/issues/144190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144251
Approved by: https://github.com/Skylion007
ghstack dependencies: #144249, #144250
2025-01-06 18:27:16 +00:00
Nikita Shulga
ebeb433e73 [BE] Fix + parametrize test_min_max_nan_propagation (#144250)
- `dtype` was not passed as an argument to `torch.rand` before
- Condition bfloat16 testing on MacOS14+
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144250
Approved by: https://github.com/Skylion007
ghstack dependencies: #144249
2025-01-06 17:49:41 +00:00
Nikita Shulga
11a0663eeb [BE] Parametrize test_min_max (#144249)
It's better to have one unit test per dtype rather than a combined one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144249
Approved by: https://github.com/Skylion007
2025-01-06 17:49:41 +00:00
Davide Italiano
0dc1e6be19 [mps/BE] Fix linter warning/advice. (#144199)
Two spaces before an inline comment according to E261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144199
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-04 20:15:41 +00:00
Joona Havukainen
811c714911 Fix nan propagation for minimum() and maximum() in MPS (#144086)
Fixes #143976

- Moves minimum and maximum operations to use the NaN propagating call into MPSGraph instead of the default one.
 - Adds test for the NaN propagating case to `test_mps.py`.
- Adjusts the inductor metal backend implementation for minimum and maximum to also respect the nan propagation.

Additions by @malfet:
 - Introduce an MPSGraph+PyTorchFixups interface following the [Customizing existing classes](https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/ProgrammingWithObjectiveC/CustomizingExistingClasses/CustomizingExistingClasses.html) tutorial and implement `minimumWithNaNPropagationAndIntFallbackWithPrimaryTensor:`, as `minimumWithNaNPropagationWithPrimaryTensor:` segfaults when called for integral types
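A minimal sketch of the NaN-propagating semantics being fixed:
```python
import torch

a = torch.tensor([1.0, float("nan")], device="mps")
b = torch.tensor([2.0, 2.0], device="mps")
print(torch.minimum(a, b))  # [1., nan]: a NaN in either input propagates
print(torch.maximum(a, b))  # [2., nan]
```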

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144086
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-01-04 18:48:24 +00:00
Steven Zeltmann
6f2451c2e9 [MPS] Add aten::angle (#143449)
This adds an MPS backend implementation for `aten::angle` and `aten::angle_out` (mentioned in issue #77764), following the example #78408.
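A minimal usage sketch:
```python
import torch

z = torch.tensor([1 + 1j, -1 + 0j], device="mps")
print(torch.angle(z))  # [pi/4, pi]
```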

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143449
Approved by: https://github.com/malfet
2025-01-04 15:38:40 +00:00
Nikita Shulga
301c457032 [MPS] Fix nllnd_loss_backward crash with different dtypes (#144170)
Otherwise, invoking it with torch.half inputs but float weights will result in
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
2025-01-03 14:13:18.747151-0800 python[87772:4027380] /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm, line 975: error 'original module failed verification'
/AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification'
```

Test plan: `python -mpytest test/inductor/test_torchinductor.py -k test_nll_loss_backward_mps` should not crash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144170
Approved by: https://github.com/kit1980, https://github.com/Skylion007
ghstack dependencies: #144167, #144162, #144083, #144084
2025-01-04 15:24:55 +00:00
isalia20
22580f160e Multinomial sampling fix on mps for non contiguous tensors (#141515)
Fixes #141457

As for the tests: I looked in `test/test_mps.py` but saw that the `test_multinomial` function is disabled. Glad to add a test where needed if there is some place where the multinomial function is tested on Metal.
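A hedged repro sketch (an editor example; assuming a transposed, hence non-contiguous, probability tensor is what triggered #141457):
```python
import torch

probs = torch.rand(4, 8, device="mps").t()  # non-contiguous (8, 4) distributions
samples = torch.multinomial(probs, num_samples=2)
print(samples.shape)  # torch.Size([8, 2])
```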
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141515
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-04 01:21:37 +00:00
Nikita Shulga
a93e75d1e2 [MPS] Handle implicit cpu-scalar-to-gpu transfer (#144055)
Follow-up after https://github.com/pytorch/pytorch/pull/143934: this check is no longer necessary, and removing it fixes a subset of inductor tests

Before, `pytest test/inductor/test_torchinductor.py -k _mps` reported 463
failed, 291 passed, 32 skipped; after, 456 failed, 298 passed, 32 skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144055
Approved by: https://github.com/Skylion007
2025-01-02 17:12:39 +00:00
Nikita Shulga
c27c788e35 [MPS] Fix torch.add(x,y, alpha=2) crash (#143949)
TODO: as a follow-up PR, replace this weird logic with shaders
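A minimal sketch of the previously crashing call:
```python
import torch

x = torch.rand(3, device="mps")
y = torch.rand(3, device="mps")
torch.testing.assert_close(torch.add(x, y, alpha=2), x + 2 * y)
```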

Fixes https://github.com/pytorch/pytorch/issues/143932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143949
Approved by: https://github.com/Skylion007
ghstack dependencies: #143948
2024-12-30 17:16:29 +00:00
Nikita Shulga
3054aae493 [MPS] Fix fmin/fmax for scalar argument (#143934)
CPU scalar promotion to GPU is allowed for CUDA and should be allowed for MPS as well (at the very least it should not crash)
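A minimal sketch of the promotion (an editor example):
```python
import torch

# The 0-dim CPU tensor is promoted to the MPS device, mirroring CUDA behavior.
x = torch.tensor([1.0, float("nan"), 3.0], device="mps")
print(torch.fmax(x, torch.tensor(2.0)))  # [2., 2., 3.]: fmax prefers the non-NaN operand
```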

Fixes https://github.com/pytorch/pytorch/issues/143933 https://github.com/pytorch/pytorch/issues/142203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143934
Approved by: https://github.com/Skylion007
2024-12-28 17:07:19 +00:00
Joona Havukainen
33c27be017 Workaround for gather_out in MPS backend (#135543)
Avoids an underlying issue in the reshape op in MPS that gets triggered when the input has multiple dimensions but the shape can be squeezed into 1D. The underlying issue is going to be fixed eventually.

Fixes https://github.com/pytorch/pytorch/issues/135240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135543
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-19 18:01:01 +00:00
Tom Ritchford
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
Nikita Shulga
24a18d76c8 [MPS] Use metal shaders for all view ops (#143375)
Before this PR, Metal shaders were used to scatter/gather 1-5 dimensional tensors.
This PR introduces generalized ones that can be used for any dimensionality and, as a result, gets rid of 700+ lines of complex and untested code that might not even work as expected.
The generalized gather shader looks as follows
```metal
kernel void gather_kernel_n(uint linear_index           [[thread_position_in_grid]],
                            constant void * src_        [[buffer(0)]],
                            device void * dst_          [[buffer(1)]],
                            constant uint32_t * size    [[buffer(2)]],
                            constant uint32_t * stride  [[buffer(3)]],
                            constant uint32_t & numel   [[buffer(4)]],
                            constant int32_t & ndim     [[buffer(5)]]) {{
    if (linear_index >= numel) return;

    constant {0} * src = (constant {0} *)src_;
    device {1} * dst = (device {1} *)dst_;

    uint64_t src_offs = 0;
    auto src_idx = linear_index;
    for(int dim = ndim - 1; dim >= 0; --dim) {{
      src_offs += stride[dim] * (src_idx % size[dim]);
      src_idx /= size[dim];
    }}

    dst[linear_index] = cast<{1}>(src[src_offs]);
}}
```

Which, according to the following benchmark
```python
from timeit import default_timer

import torch
import torch.utils.cpp_extension
from torch.utils.benchmark import Measurement, Timer

t = Timer(
    stmt=f"y.copy_(x);torch.mps.synchronize()",
    setup=f"x=torch.rand(4, 5, 16, 64, 33, 24, dtype=torch.float32, device='mps')[:,:,:,:24,:24,];y=torch.empty(x.shape, device=x.device, dtype=x.dtype)",
    language="python", timer=default_timer
)
print(t.blocked_autorange())
```
is almost twice as fast as the previous implementation (i.e. on a MacBook M2 Pro, the MPS version takes 2.9ms vs 1.5ms for the shader one)

On MacOS Sequoia, [`gatherWithUpdatesTensor: indicesTensor:...`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/gather(withupdatestensor:indicestensor:axis:batchdimensions:name:)?language=objc) crashes if invoked with a complex data type, as one can see by running the code below
```swift
import Metal
import MetalPerformanceShadersGraph

func gatherComplexMPS(device: MTLDevice,
                inp_buf: MTLBuffer, idx_buf: MTLBuffer,
                out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .complexFloat32, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.gather(withUpdatesTensor: inputPlaceholder, indicesTensor: indicesPlaceholder, axis: 0, batchDimensions: 0, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues<T>(device: MTLDevice, values: [T]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
print("Using device \(device.name)")

let inp_buf = makeBufferWithValues(device: device, values: [1.0, 2.0 , 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:8 * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

gatherComplexMPS(device: device, inp_buf: inp_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: 4, upd_elem: 4)
```

Fixes https://github.com/pytorch/pytorch/issues/143140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143375
Approved by: https://github.com/albanD
2024-12-18 16:15:46 +00:00
Joona Havukainen
afa313e669 Extend bmm tiling to work up to 2^32 elem in any single output dim (#143095)
The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250).

Fixes #141909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095
Approved by: https://github.com/kulinseth
2024-12-17 16:03:46 +00:00
Yu, Guangye
c1d4d9d3cf [MPS] Support torch.accelerator.synchronize() on mps (#143171)
# Motivation
Support `torch.accelerator.synchronize()` on mps. The root cause is that MPS doesn't support lazy initialization, so we must check whether the current accelerator supports device lazy initialization rather than returning early.

# Additional Context
Add an mps UT to test the code change.
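A minimal usage sketch of the device-generic API:
```python
import torch

if torch.accelerator.is_available():
    x = torch.rand(1024, 1024, device="mps") @ torch.rand(1024, 1024, device="mps")
    torch.accelerator.synchronize()  # waits for the queued MPS work to finish
```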

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143171
Approved by: https://github.com/albanD
2024-12-16 02:18:32 +00:00
Nikita Shulga
8a04018329 [MPS] Fix conv backward for channels last (cont) (#143196)
This is a continuation of https://github.com/pytorch/pytorch/issues/140902 but extends the same logic to the input.

Looks like the existing channels-last logic just produced incorrect results on pre-MacOS-15 versions and fails on MacOS-15, so removing it feels like the right idea

Fixes https://github.com/pytorch/pytorch/issues/142344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143196
Approved by: https://github.com/manuelcandales
2024-12-13 21:32:42 +00:00
George Wigley
e0c8abda76 Fix potentially undefined behaviour in index_put sample input (#143116)
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:

> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.

Currently the sample inputs for `index_put` generate 2 indices. Because they are generated randomly, they could be the same, leading to undefined behaviour if `accumulate=False`.

This PR changes the input generation to only generate a single index if `accumulate=False` preventing duplicate indices and undefined behaviour.
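An editor sketch of why duplicates are only safe with `accumulate=True`:
```python
import torch

t = torch.zeros(4)
idx = torch.tensor([1, 1])  # duplicate index
vals = torch.tensor([1.0, 2.0])
t.index_put_((idx,), vals, accumulate=True)   # deterministic: t[1] == 3.0
t.zero_()
t.index_put_((idx,), vals, accumulate=False)  # undefined: t[1] may end up 1.0 or 2.0
```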

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
2024-12-13 17:59:01 +00:00
Nikita Shulga
95b17f6346 [MPS] Add CompileShader method (#141478)
This allows one to do something like the following
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
   kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
     x[idx] += idx;
   }
")
m.foo(x)
```

And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
2024-12-11 02:00:51 +00:00
PyTorch MergeBot
393cf46f42 Revert "[MPS] Add CompileShader method (#141478)"
This reverts commit 0478fee42d.

Reverted https://github.com/pytorch/pytorch/pull/141478 on behalf of https://github.com/malfet due to Broke doctests, by trying to run MPS example on Linux ([comment](https://github.com/pytorch/pytorch/pull/141478#issuecomment-2533351909))
2024-12-11 00:37:10 +00:00
Nikita Shulga
0478fee42d [MPS] Add CompileShader method (#141478)
This allows one to do something like the following
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
   kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
     x[idx] += idx;
   }
")
m.foo(x)
```

And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
2024-12-10 22:43:17 +00:00
Yu, Guangye
bee445c3a3 [MPS] Support torch.Event for MPS (#142468)
# Motivation
Support `torch.Event` on mps backend.
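A minimal usage sketch (assuming the generic `torch.Event` record/synchronize API now applies to mps):
```python
import torch

x = torch.rand(1024, 1024, device="mps")
ev = torch.Event(device="mps")
y = x @ x
ev.record()       # record the event on the current MPS stream
ev.synchronize()  # block until all work recorded so far has completed
```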

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142468
Approved by: https://github.com/malfet
2024-12-10 21:17:25 +00:00
Nikita Shulga
d6481333ad [MPS] Add scatter_reduce.two (#141948)
This op has been requested 20+ times on https://github.com/pytorch/pytorch/issues/77764 and is just a flavor of the out-of-the-box scatter-reduce, so all it does is redispatch the existing implementation.
Unsupported dtype/reduction type combinations:
 - min/max for int64
 - min/max for int32 on MacOS-14 or older
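
A minimal usage sketch of a supported combination:
```python
import torch

src = torch.tensor([1.0, 2.0, 3.0, 4.0], device="mps")
index = torch.tensor([0, 0, 1, 1], device="mps")
out = torch.zeros(2, device="mps")
print(out.scatter_reduce(0, index, src, reduce="sum"))  # [3., 7.]
```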

The following Swift code demonstrates the problem with the scatterAlongAxis MPS call
```swift
import Metal
import MetalPerformanceShadersGraph

func scatterMPS(device: MTLDevice,
                inp_buf: MTLBuffer, upd_buf: MTLBuffer,
                idx_buf: MTLBuffer, out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .int64, name: nil)
  let updatesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.scatterAlongAxis(0, data: inputPlaceholder, updates: updatesPlaceholder, indices: indicesPlaceholder, mode: .min, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .int64)
  let mpsUpdatesBuffer = MPSGraphTensorData(upd_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .int64)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               updatesPlaceholder: mpsUpdatesBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues(device: MTLDevice, values: [Int64]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: Int64.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
print("Using device \(device.name)")

let inp_elem = 4
let upd_elem = 4
let inp_buf = makeBufferWithValues(device: device, values: [1, 2, 3, 4])
let upd_buf = makeBufferWithValues(device: device, values: [Int64.max - 1, Int64.max - 2 , Int64.max >> 16 , 11])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:inp_elem * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

scatterMPS(device: device,
           inp_buf: inp_buf, upd_buf: upd_buf,
           idx_buf: idx_buf, out_buf: out_buf,
           inp_elem: inp_elem, upd_elem: upd_elem)

let obuf_data = out_buf.contents().assumingMemoryBound(to: Int64.self)
for i in 0..<inp_elem {
    print("out_buf[\(i)] = \(obuf_data[i])")
}
```
This prints `4294967294, 4294967293, 4294967295, 4` instead of the expected `1, 2, 3, 4`, whereas `torch.tensor([[1, 9223372036854775806], [2, 9223372036854775805], [3, 140737488355327], [4, 11]], dtype=torch.int64, device='mps').max(1)` yields the expected results.
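
From the Python side, the redispatched op can be exercised like this (a sketch using a float min-reduction, since int64 min/max is among the unsupported combinations):
```python
import torch

src = torch.tensor([5., 2., 7.], device="mps")
index = torch.tensor([0, 0, 1], device="mps")
out = torch.zeros(2, device="mps").scatter_reduce(
    0, index, src, reduce="amin", include_self=False)
# out[0] = min(5, 2) = 2, out[1] = min(7) = 7
print(out)  # tensor([2., 7.], device='mps:0')
```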
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141948
Approved by: https://github.com/manuelcandales
2024-12-04 04:56:43 +00:00
Roy Hvaara
90f19fee8a [MPS] Convert channels_last_3d to contiguous for input tensor in nn.Conv3d (#141780)
When the input tensor to Conv3d is in the channels_last_3d memory format, the Conv3d op generates incorrect output (see the example image in #141471). This PR checks if the op is 3D and, if so, converts the input tensor to contiguous.

Added a regression test that verifies the output by running the same op on the CPU.
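
A hedged sketch of the kind of check the regression test performs (shapes chosen arbitrarily):
```python
import torch

x = torch.randn(1, 3, 8, 8, 8).to(memory_format=torch.channels_last_3d)
conv = torch.nn.Conv3d(3, 4, kernel_size=3)

ref = conv(x)                        # CPU reference
out = conv.to("mps")(x.to("mps"))    # used to produce garbage for channels_last_3d input
torch.testing.assert_close(out.cpu(), ref, rtol=1e-4, atol=1e-4)
```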

I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?

Fixes #141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
2024-12-01 18:36:53 +00:00
Roy Hvaara
4d5c096a55 [MPS] Add autocast rule for SDPA (#141776)
Fixes #141774
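
A sketch of what the rule enables, assuming `float16` as the MPS autocast dtype:
```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 4, 16, 32, device="mps")
with torch.autocast(device_type="mps", dtype=torch.float16):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.dtype)  # torch.float16 under autocast
```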

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141776
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-29 03:34:03 +00:00
Nikita Shulga
65166d86a3 [MPS] Add regression test for sync deadlock (#141296)
See https://github.com/pytorch/pytorch/pull/140725#issuecomment-2492434870
Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```
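
A sketch of the regression scenario being tested (the kernel body is illustrative only):
```python
import torch

x = torch.zeros(16, device="mps")
lib = torch.mps._compile_shader(
    "kernel void inc(device float* x, uint i [[thread_position_in_grid]]) { x[i] += 1.0; }"
)
lib.inc(x)
torch.mps.synchronize()  # must return promptly instead of hanging in waitUntilCompleted
```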

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141296
Approved by: https://github.com/huydhn
2024-11-22 00:56:33 +00:00
PyTorch MergeBot
5e54cf3687 Revert "Fix MPS synchronize by waiting for root buffer to complete (#140725)"
This reverts commit 9bc9d4cdb4.

Reverted https://github.com/pytorch/pytorch/pull/140725 on behalf of https://github.com/malfet due to It causes deadlocks when I try to run something benchmark from  https://github.com/pytorch/pytorch/pull/127242 ([comment](https://github.com/pytorch/pytorch/pull/140725#issuecomment-2492416501))
2024-11-21 21:56:22 +00:00
Nikita Shulga
a8794fd7df [MPS] Fix conv backward pass for channels last (#141009)
Looks like a regression caused by the use of the strided API, but adding the test revealed (at least in CI) that on Ventura it ran but returned garbage results, so it is fixed by removing all the channels-last logic (which is irrelevant for the strided-API case, as the placeholder already turns the tensor into a correct one)

This also allows one to remove `mem_format_key` and `ns_shape_key` (the latter was redundant even back then, as `mem_format_key` + `getTensorsStringKey(grad_output_t)` already uniquely identified the operation)
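
A hedged repro sketch for the channels-last backward pass (arbitrary shapes):
```python
import torch

x = (torch.randn(1, 3, 8, 8)
     .to(memory_format=torch.channels_last)
     .to("mps")
     .requires_grad_())
conv = torch.nn.Conv2d(3, 4, kernel_size=3, device="mps")
conv(x).sum().backward()   # grads used to be garbage for channels-last inputs
print(x.grad.shape)        # torch.Size([1, 3, 8, 8])
```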

Fixes https://github.com/pytorch/pytorch/issues/140902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141009
Approved by: https://github.com/manuelcandales
2024-11-20 19:50:31 +00:00
Siddharth Kotapati
9bc9d4cdb4 Fix MPS synchronize by waiting for root buffer to complete (#140725)
Makes https://github.com/pytorch/pytorch/issues/139550#issuecomment-2468860559 work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140725
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-11-19 23:10:24 +00:00
Nikita Shulga
9c88b08ac9 [BE] Replace skipIfMPS with expectedFailureMPS (#139940)
Functionally the two decorators are very similar, but one should rely on expectedFailure as much as possible to get a signal when something is fixed (see the sketch after this list).
- Move the `product_version` variable from `test_mps` to common_utils and rename it to `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happen only on MacOS 13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
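
A sketch of the resulting convention (test name and import paths are assumptions, not verified):
```python
# Hypothetical test illustrating the decorator; import paths assumed.
from torch.testing._internal.common_utils import MACOS_VERSION, TestCase
from torch.testing._internal.common_device_type import expectedFailureMPS

class TestFoo(TestCase):
    @expectedFailureMPS  # starts failing loudly once the underlying bug is fixed
    def test_known_broken_op(self, device):
        ...
```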
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2024-11-15 03:48:37 +00:00
Roy Hvaara
b0d681417c [MPS] Reintroduce support for convolutions with output_channels > 65536 (#140726)
This reintroduces support for high channel sizes for convs. The guard for macOS versions < 15.1 is still present to prevent reintroducing #129207.

I'm unsure about the specific macOS version support, but I'm assuming this was fixed in 15.1, and I'm relying on signals from CI for verification. I'm expecting the new test to fail for macOS versions < 15.1, and the old test to start failing for > 15.0. I've added xfails for this and extended the version helpers to support 15.1+.
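
A sketch of the now-supported case (channel count chosen just above the old 2^16 limit):
```python
import torch

conv = torch.nn.Conv2d(2, 65537, kernel_size=1, device="mps")
out = conv(torch.randn(1, 2, 4, 4, device="mps"))
print(out.shape)  # torch.Size([1, 65537, 4, 4]); previously blocked by the channel guard
```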

Fixes #140722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140726
Approved by: https://github.com/malfet
2024-11-14 20:09:01 +00:00
Nikita Shulga
cd6ace1d15 [EZ] Delete unused xfailIfMacOS14_4Plus (#140735)
Issue was fixed by https://github.com/pytorch/pytorch/pull/130038 but decorator remained in place

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140735
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-11-14 20:08:48 +00:00
Nikita Shulga
9d93c27025 Implement unfold_backward on MPS (#135411)
This PR adds a native implementation of unfold_backward as a Metal shader, mostly a copy-n-paste of the algorithms used in the CUDA and CPU implementations. That is, considering `out = in.unfold(dim, size, step)`, the following holds true:
* `out.shape[dim] == (in.shape[dim] - size) / step + 1`
* `out.shape[-1] == size`
* `out.ndim == in.ndim + 1`
The `unfold_backward` Metal kernel receives `grad_in` and returns `grad_out` such that:
* `grad_in.shape == out.shape`
* `grad_out.shape == in.shape`

For each index in `grad_out`, find the elements contributing to it and sum them up. Such an algorithm requires no synchronization between threads.
That is, `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` is in the range [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to [0, `in_dim_size`), and `in_last_idx` equals `out_dim_idx - in_dim_idx * step`. The accumulation step is skipped if `in_last_idx` is outside the [0, size) range.
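
The shape relations can be checked from Python; the backward call below exercises the new kernel:
```python
import torch

x = torch.arange(10.0, device="mps", requires_grad=True)
out = x.unfold(0, 4, 2)  # (dim, size, step) -> shape (4, 4): (10 - 4) // 2 + 1 windows
out.sum().backward()     # dispatches unfold_backward
print(x.grad)            # each element's grad == number of windows covering it
```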

This operator has been requested 16 times on https://github.com/pytorch/pytorch/issues/77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135411
Approved by: https://github.com/manuelcandales

Co-authored-by: Manuel Candales <42380156+manuelcandales@users.noreply.github.com>
2024-11-13 23:04:15 +00:00
zeshengzong
cb71bcc542 Replace clone.detach with detach.clone (#140264)
Fixes #64532

As stated in the issue, replace `clone.detach` with `detach.clone`.
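
The difference in a nutshell (values are identical either way; only autograd bookkeeping differs):
```python
import torch

x = torch.randn(3, requires_grad=True)
y_old = x.clone().detach()  # clone is recorded by autograd, then detached
y_new = x.detach().clone()  # detach first, so the clone is never recorded
assert torch.equal(y_old, y_new)
```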

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264
Approved by: https://github.com/soulitzer
2024-11-13 07:01:02 +00:00
Jiang, Yanbing
f77eb07662 Split int4wo weight packing (#139611)
Fixes https://github.com/pytorch/ao/issues/1117.

This PR separates int4wo weight packing between CPU and other devices, to help implement `INT4CPULayout` in torchao based on https://github.com/pytorch/ao/issues/1117#issuecomment-2451252756.

Now, for CPU, the input `weight` of `_convert_weight_to_int4pack_for_cpu` is [n, k] int32 and the output is [n, k / 2] uint8. The input packed weight of `_weight_int4pack_mm_for_cpu` is likewise [n, k / 2] uint8.
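
A rough sketch of the CPU packing path described above; the second argument is assumed to mirror the `innerKTiles` parameter of the original `_convert_weight_to_int4pack` and may not match the final signature:
```python
import torch

# Assumed signature; illustrative only.
w = torch.randint(0, 16, (64, 128), dtype=torch.int32)   # [n, k] int32
packed = torch._convert_weight_to_int4pack_for_cpu(w, 1)
print(packed.shape, packed.dtype)  # expected: torch.Size([64, 64]) torch.uint8
```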

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139611
Approved by: https://github.com/jerryzh168
2024-11-12 10:12:50 +00:00
Nikita Shulga
f5ffd55a32 [MPS] Add torch.special.i1 op (#140196)
By more-or-less copy-n-pasting 58b661cda2/aten/src/ATen/native/cuda/Math.cuh (L576)

Enable respective tests in test_mps.py
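
A quick parity check against the CPU implementation:
```python
import torch

x = torch.linspace(0.0, 4.0, 9)
torch.testing.assert_close(
    torch.special.i1(x.to("mps")).cpu(),
    torch.special.i1(x))
```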
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140196
Approved by: https://github.com/Skylion007
2024-11-11 16:57:53 +00:00
Nikita Shulga
103cbd7231 [MPS] Restrict MSELoss to floating types (#139960)
Because when invoked with a long dtype it crashes deep inside the MPSGraph framework, and to keep parity with CPU.

Add a test that validates that if the dtype is not floating point, both CPU and MPS implementations error out.
Fix the function name reported for `mse_loss_out_mps`, as `__func__` for any structured op implementation is `impl`.
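
Sketch of the behavior the new test checks:
```python
import torch

for device in ("cpu", "mps"):
    a = torch.ones(3, dtype=torch.long, device=device)
    try:
        torch.nn.functional.mse_loss(a, a)
    except RuntimeError as e:
        print(device, "raises:", type(e).__name__)  # both error out for long dtype
```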

Fixes https://github.com/pytorch/pytorch/issues/139723
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139960
Approved by: https://github.com/kimishpatel
ghstack dependencies: #139961, #139959
2024-11-08 00:28:54 +00:00
Sun, Jiayi
44df6522ee add Half/BFloat16 support for grid_sample on CPU (#134812)
Fix https://github.com/pytorch/pytorch/issues/127224.
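
A minimal example of the newly supported dtypes:
```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 4, 4, dtype=torch.bfloat16)
grid = torch.zeros(1, 2, 2, 2, dtype=torch.bfloat16)  # sample the center everywhere
out = F.grid_sample(inp, grid, align_corners=False)
print(out.dtype)  # torch.bfloat16
```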

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812
Approved by: https://github.com/Skylion007, https://github.com/mingfeima
2024-11-06 14:02:08 +00:00
Mikayla Gawarecki
ca43ecd599 Flip default on weights_only (#137602)
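In effect, `torch.load` now defaults to `weights_only=True`, so loading checkpoints that pickle arbitrary Python objects requires an explicit opt-out (file name hypothetical):
```python
import torch

state = torch.load("model.pt")                       # new default: weights_only=True
state = torch.load("model.pt", weights_only=False)   # old behavior; trusted files only
```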
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137602
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #138936, #139221, #139433, #139541
2024-11-04 18:30:29 +00:00
Nikita Shulga
51adab0829 [MPS] Fix reduction ops outputs for empty tensors (#139446)
By adding a switch over all reduction types that either sets the output to a given identity value or raises a runtime error.
Before this change, reduction ops returned uninitialized values in many cases.
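
For example, matching CPU semantics:
```python
import torch

e = torch.empty(0, device="mps")
print(e.sum())   # tensor(0., device='mps:0') -- identity for sum
print(e.prod())  # tensor(1., device='mps:0') -- identity for prod
try:
    e.max()      # no identity for max over an empty tensor
except RuntimeError as err:
    print(err)
```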

Fixes https://github.com/pytorch/pytorch/issues/139400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139446
Approved by: https://github.com/Skylion007
2024-11-01 17:32:12 +00:00
Yoshimasa Niwa
6e85266a47 [MPS] Fixes SiLU on non-contiguous tensors (#139006)
Similar to #123049; however, `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous, on macOS prior to 15.0.
Originally the problem was found at jy0205/Pyramid-Flow#113.
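
A repro sketch, verified against CPU:
```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8, device="mps").t()  # transpose makes the input non-contiguous
torch.testing.assert_close(F.silu(x).cpu(), F.silu(x.cpu()))
```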
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139006
Approved by: https://github.com/malfet
2024-10-30 15:44:59 +00:00
PyTorch MergeBot
38645e8a3e Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 8aedc649bd.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))
2024-10-29 04:54:37 +00:00
Nikita Shulga
652a2ab93e [BE] Skip print(foo) tests (#139009)
Skipped `test_exponential` and `test_multinomial` because simply printing the result of an operator does not constitute a test. The testing framework does not attempt to interpret the output.
Modified `test_print_non_contiguous` to get the tensors' string representation, which is an equivalent operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139009
Approved by: https://github.com/Skylion007
2024-10-27 18:04:03 +00:00
Scott Wolchok
a3de067975 [PyTorch] Use 128-bit vectors for ARM64 (#137426)
The correct vector length for ARM64 is 128 bits (16
bytes). We were previously using double this, apparently just because
that would be the same length as AVX2.

Differential Revision: [D63984039](https://our.internmc.facebook.com/intern/diff/D63984039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137426
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655, #138716, #138744
2024-10-26 00:20:35 +00:00
Nikita Shulga
1b31248933 [EZ] Fix typo in test_mps.py (#138738)
s/emedding_weight/embedding_weight/

Stolen from 074766d9b4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138738
Approved by: https://github.com/atalman
2024-10-23 22:15:35 +00:00
Tom Ritchford
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829
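
Unlike `unbind`, the `_copy` variant returns tensors that do not alias the source; a quick sketch:
```python
import torch

x = torch.arange(6).reshape(2, 3)
a, b = torch.unbind_copy(x, dim=0)
a[0] = 100
print(x[0, 0])  # tensor(0): the copy does not write through to x
```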

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
Tom Ritchford
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476
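
As with the other `_copy` ops, the result owns its memory; a sketch:
```python
import torch

x = torch.randn(2, 3, 4)
y = torch.permute_copy(x, (2, 0, 1))  # same values as x.permute(2, 0, 1), but a copy
y.zero_()
print(x.abs().sum() > 0)  # tensor(True): x is untouched, unlike with a permute view
```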

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
Nikita Shulga
de16159e56 [MPS] Fix sliced cast (#138314)
This fixes an internal crash due to invalid buffer size computation when the sliced API is used.

Not sure what the purpose of the following was:
```c++
IntArrayRef baseShape;
if (src.is_view()) {
  baseShape = src._base().sizes();
} else {
  baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
  flattenedShaped *= baseShape[i];
}
```
As `flattenedShaped` could be computed much more easily as `[srcBuf length]/src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do.

When someone allocated a buffer to hold, say, uint8 and then view-casted it to float16, the attempt to compute `baseShape` returned the sizes of the original tensor in its dtype, rather than the size in the new dtype.
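
A repro sketch of the sliced view-cast (a uint8 buffer viewed as float16):
```python
import torch

buf = torch.zeros(16, dtype=torch.uint8, device="mps")
half = buf.view(torch.float16)  # 8 fp16 elements over the same storage
print(half[2:6] + 1.0)          # sliced access used to miscompute the buffer size
```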

Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
2024-10-19 05:17:09 +00:00
PyTorch MergeBot
7b39fb5712 Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 9f81270d75.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))
2024-10-18 20:09:40 +00:00
Tom Ritchford
9f81270d75 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-17 21:27:35 +00:00
PyTorch MergeBot
4b3035f2fe Revert "Add decomposition for permute_copy (#130944)"
This reverts commit e7a4ad3b40.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))
2024-10-16 23:18:53 +00:00
Tom Ritchford
e7a4ad3b40 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-15 13:51:20 +00:00
Nikita Shulga
0786b37260 [MPS] Add i0 op (#137849)
More-or-less verbatim copy of 47c8aa8090/aten/src/ATen/native/Math.h (L101)
Plus a bit of a MPS boilerplate code

Update test_mps.py to mark kaiser_window and i0 as passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137849
Approved by: https://github.com/Skylion007
2024-10-14 22:50:01 +00:00
Nikita Shulga
ad38bad766 [MPS] Add tri[lu]_indices (#137648)
Requested in https://github.com/pytorch/pytorch/issues/77764#issuecomment-2402365980
Copy-n-paste kernel implementation from 13cf8360d8/aten/src/ATen/native/cuda/TensorFactories.cu (L92)

though use `float` instead of `double` for square root computation
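
Example output for a 3x3 upper-triangular index set:
```python
import torch

idx = torch.triu_indices(3, 3, device="mps")
print(idx)
# tensor([[0, 0, 0, 1, 1, 2],
#         [0, 1, 2, 1, 2, 2]], device='mps:0')
```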

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137648
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #137601, #137647
2024-10-10 23:41:06 +00:00