Commit Graph

16 Commits

Author SHA1 Message Date
Nikita Shulga
e2a5c42e7e [BE][MPS] Build metal kernels of MacOS-14+ (#159733)
Which makes `#if __METAL_VERSION__ >= 310` guards for `bfloat` use support unnecessary.
Rename `kernels_bfloat.metallib` into `kernels_basic` and remove custom build/selection logic.

Part of https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733
Approved by: https://github.com/dcci
ghstack dependencies: #159731, #159732
2025-08-03 20:53:58 +00:00
Xuehai Pan
402ae09e41 [BE] fix typos in c10/ (#156078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156078
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-06-18 10:24:44 +00:00
Manuel Candales
0f47e76937 [MPS] Implement hardshrink metal kernel (#155304)
Implements the forward and backward hardshrink operators as Metal kernels.
In order to support the lambda parameter, we extend the `exec_unary_kernel`  and `exec_binary_kernel` methods. Now they take an optional Scalar and an optional ScalarType argument. When the optional ScalarType is provided, it overrides the type of the Scalar.
We add a new `REGISTER_UNARY_ALPHA_OP` macro, and modify the existing `REGISTER_BINARY_ALPHA_OP` to support the new feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155304
Approved by: https://github.com/malfet
2025-06-10 18:20:27 +00:00
Nikita Shulga
231eb9902b [MPS][BE] Extend ndim_and_dtypes to 4 elements (#155272)
Metal arguments must be 8 bytes aliged (or may be 16 bytes), so running
any strided (or typecasted) binary op with MTL_DEBUG_LAYER leads to
exception
```
% MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
2025-06-05 15:41:34.201 Python[86653:16826825] Metal API Validation Enabled
test_output_match_add_mps_bfloat16 (__main__.TestConsistencyMPS.test_output_match_add_mps_bfloat16) ...
validateComputeFunctionArguments:1083: failed assertion `Compute Function(add_strided_bfloat_bfloat): argument ndim[0] from buffer(7) with offset(0) and length(12) has space for 12 bytes, but argument has a length(16).'
zsh: abort      MTL_DEBUG_LAYER=1 python3 ../test/test_mps.py -v -k test_output_match_add
```

Extend it to 4 elements and pass output dtype, which will be used by
binary_op later on anyway

Test plan: Run abovementioned command with `MTL_DEBUG_LAYER=1` and make
sure everything passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155272
Approved by: https://github.com/angelayi, https://github.com/dcci, https://github.com/cyyever
2025-06-06 14:20:21 +00:00
Nikita Shulga
792736f9ac [BE][MPS] Pass alpha by reference (#152737)
As it's always a scalar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152737
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515
2025-05-03 08:31:45 +00:00
Nikita Shulga
34e9f0b5c6 [MPS] Migrate mul to TensorIterator (#152515)
What initially supposed to be a very straightforward change resulted in small refactor of binary op tensor generators when  invoked for mixed dtype, which surfaced via `test_output_grad_match_sinc_mps_float16` test failure.

If operands are of different dtype (in particular float16 tensor and float32 scalar), one must perform an operation with `opmath_t` (or `TensorIterator::common_dtype()`) precision, rather than casting both operands to output dtype and performing it then, which can be demonstrated via the following example:
```
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926,  8.5938,  5.9766], dtype=torch.half).mul(torch.pi)
tensor([ -5.8555,  19.4844,  -7.0703, -10.6562,  27.0000,  18.7812],
       dtype=torch.float16)
>>> torch.tensor([-1.8633, 6.2031, -2.2500, -3.3926,  8.5938,  5.9766], dtype=torch.half).mul(torch.tensor(torch.pi, dtype=torch.float16))
tensor([ -5.8516,  19.4844,  -7.0664, -10.6562,  26.9844,  18.7656],
       dtype=torch.float16)
```

Solve this problem for now, but introducing `REGISTER_OPMATH_BINARY_OP` that indicates that operands must be cast to opmath_t, before performing the computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152515
Approved by: https://github.com/Skylion007, https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #152663
2025-05-03 02:35:03 +00:00
Nikita Shulga
a2c553cac6 [Metal] Extend typecasted op support to complex dtypes (#152504)
First of all, by extending `c10:🤘:cast_to` to work correctly with complex dtypes, by introducing two more specializations: one that casts complex to scalar, and another that casts scalar to complex (as default metal typecast will turn `float x` into `float2(x, x)`)

Add ComplexHalf and ComplexFloat enum values to `c10:🤘:ScalarTypes` and handle them in `val_at_offs(ptr, offs, type)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152504
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466, #152479
2025-04-30 05:32:07 +00:00
Siddharth Kotapati
663bcb68ba Implement metal kernel for basic MPS arithmetic ops using TensorIterator (#147644)
Add metal kernels for add, subtract, & lerp ops using TensorIterator. Should help resolve: https://github.com/pytorch/pytorch/issues/143874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147644
Approved by: https://github.com/malfet
2025-04-29 14:24:49 +00:00
Nikita Shulga
3a8171efad [MPS] Preserve in/out dtypes in binary_op name (#150024)
To be consistient with unary op and avoid silent correctness problems if someone will try to invoke the op with unexpected out dtype
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150024
Approved by: https://github.com/dcci
2025-03-26 16:00:43 +00:00
Nikita Shulga
de68ddc68e [MPS] Fix metal ops with different dtypes (#149974)
By implementing `_cast_` flavors of both dense and strided ops. Add regression tests that tests `fmax`/`fmin` for mixed dtypes.

Been dreaded to write this PR for a while, as it end up to be pretty bulky:
 - Adds 1C10_METAL_ALL_TYPES_FUNCTOR` and `c10:🤘:ScalarType` to `c10/metal/common.h` and test that its values always match `c10::ScalarType`
 - Add `c10:🤘:cast_to` to `c10/metal/utils.h` which could be used to cast any scalar metal dtype to any other one, including complex values
 - Implement `val_at_offs<T>(constant void *, long offs, ScalarType dtype)` that is used to dynamically cast types
 - Add `binary_strided_cast` and `binary_dense_cast` that are invoked for output dtype and cast both inputs to that output before performing the op

Benchmark collected on M2Pro that runs fmax for 1 mln element tensors (Times are in microseconds.)

|                                           |  dense-dense  |  transp-transp  |  dense-transp  |  transp-dense  |  dense-scalar  |  dense-bcast |
|-------------------------|---------------|----------------|----------------|----------------|---------------|--------------- |
|      fmax (torch.float16, torch.float16)  |     160.9     |      159.9      |     270.5      |     270.9      |     236.6      |     293.0
|      fmax (torch.float32, torch.float32)  |     176.9     |      171.0      |     273.7      |     293.5      |     242.6      |     294.2
|      fmax (torch.float32, torch.float16)  |     171.4     |      170.9      |     283.6      |     303.0      |     253.7      |     302.3
|      add (torch.float16, torch.float16)   |     218.0     |      223.6      |     221.0      |     222.0      |     214.9      |     218.3
|      add (torch.float32, torch.float32)   |     227.4     |      233.9      |     228.8      |     231.9      |     218.9      |     221.4
|      add (torch.float32, torch.float16)   |     226.1     |      227.5      |     227.5      |     226.9      |     177.0      |     190.8

TODOS:
 - Include input and output dtype in non-cast kernel name
 - Make TensorFactory.h use `C10_METAL_ALL_TYPES_FUNCTOR`
- Extend mixed_dytpes testing via OpInfo

Fixes https://github.com/pytorch/pytorch/issues/149951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149974
Approved by: https://github.com/manuelcandales
2025-03-26 07:03:21 +00:00
Nikita Shulga
8d1db7f39d [MPS][BE] Add c10/metal/common.h (#149955)
That could be shared between host and metal code
So far put only one constant, which is a maximum number of tensor dimentions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149955
Approved by: https://github.com/Skylion007, https://github.com/manuelcandales
2025-03-25 17:37:24 +00:00
Nikita Shulga
112f983056 [MPS] Replace indexed with strided flavor (#149730)
Which renders non-contiguous operations much faster for larger tensors, for example `fmax` of 1000x1000 strides tensors takes 270ms with new algorithm and 430ms with an old one, that needed additional tensor of 3e6 elements to function.

TODO: Add 64-bit indexing logic, as current implementation has the same limitation as `generateKernelDataOffsets`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149730
Approved by: https://github.com/dcci, https://github.com/manuelcandales
2025-03-24 19:37:51 +00:00
Nikita Shulga
f80bee4934 [MPS][BE] Move common binary ops macros to indexing.h (#149263)
And binary op invocation logic to OperationUtils.mm

This is a no-op change, additional sanity checks/logic improvements will be added as followups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149263
Approved by: https://github.com/dcci
ghstack dependencies: #149262
2025-03-16 02:06:40 +00:00
Nikita Shulga
b95889042c [MPS] Introduce strides unary op (#148468)
By adding following template
```metal
template <typename T, typename F>
kernel void unary_strided(
    device result_of<F, T>* output [[buffer(0)]],
    constant T* input [[buffer(1)]],
    constant long* sizes [[buffer(2)]],
    constant long* input_strides [[buffer(3)]],
    constant long* output_strides [[buffer(4)]],
    constant uint& ndim,
    uint index [[thread_position_in_grid]]) {
  F f;
  int pos[max_ndim];
  pos_from_thread_index(int(index), pos, sizes, ndim);
  const auto input_offs = offset_from_coord(pos, input_strides, ndim);
  const auto output_offs = offset_from_coord(pos, output_strides, ndim);
  output[output_offs] = f(input[input_offs]);
}
```
and instantiating it for all existing unary shaders, which eliminates the need to any intermediate copies.
No extra testing are needed as those cases are already covered by `test_output_grad_match_corrcoef_cpu_float32` as well as `test_unary_ops_storage_offset_strided`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148468
Approved by: https://github.com/dcci
2025-03-09 22:30:51 +00:00
Nikita Shulga
842ffea445 [MPS][BE] Towards strided unary ops support (#148449)
Add generic functors kernels and rewrite all existing implementations into functors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148449
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448
2025-03-04 22:22:39 +00:00
Nikita Shulga
00732c3f7e [MPS] Implemented masked_fill_scalar as shader (#147369)
- Move `pos_from_thread_index and `offset_from_pos` from `UnfoldBackward.metal` into `c10/metal/indexing.h` header
- Initial idea were to implement `StridedTensor` and `ConstStridedTensor` and use them to have masked_fill kernel a something simple as the following loop
```metal
ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim);
if (mask[thread_index]) {
  StridedTensor<T> input(input_data, sizes, input_strides, ndim);
  input[thread_index] = val;
}
```
But though it looks elegant and works correctly, performance wise it's much slower that the existing MPS shader (see table below), as int64 divisions on M2 GPU are really slow

- Solved performance issue by implementing 3 flavors of the same shader: `dense`, that is used when both input and mask are dense tensors of the same size, `broadcast`, which is used when `mask` is leading dimensions expandable into input tensor and `strided`  which is a general purpose fallback, but still computes position in the tensors only ones. As result, perf is even better than existing MPS shader for dense and broadcast able tensors.

Performance measured on M2Pro thru different iterations of the same shader

| dtype | MPS | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
| ------|------| -----|   ---- | --- | ---- |
| float32 | 2.8 msec  | 41.6 msec | 26.9 msec | 5 msec | 2.4 msec |
| float16 | 1.86 msec | 38.2 msec| 26.6 msec | 4.6 msec | 1.9 msec |
|bfloat16|1.86 msec |38.3 msec | 26.6 msec | 4.6 msec | 1.9 msec |

And benchmark script
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_mask_fill(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"x.masked_fill(y, -17.0); torch.mps.synchronize()",
        setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_mask_fill(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.masked_fill_() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369
Approved by: https://github.com/dcci
ghstack dependencies: #147977
2025-02-26 18:39:15 +00:00