Gets rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if base.storage().data_ptr() != other.storage().data_ptr():
```
(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )
Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfiler`, which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
```cpp
if (getMPSProfiler().isCaptureEnabled()) {
getMPSProfiler().startCapture(__func__, mpsStream);
}
...
if (getMPSProfiler().isCapturing()) {
getMPSProfiler().stopCapture(mpsStream);
}
```
that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to 1, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">
- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`
TODOs:
- Add weight pack
- Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
`cumsum` and `cumprod` were (are?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)
A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.
It turns out this issue also affects boolean types:
```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```
In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
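A regression check along these lines (a hedged sketch, not necessarily the exact test added here) confirms the fix by comparing against CPU:
```python
import torch

# Boolean cumsum should count True values without wrapping around at 127.
x_mps = torch.ones(128, dtype=torch.bool, device="mps")
x_cpu = x_mps.cpu()
assert x_mps.cumsum(0)[-1].item() == x_cpu.cumsum(0)[-1].item() == 128
assert x_mps.cumprod(0)[-1].item() == x_cpu.cumprod(0)[-1].item() == 1
```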
Fixes #96614, #106112, #109166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
Validate that all arguments are on MPS devices and that their dtypes are as expected
Fixes cryptic messages like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))"
RuntimeError: Placeholder storage has not been allocated on MPS device!
```
And hard crashes like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))"
```
Fixes https://github.com/pytorch/pytorch/issues/123995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952
Approved by: https://github.com/Skylion007
By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2Gb chunks
Add a regression test, but limit it to machines with 12Gb of RAM or more and MacOS 14+, as on MacOS 13 an attempt to allocate a 4Gb tensor fails with:
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
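A rough sketch of such a check (assuming a machine with enough unified memory and MacOS 14+):
```python
import torch

# A tensor larger than 4Gb forces the MPS<->CPU copy to be split into 2Gb chunks.
n = 2**32 + 16  # slightly over 4Gb of uint8 data
x = torch.ones(n, dtype=torch.uint8, device="mps")
y = x.cpu()
# Spot-check both ends of the buffer rather than comparing 4Gb element-wise.
assert y[0].item() == 1 and y[-1].item() == 1
```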
Fixes https://github.com/pytorch/pytorch/issues/124335
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635
Approved by: https://github.com/kulinseth
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.
Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0
| Repository | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7 | 251.8 | 351.1 | 274.9 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
Fixes the GELU, LeakyReLU and Mish activation functions on non-contiguous tensors (for instance, when a transpose operation was applied to the tensors prior to the MPS operator), for both forward and backward passes.
I also extended tests on the 3 activation functions to check: full-precision and half-precision, contiguous and non-contiguous, and several dims of tensors: scalars, 1D, empty, 2D, > 3D.
I had issues with Mish and GELU activations when asserting the gradients vs. CPU with sum() in some cases, so I reverted to the previous setup by passing a gradient parameter to .backward().
This PR also fixes an issue with LeakyRELU on empty tensors.
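A minimal sketch of the kind of case these fixes cover (a transposed, hence non-contiguous, MPS tensor, forward and backward, checked against CPU):
```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 3, device="mps", requires_grad=True)
y = F.gelu(x.T)                      # non-contiguous input to the MPS op
y.backward(torch.ones_like(y))       # pass an explicit gradient, as in the tests

x_cpu = x.detach().cpu().requires_grad_(True)
y_cpu = F.gelu(x_cpu.T)
y_cpu.backward(torch.ones_like(y_cpu))

torch.testing.assert_close(y.detach().cpu(), y_cpu.detach())
torch.testing.assert_close(x.grad.cpu(), x_cpu.grad)
```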
Fixes #98212, huggingface/transformers#22468, huggingface/transformers#19353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049
Approved by: https://github.com/kulinseth
By creating constants using the input tensor's dtype
One-line reproducer:
```
python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))"
```
Before the change
```
loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
After
```
tensor(-33.7812, device='mps:0', dtype=torch.float16)
```
Fixes https://github.com/pytorch/pytorch/issues/124252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258
Approved by: https://github.com/kulinseth
Fixes #122016 and #123178. This regression is related to an OS side change that requires a slight adjustment from us on the PyTorch side to restore the previous behavior. Additionally we cleared out pre-MacOS13 related workarounds.
Before the fix on MacOS 14.4:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 3., 3.], device='mps:0')
```
After the fix:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 1., 3.], device='mps:0')
```
This also fixes complex number initialization and as such makes `nn.functional.rms_norm` pass on MacOS-14+
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234
Approved by: https://github.com/malfet, https://github.com/kulinseth
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
Right now logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`
So move tolerance definition logic into a shared `_compute_tolerances` function and
only keep differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.
Also, increase tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older and remove GRAD xfaillist for those
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
# Motivation
In the backward of per-parameter sharding FSDP, each rank performs a reduce scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0-th dimension and concatenates all slices along the 1st dimension. Gradient tensors are padded before concatenation when `tensor.size(0) % world_size != 0`.
### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):
Input tensors:
```
AAAA BBB CC
AAAA BBB
BBB
```
Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```
### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):
Input tensors:
```
AAAA BBB CC DD
AAAA BBB 00 DD
BBB DD
000 DD
```
Reduce-scatter-copy-in first pad:
```
AAAA BBB CC DD
AAAA BBB 00 DD
BBB DD
000 DD
```
Then chunk and cat along dim as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```
The performance of reduce-scatter-copy-in is critical to per-parameter sharding FSDP. However, reduce-scatter-copy-in via composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.
# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:
```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```
This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Comparing with baselines of composing existing ATen ops, `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.
## Requirements on input
1. If input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, both negative and non-negative dim are supported.
2. For wrapped_dim, all tensors should have the same size for dimensions 0,...,wrapped_dim-1. There are no requirements for the (wrapped_dim, ...)-th dimensions.
3. num_chunks is expected to be positive.
4. The input tensor list is expected to be non-empty, and each input tensor should have at least 1 element.
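To make the layout concrete, below is a rough Python sketch of the composite behaviour for the 2-D, `dim=0` case used in the examples above; `chunk_cat_reference` is a hypothetical helper for illustration only, not the actual ATen implementation:
```python
import torch

def chunk_cat_reference(tensors, num_chunks):
    # Zero-pad each tensor along dim 0 to a multiple of num_chunks, split it
    # into num_chunks chunks, flatten each chunk, and concatenate the i-th
    # chunks of all tensors to form row i of the output.
    rows = []
    for i in range(num_chunks):
        pieces = []
        for t in tensors:
            padded_len = -(-t.size(0) // num_chunks) * num_chunks  # ceil to a multiple
            padded = torch.nn.functional.pad(t, (0, 0, 0, padded_len - t.size(0)))
            pieces.append(padded.chunk(num_chunks, dim=0)[i].flatten())
        rows.append(torch.cat(pieces))
    return torch.stack(rows)

# Example 1 above: num_chunks=3, A (2x4), B (3x3), C (1x2) -> a 3x9 output
A, B, C = torch.ones(2, 4), 2 * torch.ones(3, 3), 3 * torch.ones(1, 2)
print(chunk_cat_reference([A, B, C], num_chunks=3))
```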
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
By deleting `where_mps` and registering MPS dispatch for `where_kernel`.
As a result of this change, resizing and type-checking logic is shared between the MPS, CPU and CUDA backends.
Add a test case to `TestMPS.test_where` (that should eventually be removed when `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
By just calling `std_mps` and `mean` in sequence
Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer
import torch
from torch.utils.benchmark import Measurement, Timer
def bench_var_mean(
m, n, k,
dtype = torch.float32,
device:str = "cpu",
) -> Measurement:
setup = f"""
x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
"""
t = Timer(
stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
)
return t.blocked_autorange()
for x in [100, 1000]:
rc = bench_var_mean(1000, x, 100, device="mps")
print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec and after the change 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead from switching between native and interpreted runtimes is smaller).
Fixes https://github.com/pytorch/pytorch/issues/119663
TODOs:
- Refactor the codebase and implement proper composite function (that must be faster)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
The former is only available on MacOS 14+, but at least on older MacOS versions it would raise an exception rather than returning a non-conjugated tensor.
A preliminary step for enabling FFT ops (without it `ifft` would never work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS does not yet support double precision types, so downcast them to floats
- Also add `cf`(for complex float) and `ch`(for complex half) to MPSScalar value union
- Fix complex scalar to view promotion by introducing a `legacy_complex_as_view` helper function that casts non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`
Fixes https://github.com/pytorch/pytorch/issues/119088
Test plan: CI (have quite a lot of tests, see new unexpected successes) + `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
Fixes #114285
(However, this still raises a NotImplementedError:
```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, adding the `OpInfo` test in a separate PR to establish a healthy baseline.
Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.
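For reference, a minimal usage sketch of the op under test; unlike `split_with_sizes`, it returns copies rather than views:
```python
import torch

x = torch.arange(10)
out = torch.split_with_sizes_copy(x, [2, 3, 5], dim=0)
assert [t.numel() for t in out] == [2, 3, 5]
# Each piece is a standalone copy, not a view into x's storage.
assert all(t.data_ptr() != x.data_ptr() for t in out)
```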
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K
Solve it by providing a very naive matrix multiplication Metal shader and calling it if the stride size is greater than 32768 elements. Slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well, if tiling is done as follows:
```objc
NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
for (int64_t i = 0; i < M; i += tile_size) {
const auto i_end = std::min(i + tile_size, M);
NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
for (int64_t j = 0; j < K; j += tile_size) {
const auto j_end = std::min(j + tile_size, K);
MPSGraphTensor* tile = nil;
for (int64_t k = 0; k < N; k += tile_size) {
const auto k_end = std::min(k + tile_size, N);
auto selfChunk = [graph sliceTensor:selfTensor
starts:@[ @(i), @(k) ]
ends:@[ @(i_end), @(k_end) ]
strides:@[ @(1), @(1) ]
name:nil];
auto otherChunk = [graph sliceTensor:otherTensor
starts:@[ @(k), @(j) ]
ends:@[ @(k_end), @(j_end) ]
strides:@[ @(1), @(1) ]
name:nil];
auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];
tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
}
[row_chunks addObject:tile];
}
auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
[rows addObject:row];
}
return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```
One can always use the Metal MM by defining the `PYTORCH_MPS_PREFER_METAL` environment variable.
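A hedged repro sketch of the problematic configuration (one dimension above 32768 elements), checked against the CPU result:
```python
import torch

a = torch.rand(64, 33000, device="mps")   # stride(0) = 33000 > 32768
b = torch.rand(33000, 64, device="mps")
torch.testing.assert_close((a @ b).cpu(), a.cpu() @ b.cpu(), atol=1e-2, rtol=1e-2)
```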
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as not yet supporting Complex, as attempts to call them will crash with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix regression re-surfaced in `index_fill` (used to be broken in Monterey, then fixed in Ventura and broken in Sonoma again)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, i.e.
- `assert(a == b)` -> `assert a == b`
- `if(x > y or y < z):`->`if x > y or y < z:`
- And `return('...')` -> `return '...'`
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
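For instance (an illustrative sketch, not a diff taken from this PR):
```python
import operator

pairs = [("b", 2), ("a", 1)]
# Before: sorted(pairs, key=lambda kv: kv[1])
# After: operator.itemgetter expresses the same key function without a lambda.
assert sorted(pairs, key=operator.itemgetter(1)) == [("a", 1), ("b", 2)]
```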
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
By representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` Metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument, which is a fully specialized `cast` function that is a no-op for regular types but special-cased for float->complex and complex->float
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706
The following test would fail without this bug fix:
```
import torch
def test_erfinv():
for device in ['cpu', 'mps']:
x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
y = x[2:].erfinv()
x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
y2 = x2.erfinv()
print(y)
print(y2)
torch.testing.assert_close(y, y2)
print(f"{device} passes.")
test_erfinv()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
Fixes #77818
I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul's work. Quoting the description here:
> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)
Please, let me know if there are any other changes needed and I'll be happy to implement them.
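A minimal sketch of what this enables (a 3D convolution forward and backward on the MPS device, assuming MacOS >= 13.2):
```python
import torch

conv = torch.nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3).to("mps")
x = torch.randn(2, 3, 16, 16, 16, device="mps", requires_grad=True)
y = conv(x)          # forward pass on MPS
y.sum().backward()   # backward pass on MPS
print(y.shape)       # torch.Size([2, 8, 14, 14, 14])
```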
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
The diff looks messy, but this PR essentially does one thing: move non-nll loss tests in the `TestNLLLoss` class to the `TestMPS` class. After doing so, it ends up having two stack tests with the same name `test_stack`; therefore, I renamed one of them to `test_stack_storage_offset`, which is what the test actually does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113312
Approved by: https://github.com/kulinseth
By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
Leak was introduced by https://github.com/pytorch/pytorch/pull/84928
Add regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL
======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
method(*args, **kwargs)
File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory
To execute this test, run the following from the base repo dir:
python test/test_mps.py -k test_copy_cast_no_leak
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 1.102s
FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s
OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s
OK
...
```
Fixes https://github.com/pytorch/pytorch/issues/114096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114197
Approved by: https://github.com/kit1980
Fixes#100764
This PR fixes the unary ops implementation and refactors the binary ops implementation a bit.
For unary ops:
Previously we didn't take into account unary ops that have a non-contiguous/storage-offset output, causing an incorrect result (because the MPS graph kernel always writes the buffer contiguously). Therefore, this PR creates a temporary output tensor for the graph first and then copies the result back to the original output tensor. We currently do not have a better fix than this, I think.
For binary ops, see https://github.com/pytorch/pytorch/pull/97085#discussion_r1140999125
See the added test for repro.
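For illustration, a hedged sketch of the failure mode (not the exact added test): a unary op writing into a non-contiguous output with a storage offset.
```python
import torch

x = torch.rand(3, device="mps")
out = torch.zeros(9, device="mps")[1::3]  # non-contiguous output with a storage offset
torch.neg(x, out=out)
torch.testing.assert_close(out.cpu(), -x.cpu())
```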
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97085
Approved by: https://github.com/malfet
When the value of `fill` is complex, the line `value.toDouble() == 0.0` will error out, saying that converting complex to double will cause overflow. So we should first handle the complex value and then enter this condition.
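A minimal sketch of the case this unblocks (assuming an MPS build with complex dtype support):
```python
import torch

x = torch.zeros(3, dtype=torch.cfloat, device="mps")
x.fill_(1 + 2j)  # previously errored while converting the complex fill value to double
print(x)         # tensor([1.+2.j, 1.+2.j, 1.+2.j], device='mps:0')
```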
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
Fixes#86805
Adds support for sgn to MPS backend.
Notes:
1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.
2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet's recent contributions, however, @Berzeg's implementation works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for mps, so that the backward function works.
3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the pr (he seems to be AWOL since last working on this)
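For reference, a minimal usage sketch of the newly supported op (assuming complex dtypes are available on the MPS build in use):
```python
import torch

z = torch.tensor([3 + 4j, 0j], device="mps")
print(z.sgn())        # tensor([0.6000+0.8000j, 0.0000+0.0000j], device='mps:0')
print(z.cpu().sgn())  # matches the CPU reference
```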
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
The sample inputs are a bit involved because there are a lot of
shenanigans in the derivative formula. Check comments.
This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
Surrounds the `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a no-op for a single-threaded program, but serializes calls to synchronize across threads using the same stream.
Prevents the `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` non-recoverable exception, which is triggered every time one uses PyCharm to inspect tensors on the MPS device
Fixes https://github.com/pytorch/pytorch/issues/100285
### <samp>🤖 Generated by Copilot at 1662ce2</samp>
> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
Fixes issue mentioned in #77764
e.g. https://github.com/pytorch/pytorch/issues/77764#issuecomment-1654111744
Adds MPS support for the following ops:
- lgamma
- mvlgamma
- digamma
- polygamma
The lgamma function does not yet have an MPS backend implementation. I've added one using a custom metal kernel (following John D. Cook's c++ implementation of the log gamma function: https://www.johndcook.com/blog/cpp_gamma/). For the backward pass op, I've added a digamma kernel that follows the cpu+cuda digamma implementation, and for the backward pass of the digamma op, I've added a polygamma + trigamma kernel following, again, the cpu+cuda implementations.
NOTE:
The cpu implementation of the polygamma function incorrectly (as far as I can tell) outputs a finite number for order = 1 and x in the negative integers. The mps implementation correctly outputs infinity. (see https://github.com/pytorch/pytorch/issues/106692)
The polygamma tests currently don't pass because of the error in the cpu+cuda kernels, but also because there are smallish discrepancies near the negative integers between the cpu+cuda and the mps polygamma and trigamma kernels. I'm not sure exactly why this is, but let me know if the discrepancies are too big.
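A quick consistency sketch against the CPU implementation for well-behaved (positive) inputs, with loose tolerances given the discrepancies noted above:
```python
import torch

x = torch.tensor([0.5, 2.5, 7.0])
for op in (torch.lgamma, torch.digamma):
    torch.testing.assert_close(op(x.to("mps")).cpu(), op(x), atol=1e-4, rtol=1e-4)
torch.testing.assert_close(
    torch.polygamma(2, x.to("mps")).cpu(), torch.polygamma(2, x), atol=1e-4, rtol=1e-4
)
```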
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106292
Approved by: https://github.com/kulinseth
Fixes #83196
Now, MPS implementation is blazingly fast.
Though, I have several questions on improving this PR:
1. I copied code from `test_nn.py`. Is there better way to test this?
2. I decided to use `usepixelshuffleorder:YES`. Am I right performance-wise? According to docs:
```
`usePixelShuffleOrder` can be
used to control how the data within spatial blocks is ordered in the
`depthAxis` dimension: with `usePixelShuffleOrder=YES` the values within the
spatial blocks are stored contiguosly within the `depthAxis` dimension whereas
otherwise they are stored interleaved with existing values in the `depthAxis` dimension.
```
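For reference, a minimal usage sketch of `pixel_shuffle` on MPS, checked against the CPU result:
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 3, 3, device="mps")   # channels = 2 * upscale_factor**2
y = F.pixel_shuffle(x, upscale_factor=2)    # -> shape (1, 2, 6, 6)
torch.testing.assert_close(y.cpu(), F.pixel_shuffle(x.cpu(), 2))
```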
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99306
Approved by: https://github.com/kulinseth, https://github.com/malfet
By refactoring `_local_scalar_dense_mps` to use `_empty_like` to allocate the CPU tensor.
Also, print a more reasonable error message when the dst dim is less than the src dim in `mps_copy_`
This fixes regression introduced by https://github.com/pytorch/pytorch/pull/105617 and adds regression test.
### <samp>🤖 Generated by Copilot at abd06e6</samp>
> _Sing, O Muse, of the valiant deeds of the PyTorch developers_
> _Who strive to improve the performance and usability of tensors_
> _And who, with skill and wisdom, fixed a bug in the MPS backend_
> _That caused confusion and dismay to many a user of `item()`_
Fixes https://github.com/pytorch/pytorch/issues/107867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107913
Approved by: https://github.com/albanD
Use `view_as_real` to cast complex into a pair of floats and then it becomes just another binary operator.
Enable `polar` and `view_as_complex` consistency tests, but skip `test_output_grad_match_polar_cpu` as the `mul` operator is not yet supported
Remove redundant `#ifdef __OBJC__` and capture and re-throw exceptions captured during `createCacheBlock` block.
Fixes https://github.com/pytorch/pytorch/issues/78503
TODOs(in followup PRs):
- Implement backwards (requires complex mul and sgn)
- Measure the perf impact of computing the strides on the fly rather than ahead of time (unrelated to this PR)
Partially addresses https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107324
Approved by: https://github.com/albanD
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
- Implement `MPSEventPool` to recycle events.
- Implement python bindings with `torch.mps.Event` class using the MPSEventPool backend. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()`.
- Add API to measure elapsed time between two event recordings.
- Added documentation for Event class to `mps.rst`.
- Added test case to `test_mps.py`.
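A usage sketch based on the bindings listed above (the `enable_timing` constructor flag is an assumption about how timing is opted into):
```python
import torch

start = torch.mps.Event(enable_timing=True)
end = torch.mps.Event(enable_timing=True)

x = torch.randn(4096, 4096, device="mps")
start.record()
y = x @ x
end.record()
torch.mps.synchronize()
print(start.elapsed_time(end))  # elapsed time between the two recordings, in ms
```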
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
Related to #77764
Add support for the cumprod operation (which in turn allows its gradient). This also allows us to compute the gradient of prod since it was blocked behind cumprod in the case where exactly one element of the tensor was 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104688
Approved by: https://github.com/kulinseth
I've added the implementation of erfinv using the algorithm from 4154c8ea15/aten/src/ATen/native/Math.h (L152) in order for the MPS based algorithm to match the CPU automatic test. This PR is using the new metal api calls from https://github.com/pytorch/pytorch/pull/100661
Testing shows MPS has a decent speedup (270x) compared to CPU on a tensor of 100 mil elements.
```
import torch
x = torch.arange(-1, 1, 1e-8) # default cpu tensor
#measure CPU compute time by calling torch.erfinv
time = %timeit -o -q -r 5 torch.erfinv(x)
cpu_time = time.average
print("CPU torch.erfinv time: ", cpu_time)
x = x.to("mps")
# measure MPS compute time
time = %timeit -o -q -r 5 torch.erfinv(x)
mps_time = time.average
print("MPS torch.erfinv time: ", mps_time)
print(f"MPS torch.erfinv is {cpu_time/mps_time*100} percent faster than CPU torch.erfinv")
# compute MSE between MPS and CPU torch.erfinv
x = x.to("cpu")
y_cpu = torch.erfinv(x)
x = x.to("mps")
y_mps = torch.erfinv(x)
y_mps = y_mps.to("cpu")
mask = torch.isfinite(y_cpu) & torch.isfinite(y_mps.to("cpu"))
y_mps = y_mps[mask]
y_cpu = y_cpu[mask]
x = x[mask]
print(f"length of y_mps: {len(y_mps)}, length of y_cpu: {len(y_cpu)}, length of x: {len(x)}")
mse = torch.square(y_cpu - y_mps).mean()
print("MSE between MPS and CPU torch.erfinv: ", mse)
diff = torch.abs(y_cpu - y_mps)
print("Largest difference")
print(f"x: {x[torch.argmax(diff)]}, y_cpu: {y_cpu[torch.argmax(diff)]}, y_mps: {y_mps[torch.argmax(diff)]} , diff = {y_cpu[torch.argmax(diff)] - y_mps[torch.argmax(diff)]}")
```
CPU torch.erfinv time: 2.654937833400254
MPS torch.erfinv time: 0.009831255332002912
MPS torch.erfinv is 27005.07456822776 percent faster than CPU torch.erfinv
length of y_mps: 199999992, length of y_cpu: 199999992, length of x: 199999992
MSE between MPS and CPU torch.erfinv: tensor(4.2339e-14)
Largest difference
x: -0.9999980330467224, y_cpu: -3.363569736480713, y_mps: -3.3635685443878174 , diff = -1.1920928955078125e-06
Fixes https://github.com/pytorch/pytorch/issues/86808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101507
Approved by: https://github.com/kulinseth
Add semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, as it just leads to `register_buffer` being called. The addition of this new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.
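A minimal sketch of the intended usage, based on the description above:
```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning a Buffer registers it, analogous to assigning an nn.Parameter,
        # without an explicit self.register_buffer(...) call.
        self.running = nn.Buffer(torch.zeros(3))
        self.scratch = nn.Buffer(torch.zeros(3), persistent=False)

m = M()
print([name for name, _ in m.named_buffers()])  # ['running', 'scratch']
print(list(m.state_dict().keys()))              # ['running'] (non-persistent buffer excluded)
```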
Fixes #35735
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
Enabling more tests on ASAN; meanwhile, we disable float-divide-by-zero and float-cast-overflow, both of which are also disabled by default in the latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
### <samp>🤖 Generated by Copilot at 8f8d620</samp>
This pull request improves the testing of the `nn.functional.multi_head_attention_forward` function by adding it to the `OpInfo` framework, adjusting the tolerance and skipping criteria for some test cases, and restricting the dtype for the `MetaProgrammingSystem` tests. These changes aim to address the randomness and numerical precision issues of the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100153
Approved by: https://github.com/drisspg
### <samp>🤖 Generated by Copilot at 069fd23</samp>
This pull request enhances the MPS implementation of random operations in `Distributions.mm` and adds more dtype tests for the bernoulli distribution in `test_mps.py`. This improves the performance, correctness, and usability of the MPS backend for PyTorch.
Fixes https://github.com/pytorch/pytorch/issues/100717
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100946
Approved by: https://github.com/kulinseth
In CI, an older MacOS SDK can be used to compile the binary, so add a guard for the availability of the `MPSGraphResizeNearestRoundingModeRoundToEven` enum value.
MPS feature availability checks are deliberately done at runtime (by using `is_macos_13_or_newer` and forward-declaring methods in `MPSGraphVenturaOps.h`) rather than at compile time (by using `#ifdef`s).
Modify error message and XFAIL condition in `test_mps.py` to fail test due to missing conditional on macOS-13.2 or newer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101108
Approved by: https://github.com/kulinseth
Prevent using parallel computing when the deterministic algorithms mode is set.
Fixes#97574
Benchmark:
```
[--------------- index_put_ Deterministic Algorithm Enabled ---------------]
| cpu | mps
1 threads: -----------------------------------------------------------------
Dtype: torch.float32 Features: 1024; Num Indices: 512 | 37 | 49
Dtype: torch.float32 Features: 1024; Num Indices: 1024 | 54 | 50
Dtype: torch.float32 Features: 1024; Num Indices: 2048 | 86 | 50
Dtype: torch.float32 Features: 1024; Num Indices: 4096 | 150 | 49
Times are in microseconds (us).
[-------------- index_put_ Deterministic Algorithm Disabled ---------------]
| cpu | mps
1 threads: -----------------------------------------------------------------
DType: torch.float32 Features: 1024; Num Indices: 512 | 37 | 49
DType: torch.float32 Features: 1024; Num Indices: 1024 | 53 | 49
DType: torch.float32 Features: 1024; Num Indices: 2048 | 86 | 49
DType: torch.float32 Features: 1024; Num Indices: 4096 | 147 | 50
Times are in microseconds (us).
```
### <samp>🤖 Generated by Copilot at ebf2ff3</samp>
Added a deterministic version of `index_put` for MPS tensors that runs on a single thread and can be enabled by a global context flag. Refactored the existing `index_put` function and the kernel selection logic to support both parallel and serial modes. Added a test function to verify the deterministic behavior of `index_put` under different conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97660
Approved by: https://github.com/kulinseth
Previously, when using `self.assertRaisesRegex` to test a raised exception and its regex, the regex wasn't actually compared because mps was not in the `NATIVE_DEVICES`. This PR fixes that by enabling exception regex comparisons for the mps device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100367
Approved by: https://github.com/albanD
Summary: Add new experimental python op (`torch.nonzero_static`) for export. There is NO cuda impl included in this PR
Example:
Say input tensor is `x = torch.tensor([[1, 0], [3, 2]])`
calling regular `nonzero()` on x will give you a tensor `tensor([[0, 0], [1, 0], [1, 1]])`
calling `nonzero_static(x, size=4)` on x will give you a tensor `tensor([[0, 0], [1, 0], [1, 1], [fill_value, fill_value]])` (padded)
calling `nonzero_static(x, size=2)` on x will give you a tensor `tensor([[0, 0], [1, 0]])` (truncated)
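The same example as runnable code (the `fill_value` keyword and its default of -1 are taken from the padding behaviour described above and should be treated as an assumption):
```python
import torch

x = torch.tensor([[1, 0], [3, 2]])
print(torch.nonzero_static(x, size=4, fill_value=-1))
# tensor([[ 0,  0],
#         [ 1,  0],
#         [ 1,  1],
#         [-1, -1]])
print(torch.nonzero_static(x, size=2))
# tensor([[0, 0],
#         [1, 0]])
```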
Test Plan:
**Unit Tests**
```
buck test @mode/dev-nosan //caffe2/test:test_dynamo -- 'caffe2/test:test_dynamo - test_export.py::ExportTests::test_export_with_nonzero_static' -- 'caffe2/test:test_dynamo - test_misc.py::MiscTests::test_nonzero_static'
```
**PT2 Export with `nonzero_static()`**
Example of `GraphModule` in the exported graph
```
def forward(self, x):
arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
nonzero_static_default = torch.ops.aten.nonzero_static.default(arg0, size = 4); arg0 = None
return pytree.tree_unflatten([nonzero_static_default], self._out_spec)
```
Differential Revision: D44324808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97417
Approved by: https://github.com/ezyang
- Only ReflectPad needs the torch checks for input arguments and not the ReplicatePad
- Added a test case
- The failure was originally found in test_modules with test `test_forward_nn_ReplicationPad3d_mps_float32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96988
Approved by: https://github.com/DenisVieriu97
The native implementation of LSTM has been fixed on macOS 13.
On macOS 12, the multi-layer LSTM still has a numerical correctness issue that cannot be resolved on OS's side.
Thus, we fall back the multi-layer LSTM on macOS 12 to LSTMCell iteration. It might have performance impact but will make LSTM on macOS 12 fully usable.
Fixes: #90421
Issues related: #80306, #83144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90909
Approved by: https://github.com/albanD, https://github.com/kulinseth
Fixes crash while running something like `python -c "import torch;x=torch.rand(3, 3, dtype=torch.float16, device='mps');y=x.addcmul(torch.ones(3, device='mps'), torch.ones(3, device='mps'));print(y)"`
Modify `castMPSTensor` to become a no-op if cast is not needed
Define `common_dtype` as the promoted type (`c10::promoteTypes`) of self, tensor1 and tensor2, and cast to whatever the output type is.
Add mixed-types test to `TestMPS.test_addcmul`, though it does not cover
all the permutations
Discovered while looking at https://github.com/pytorch/pytorch/issues/96113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96164
Approved by: https://github.com/kulinseth
I.e. attempt to create tensors of all possible types and make sure that
it raises a structured error for non-MPS types
Also, rename `test_resize_as_all_dtypes_and_devices` to `test_resize_as_mps_dtypes` and `test_resize_all_dtypes_and_devices` to `test_resize_mps_dtypes`, and run both tests for all MPS dtypes (rather than just bool, float16 and bfloat16 as they were running before)
Fixes https://github.com/pytorch/pytorch/issues/95976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95982
Approved by: https://github.com/kulinseth
MPS in macOS13.3 has added support for int64 in reduction ops / cumsum / sort / argsort. This change removes the hard-coded casts and error messages prior macOS 13.3, allowing the op to run natively with int64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95817
Approved by: https://github.com/kulinseth
Reuse the cpu implementation here as currently there is no native roll implementation from the MPS api (if any, please let me know).
Compared to falling back to cpu using `PYTORCH_ENABLE_MPS_FALLBACK=1`, this way we keep tensors on MPS.
Did a small benchmark:
```python
for num in [10, 100, 1000, 10000]:
for shft in [1, 5]:
sz = num * num
x = torch.arange(sz, device="cpu").view(num, num)
s = time.time()
r = torch.roll(x, shft)
cpu_e = time.time() - s
x = torch.arange(sz, device="mps").view(num, num)
s = time.time()
r = torch.roll(x, shft)
mps_e = time.time() - s
print(f"size: ({num}, {num}) shft: {shft} cpu: {cpu_e} mps: {mps_e}")
```
```
size: (10, 10) shft: 1 cpu: 0.00015163421630859375 mps: 0.003078937530517578
size: (10, 10) shft: 5 cpu: 6.794929504394531e-05 mps: 0.0014979839324951172
size: (100, 100) shft: 1 cpu: 0.0001621246337890625 mps: 0.0016200542449951172
size: (100, 100) shft: 5 cpu: 0.00016379356384277344 mps: 0.00154876708984375
size: (1000, 1000) shft: 1 cpu: 0.0022068023681640625 mps: 0.0017690658569335938
size: (1000, 1000) shft: 5 cpu: 0.009071111679077148 mps: 0.0020020008087158203
size: (10000, 10000) shft: 1 cpu: 0.16785407066345215 mps: 0.011695146560668945
size: (10000, 10000) shft: 5 cpu: 0.1160881519317627 mps: 0.011452913284301758
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95168
Approved by: https://github.com/albanD
Add a `mps_ops_modifier` function that adds `unittest.expectedFailure` decorators to the operators that are supposed to fail on MPS.
This allows one to know whether or not an operation will fail, rather than skipping it.
For example:
```
% python test_mps.py -v -k test_output_match_dot
test_output_match_dot_cpu_float32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int16 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int64 (__main__.TestConsistencyCPU) ... expected failure
test_output_match_dot_cpu_uint8 (__main__.TestConsistencyCPU) ... ok
----------------------------------------------------------------------
Ran 5 tests in 0.175s
OK (expected failures=1)
```
Moved a few functions from the blocklist to xfail, and found out that some of the functions in the list actually work, for example `torch.long`.
Also, allow `None` to be used in `ALLOWLIST` instead of specifying all types explicitly (which aligns with `DecorateInfo` semantic)
Eventually, we should get rid of `ALLOWLIST` (i.e. all ops are allowed), keep small `BLOCKLIST` and move the rest to `XFAILLIST`
Add step to print HW/SW info before running MPS tests.
Fix type promotion in `trace_mps_out`
Introduce `MACOS_12_X_XFAILLIST` and skip almost every function for `torch.uint8`, although some of those don't make much sense and feel like a regression from PyTorch-1.13
Re-enabled MPS testing on MacOS 12, as runners seems to be available again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95045
Approved by: https://github.com/albanD
Fixes #91694, fixes #92615
Several transpositions were missing for backward graph in case of `batch_first=True`. The #91694 is not reproduced with `batch_first=False`.
After fixing transpose issue, I finally thought that now I can use LSTM freely in my project. And then I got horrific results on train. Seems related to #92615.
After that I decided to fix LSTM's backward step completely. I collected all my findings in this thread — seems like I succeeded
Funny enough, backward tests were completely disabled before and were not passing:
```python
@unittest.skipIf(True, "Backward of lstm returns wrong result")
def test_lstm_2(self, device="mps", dtype=torch.float32):
```
UPD: the forward pass of the multi-layer version was also wrong due to the incorrect `initState, initCell` slices. Tests were passing because states were inited with zeros. *Accidentally* fixed this too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95137
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/soulitzer
Fixes#94390
Apart from fixing the issue above, this PR also fixes a bug where, when an input tensor can be sliced, a sliced array view is created. This array view seems to be not writable or to have a different storage from the original tensor, causing incorrect results with the in-place `fill`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95113
Approved by: https://github.com/kulinseth
Previously, the "can slice" flag in Placeholder constructor in `OperationUtils.mm` is conditioned on whether the numbers of dimensions of base shape and view shape are the same. This doesn't consider the situation that a view tensor could be the base tensor's sliced and then unsqueezed version, resulting in different num of dims.
For example, if we want to stack `y_mps` and `x_mps` on the last dim:
```
t_mps = torch.tensor([1, 2, 3, 4], device="mps")
x_mps = t_mps[2:] # [3, 4]
y_mps = t_mps[:2] # [1, 2]
res_mps = torch.stack((y_mps, x_mps), dim=-1)
```
the kernel will unsqueeze both of them on the last dim and then concatenate them, which is equivalent to:
```
res_mps = torch.cat((y_mps.unsqueeze(-1), x_mps.unsqueeze(-1)), dim=-1)
```
`x_mps.unsqueeze(-1)` is an unsqueezed and contiguous tensor with a storage offset, this kind of tensors should be sliceable without cloning its storage.
Fixes #87856, fixes #91065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91071
Approved by: https://github.com/kulinseth
Fixes backward pass for bilinear.
Summary of changes:
- The bilinear op is able to produce **contiguous, non-view** tensors with a storage offset, such as: shape=`[1, 1, 1, 1]`, `storage_offset=12`. This seems a weird case, but it is valid, and for these types of tensors we wouldn't be able to gather/scatter since we look at the view flag (which is not set here). This change looks at `storage_offset` only rather than the is_view flag, which is not being set.
- **reduction sum** must return a zeroed-out output if passed an input with 0 elements (e.g. a shape of (0, 5)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94892
Approved by: https://github.com/kulinseth
- Backward pass has to give explicit bias tensor of zeros if none is passed to the op or the bias gradient will not be calculated.
- Fixed bias tensor mistakenly getting overwritten to zeros
- Fixes a crash when the lstm op is called with has_biases set to false. The change takes into account the changed shape of the input params TensorList depending on the bias flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94889
Approved by: https://github.com/DenisVieriu97
- To check for Memory Leaks in `test_mps.py`, set the env-variable `PYTORCH_TEST_MPS_MEM_LEAK_CHECK=1` when running test_mps.py (used CUDA code as reference).
- Added support for the following new python interfaces in MPS module:
`torch.mps.[empty_cache(), set_per_process_memory_fraction(), current_allocated_memory(), driver_allocated_memory()]`
- Renamed `_is_mps_on_macos_13_or_newer()` to `_mps_is_on_macos_13_or_newer()`, and `_is_mps_available()` to `_mps_is_available()` to be consistent in naming with prefix `_mps`.
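A usage sketch of the new interfaces listed above:
```python
import torch

torch.mps.set_per_process_memory_fraction(0.5)  # cap MPS allocations at half the recommended max
x = torch.randn(1024, 1024, device="mps")
print(torch.mps.current_allocated_memory())     # bytes held by live MPS tensors
print(torch.mps.driver_allocated_memory())      # total bytes allocated by the Metal driver
del x
torch.mps.empty_cache()                         # release cached blocks back to the system
```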
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94646
Approved by: https://github.com/malfet
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
`torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`
Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
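A short usage sketch of the new RNG and synchronization interfaces:
```python
import torch

torch.mps.manual_seed(42)
state = torch.mps.get_rng_state()
a = torch.randn(4, device="mps")
torch.mps.set_rng_state(state)        # restore the saved RNG state
b = torch.randn(4, device="mps")
torch.mps.synchronize()               # wait for the MPS stream to finish
print(torch.equal(a.cpu(), b.cpu()))  # True: same state, same numbers
```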
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
Fixes #87219
Implements the new `repeat_interleave` function in `aten/src/ATen/native/mps/operations/Repeat.mm`
Adds it to `aten/src/ATen/native/native_functions.yaml`
Adds new test `test_repeat_interleave` to `test/test_mps.py` (see the sketch below)
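A small usage sketch of the op on MPS:
```python
import torch

x = torch.tensor([1, 2, 3], device="mps")
print(torch.repeat_interleave(x, 2).cpu())        # tensor([1, 1, 2, 2, 3, 3])
repeats = torch.tensor([1, 2, 3], device="mps")
print(torch.repeat_interleave(x, repeats).cpu())  # tensor([1, 2, 2, 3, 3, 3])
```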
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88649
Approved by: https://github.com/kulinseth
Summary:
- Remove redundant bool casts from scatter/gather
- Make the workarounds for scatter/gather (for bool/uint8 data types) OS-specific: use them only on macOS Monterey and ignore them starting with macOS Ventura
- Make all tensors ranked in scatter
Fixes following tests:
```
test_output_match_slice_scatter_cpu_bool
test_output_match_select_scatter_cpu_bool
test_output_match_diagonal_scatter_cpu_bool
test_output_match_repeat_cpu_bool
test_output_match_rot90_cpu_bool
etc..
```
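A minimal sketch of one of the now-passing patterns (`slice_scatter` with bool tensors; the shapes are illustrative assumptions):
```python
import torch

x = torch.zeros(8, dtype=torch.bool, device="mps")
src = torch.ones(4, dtype=torch.bool, device="mps")
out = torch.slice_scatter(x, src, dim=0, start=2, end=6)
print(out.cpu())  # expected: positions 2..5 are True, the rest False
```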
Still failing on macOS Monterey (needs additional investigation):
```
test_output_match_scatter_cpu_bool
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94464
Approved by: https://github.com/kulinseth
- Also fix FP16 correctness issues in several other ops by lowering their FP16 precision in the new list `FP16_LOW_PRECISION_LIST`.
- Add atol/rtol to the `assertEqual()` calls of the gradient tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94567
Approved by: https://github.com/kulinseth
Fixes batchnorm forward/backward pass and layer_norm:
Batchnorm Forward pass:
```
- fix batch_norm_mps_out key
- return 1/sqrt(var+epsilon) instead of var
- return empty tensor for mean and var if train is not enabled
- remove native_batch_norm from block list
```
Batchnorm Backward pass:
```
- add revert calculation for save_var used in backward path
- add backward test for native_batch_norm and _native_batch_norm_legit
```
Layer norm:
```
- remove the duplicate calculation from layer_norm_mps
- enable native_layer_norm backward test
- raise atol rtol for native_layer_norm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94351
Approved by: https://github.com/razarmehr
Calculate nonzero count directly in the nonzero op.
Additionally, synchronize before entering the nonzero op to make sure all previous operations have finished (the output shape is allocated based on the count_nonzero result); see the sketch below.
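A minimal sketch of the op whose output shape depends on the on-device count (hence the synchronization):
```python
import torch

x = torch.tensor([0.0, 1.5, 0.0, -2.0], device="mps")
idx = torch.nonzero(x)  # output shape is (count_nonzero, ndim)
print(idx.cpu())        # tensor([[1], [3]])
```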
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94442
Approved by: https://github.com/kulinseth
- fix num_output_dims calculation
- fix median_out_mps key
- cast tensor sent to sortWithTensor and argSortWithTensor
- note down same issue for unique
- unblock median from blocklist
- adding test_median_int16 test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94489
Approved by: https://github.com/razarmehr
- Fix wrong results in AvgPool2D when `count_include_pad=True`
- Fix issues with adaptive average and max pool2d
- Remove the redundant blocking copies from `AdaptiveMaxPool2d`
- Add `divisor` to cached string key to avoid conflicts
- Add test case when both `ceil_mode` and `count_include_pad` are True (previously failed).
- Clean up redundant code
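A minimal consistency-check sketch for the previously failing `ceil_mode` + `count_include_pad` combination (the sizes are illustrative assumptions):
```python
import torch

pool = torch.nn.AvgPool2d(3, stride=2, padding=1, ceil_mode=True, count_include_pad=True)
x = torch.randn(1, 1, 5, 5)
torch.testing.assert_close(pool(x.to("mps")).cpu(), pool(x))
```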
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94348
Approved by: https://github.com/kulinseth
Skip gather/blit calls in case of strided output - this prevents:
- allocating additional memory for the output
- additional transpose for both the input and output
Fixes:
```
x = torch.rand((256,10), device='mps')
x = x.permute(1,0)
x.exp()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94260
Approved by: https://github.com/razarmehr
Fixes TestConsistency masked_fill for bool data type.
Casting a value > 1 to MPSDataTypeBool will result in 0 instead of 1. This change manually converts the scalar to 0 or 1 when a non-boolean value is cast for a boolean tensor:
```
(inputDataType == MPSDataTypeBool) ? !!value.to<double>() : value.to<double>()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94263
Approved by: https://github.com/razarmehr
There are cases when the arrayViewTensor API cannot be used to handle the view operations, such as when a view dimension is bigger than the base dimension of the tensor, e.g.:
```
base shape: [1, 768, 512, 2] // we cannot slice the base shape in any way to result in first dimension `2`
view shape: [2, 384, 512, 1]
```
In such cases, we need to fall back to the gather code (which detects that this is a slice followed by a reshape) to handle it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94278
Approved by: https://github.com/razarmehr
- Fix correctness issues with nll_loss_backward(), smooth_l1_loss_backward() and cross_entropy_backward() by taking grad_output into account when computing those loss ops
- Add numel()==0 check to prevent crashes
- Clean up and formatting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94226
Approved by: https://github.com/kulinseth
Attempts to fix #92656
BC-breaking! This changes the default of zero_grad in optim and in nn so that grads are set to None instead of zero tensors. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (will probably have to flesh out this note more)
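A minimal sketch of the behaviour change:
```python
import torch

lin = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(lin.parameters(), lr=0.1)
lin(torch.randn(1, 2)).sum().backward()
opt.zero_grad()                   # new default: set_to_none=True
print(lin.weight.grad)            # None instead of a zero tensor
opt.zero_grad(set_to_none=False)  # previous behaviour is still available
```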
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
Fixes https://github.com/pytorch/pytorch/issues/86975
If the destination is a strided MPS tensor and the source is a CPU tensor, we cannot perform a blit directly to copy the memory from the CPU tensor into the MPS tensor. We need to scatter the data into the right indices.
```
a1 = torch.Tensor([[1,2],[3,4], [5,6]]).to(torch.device("mps"))
b1 = torch.Tensor([-1, -1])
a1[1:,1] = b1 # strided MPS destination / contiguous CPU source
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91784
Approved by: https://github.com/kulinseth
Currently, most of the reduction ops are flattening the input tensor to 1D to perform the operation.
This change removes the flattening of the tensors / the unranked placeholders and adds support for multiple axes in all the reduction ops.
- Fixes reduction ops with correctness and shape issues.
- Fixes masked.argmax / masked.argmin. In case of passing inf to argmax / argmin, MPS will return nan as index for these numbers. Casting this nan to Long will make it -1. This change avoids negative values by clamping them to 0 (matching CPU results).
TestConsistency issues fixed:
```
std
var
amax
amin
sum
prod
mean
count_nonzero
masked.amax
masked.amin
masked.mean
masked.prod
masked.std
masked.sum
```
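A small sketch of a multi-axis reduction that no longer relies on flattening to 1D:
```python
import torch

x = torch.randn(2, 3, 4, device="mps")
print(x.sum(dim=(0, 2)).cpu())   # shape [3], reduced over two axes at once
print(x.amax(dim=(1, 2)).cpu())  # shape [2]
```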
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91734
Approved by: https://github.com/kulinseth
- Fixed the memory leak associated with `malloc()`
- Introduced shortened data type strings (optional) to avoid getting extra long cached graph string keys with ops such as cat_out()
- Fixed data type issues in Monterey
- Removed the unused `use_scalar_value` argument from `getTensorsStringKey()`
- Clean up and refactoring
Fixes #89353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91786
Approved by: https://github.com/kulinseth
- Workaround for MaxPool when ceilMode=true
- Workaround for ChannelsLast memory format
- Workaround for divisor_override in AvgPool ops
- Enabled count_include_pad parameter for AvgPool
- Refactoring and clean up of duplicate code
- Enable MaxPool tests in TestConsistency
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91519
Approved by: https://github.com/kulinseth, https://github.com/malfet
- Implemented the following new ops: `upsample_nearest1d_backward`, `upsample_nearest_exact1d`, `upsample_nearest_exact1d_backward`
- Moved Upsample code from Shape.mm to Upsample.mm
- Fallback to CPU for nearest mode on Monterey
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91669
Approved by: https://github.com/malfet
Fixes copies into slices where the input data type is different than the output dtype.
This change removes the cast done before scatter, so we don't have to allocate additional memory to perform the casting. Scatter handles the casting directly now.
device = "mps"
shape = (4, 4)
tensor = torch.randint(10, shape, device=device)
tensor_before = tensor.clone()
res = torch.empty(shape[0], shape[1] * 2, device=device)[:, ::2].copy_(tensor)
torch.testing.assert_close(tensor, tensor_before)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91197
Approved by: https://github.com/razarmehr
Use Prims to implement group_norm, group_norm_backward and mean_var
Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages so that they can be imported from `torch/backends/mps/__init__.py`, as the `torch.ops` alias is defined in
15af4b1cee/torch/__init__.py (L1095)
which is executed last during the init process.
Add `__all__` to `torch/backends/mps/__init__.py` as well as alias all imports as private
Add `TestNNMPS.test_group_norm_backward` that validates no NaNs are generated during the backward pass
Fixes https://github.com/pytorch/pytorch/issues/88331
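A minimal sketch mirroring the NaN check in the new test (the layer sizes are illustrative):
```python
import torch

gn = torch.nn.GroupNorm(3, 6).to("mps")
x = torch.randn(2, 6, 4, 4, device="mps", requires_grad=True)
gn(x).sum().backward()
print(torch.isnan(x.grad).any().cpu())  # expected: tensor(False)
```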
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
The `multiplicationWithPrimaryTensor` and/or `scatterWithDataTensor` APIs have issues handling two f16 tensor inputs, resulting in all-zero outputs. There are issues with int16 and int64 inputs as well.
This PR conditionally casts inputs to f32 if they are not already, and then casts the output back to the source's data type.
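A minimal consistency sketch for the f16 case:
```python
import torch

a = torch.rand(4, dtype=torch.float16, device="mps")
b = torch.rand(4, dtype=torch.float16, device="mps")
out = a * b  # previously could come back as all zeros
torch.testing.assert_close(out.cpu(), a.cpu() * b.cpu())
```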
Fixes #82645.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88542
Approved by: https://github.com/kulinseth
Preparation for the next PR in this stack: #89559.
I replaced
- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).
There were a few instances where the result of `torch.equal` is used directly. In those cases I've replaced it with `(... == ...).all().item()`, sometimes also dropping the `.item()` depending on the context.
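A sketch of the replacement patterns described above:
```python
import torch
from torch.testing import assert_close

a, b = torch.arange(4), torch.arange(4)
# instead of: assert torch.equal(a, b)
assert_close(a, b, rtol=0, atol=0)
# when the boolean result is needed directly:
same = (a == b).all().item()
```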
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
## Summary ⚡
**Aim**: Add support for aten::median for the MPS backend (Fixes #87220)
This is a fresh, clean PR replacing the previous [PR](https://github.com/pytorch/pytorch/pull/88554)
- Implementing the new median function in `aten/src/ATen/native/mps/operations/ReduceOps.mm`
- Adding it to `aten/src/ATen/native/native_functions.yaml`
- Adding it to the existing `test_median`
### **It works like this** 🪶
Median of the entire input tensor on MPS:
`torch.median(mps_inputTensor)`
Median along a dim:
`torch.median(mps_inputTensor, dim=[int], keepdim=[Bool])`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88807
Approved by: https://github.com/kulinseth
Various code cleanup in MPS operations:
- Per @kulinseth's suggestion, move `mpsSupportsCumsum` to `MPSDevice.h` and rename it to
`is_macos_13_or_newer()`
- Move Ventura MPSGraph new operators to `MPSGraphVenturaOps.h` header
- Use `LookupAs` and `CreateCachedGraphAs` to make code more compact
- Formatting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88529
Approved by: https://github.com/kulinseth
Fixes #86744
- Implementing the new `expm1_out_mps` function in `aten/src/ATen/native/mps/operations/UnaryOps.mm`
- Adding it to `aten/src/ATen/native/native_functions.yaml`
- Adding it to existing `test.test_mps.TestNLLLoss.test_unary_ops`
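A small consistency sketch for the new op:
```python
import torch

x = torch.tensor([-1.0, 0.0, 1e-5, 1.0])
torch.testing.assert_close(torch.expm1(x.to("mps")).cpu(), torch.expm1(x))
```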
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87147
Approved by: https://github.com/kulinseth
Enable a test that would have caught https://github.com/pytorch/pytorch/issues/86239
Prior to the fix for that bug, this test fails with
```
_____________________________ TestCommonMPS.test_numpy_ref_mps_where_mps_float32 _____________________________
Traceback (most recent call last):
File "/Users/alex/git/pytorch/test/test_ops.py", line 197, in test_numpy_ref_mps
self.compare_with_reference(
File "/Users/alex/git/pytorch/torch/testing/_internal/common_utils.py", line 2366, in compare_with_reference
actual = torch_fn(t_inp, *t_args, **t_kwargs)
File "/Users/alex/git/pytorch/torch/testing/_internal/opinfo/core.py", line 1068, in __call__
return self.op(*args, **kwargs)
File "/Users/alex/git/pytorch/torch/testing/_internal/common_methods_invocations.py", line 15167, in <lambda>
op=lambda self, condition, other: torch.where(condition, self, other),
RuntimeError: 0'th index 3 of x tensor does not match the other tensors
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87342
Approved by: https://github.com/albanD
A tensor's view into linear storage is described by the following parameters: `.shape`, `.stride()` and `.storage_offset()`.
Only tensors that are representable as 1d-views can be copied from host to device (and vice versa) using a single [`copy(from:sourceOffset:to:destinationOffset:size:)`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/1400767-copyfrombuffer?language=objc) call.
Modify the `copy_to_mps_` function to do the following steps:
- Cast `src` tensor to dst data type if needed
- Expand `src` tensor to `dst` tensor shape
- Clone `src` tensor if it is not stride-contiguous (i.e. cannot be represented by `src.view(src.numel())`)
- Create an empty tensor if `dst` is not stride-contiguous or if its strides are different than the (potentially cloned) `src` strides
- Do a 1D copy from `src` to the (potentially temporary) `dst`
- Finally do re-striding/copy on MPS if needed
Add a test to cover the cases where a stride-contiguous permuted tensor is copied to MPS, a non-stride-contiguous tensor is copied to MPS, and a permuted CPU tensor is copied to a differently permuted MPS tensor.
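A minimal sketch of one of the cases the new test covers, a permuted (not stride-contiguous) host-to-device copy:
```python
import torch

src = torch.randn(4, 3).t()    # permuted CPU tensor, not representable as a 1D view
dst = torch.empty(3, 4, device="mps")
dst.copy_(src)                 # exercises the clone + 1D-copy + re-stride path
torch.testing.assert_close(dst.cpu(), src)
```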
Fixes https://github.com/pytorch/pytorch/issues/86954
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86956
Approved by: https://github.com/kulinseth
Also, make sure it raises catchable errors if invoked with integral types.
Otherwise, it used to fail with the following fatal error when invoked for `torch.half`, and with similar failures when invoked for integral types:
```
loc("mps_multiply"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/4883e71d-37bd-11ed-b0ef-b25c5e9b9057/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<2xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
Modified `test_gelu_simple` to check both fwd and backward gradients for gelu
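A sketch of the error-handling expectation after this change (the exact message may differ):
```python
import torch
import torch.nn.functional as F

try:
    F.gelu(torch.ones(4, dtype=torch.int64, device="mps"))
except RuntimeError as e:
    print("caught:", e)  # a catchable error instead of a hard fatal failure
```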