Commit Graph

743 Commits

Author SHA1 Message Date
jhavukainen
d28868c7e8 Change skipIfs to xfails in test_mps.py for test_isin (#125412)
Follow-up to #124896 to move the added test to use expectedFailure instead of skip.
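For reference, a minimal sketch of the pattern (the decorators are standard `unittest` APIs; the test body is illustrative):
```python
import unittest

class TestIsin(unittest.TestCase):
    # before: @unittest.skip("isin unsupported here") silently hides the failure
    @unittest.expectedFailure  # after: the test still runs; a fix shows up as an unexpected pass
    def test_isin(self):
        self.assertTrue(False)  # stands in for the currently failing assertion

if __name__ == "__main__":
    unittest.main()
```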

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125412
Approved by: https://github.com/kulinseth
2024-05-20 20:23:53 +00:00
Nikita Shulga
b8a706a321 [EZ][BE] Use untyped_storage in tests (#125838)
Gets rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if base.storage().data_ptr() != other.storage().data_ptr():
```

(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )

The respective change to view ops landed back in 2022; see https://github.com/pytorch/pytorch/pull/91414
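For context, a minimal sketch of the migration the warning asks for (tensor names are illustrative):
```python
import torch

base = torch.rand(4)
other = base.view(2, 2)
# deprecated: base.storage().data_ptr() != other.storage().data_ptr()
same_storage = base.untyped_storage().data_ptr() == other.untyped_storage().data_ptr()
print(same_storage)  # True: a view shares its base's storage
```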

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
2024-05-09 14:04:21 +00:00
Nikita Shulga
4e29e80bf0 Run MPS tests on MacOS Sonoma (#125801)
Those runners are on MacOS 14.4.1, so I wonder whether the tests will actually pass CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125801
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-05-09 13:43:12 +00:00
Denis Vieriu
58e045d03c [MPS] Fix strided ELU op (#125692)
Fixes https://github.com/pytorch/pytorch/issues/124834

Summary of changes:

In case of non-contiguous input, the output will be non-contiguous too. Saving the result to a non-contiguous buffer is not currently supported, so we need two steps: first allocate a contiguous buffer, then scatter the result back into the original output.
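A hypothetical minimal repro of the strided case described above (shapes are illustrative):
```python
import torch

x = torch.randn(4, 4, device="mps").t()  # transposed, hence non-contiguous
assert not x.is_contiguous()
y = torch.nn.functional.elu(x)  # computed into a contiguous buffer, then scattered back
print(torch.allclose(y.cpu(), torch.nn.functional.elu(x.cpu())))
```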

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125692
Approved by: https://github.com/kulinseth
2024-05-08 01:34:40 +00:00
Denis Vieriu
ba27548679 [MPS] Remove in place views (causes too many crashes) (#124895)
Fixes https://github.com/pytorch/pytorch/issues/96153

Remove in-place views, as they are a general cause of many crashes.
A proper fix to handle views without copies will come in a different PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124895
Approved by: https://github.com/kulinseth
2024-05-08 01:00:37 +00:00
Denis Vieriu
3fb53bb6a7 [MPS] Fix strided mse_loss (#125696)
Fixes https://github.com/pytorch/pytorch/issues/124621

Summary of changes:
- In case of non-contiguous input, the output will be non-contiguous too. Saving the result to a non-contiguous buffer is not currently supported, so we need two steps: first allocate a contiguous buffer, then scatter the result back into the original output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125696
Approved by: https://github.com/kulinseth
2024-05-08 00:52:26 +00:00
Nikita Shulga
0fd1fc17c3 [MPS] Fix abs for complex types (#125662)
By calling `realPartOfTensor:` if the input type is complex on Sonoma, and falling back to the `at::view_as_real` trick on Ventura.
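A minimal sketch of the `at::view_as_real` trick in Python terms (not the actual Ventura code path):
```python
import torch

z = torch.tensor([3 + 4j], dtype=torch.cfloat)
r = torch.view_as_real(z)            # shape (..., 2): real and imaginary parts
abs_z = r.pow(2).sum(dim=-1).sqrt()  # tensor([5.]), matching z.abs()
print(abs_z, z.abs())
```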

Split `unary_op` template into `unary_op` and `unary_op_noresize`, which skips resize and empty checks

Marked `abs`, `isclose` and `nn.functional.softsign` OpInfo tests as supported by complex types

Fixes https://github.com/pytorch/pytorch/issues/125135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125662
Approved by: https://github.com/kulinseth
2024-05-07 22:15:20 +00:00
Nikita Shulga
30610251ec [MPS] Add naive quantized intmm and .gputrace capture hooks (#125163)
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfile` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
  ```cpp
   if (getMPSProfiler().isCaptureEnabled()) {
     getMPSProfiler().startCapture(__func__, mpsStream);
   }
   ...
   if (getMPSProfiler().isCapturing()) {
     getMPSProfiler().stopCapture(mpsStream);
   }
  ```
  that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to 1, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">

- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`

TODOs:
 - Add weight pack
 - Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
2024-05-03 15:20:39 +00:00
Denis Vieriu
a40d6df448 [MPS] Native nonzero implementation (#125355)
Fixes https://github.com/pytorch/pytorch/issues/124850

Replace the previous MPSGraph nonzero construction with the native nonzero op. For older OSes, fall back to the CPU (the previous implementation was not reliable and was comparable to the CPU in speed).
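Usage is unchanged; only the backing kernel differs (a small illustrative call):
```python
import torch

x = torch.tensor([0, 1, 0, 2], device="mps")
print(torch.nonzero(x))  # tensor([[1], [3]], device='mps:0')
```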

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125355
Approved by: https://github.com/kulinseth
2024-05-03 03:50:58 +00:00
Roy Hvaara
e15da7856c [MPS] Fix overflow in cumsum when dtype is bool (#125318)
`cumsum` and `cumprod` were (are?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)

A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.

It turns out this issue also affects boolean types:

```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```

In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although the output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
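The cast in the fix is equivalent to doing this by hand (reusing the repro above):
```python
import torch

x = torch.ones(128, dtype=torch.bool, device="mps")
print(x.to(torch.int32).cumsum(0)[-1])  # tensor(128, ...), not -128
```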

Fixes #96614 #106112 #109166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
2024-05-03 01:19:24 +00:00
Joona Havukainen
c451d108da Implemented isin_Tensor_Tensor_out for MPS backend (#124896)
Addresses issue #124518, adds isin_Tensor_Tensor_out.

Tests added to test_mps.py.
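A small usage example of the newly supported op (values are illustrative):
```python
import torch

elements = torch.tensor([1, 2, 3], device="mps")
test_elements = torch.tensor([2, 4], device="mps")
print(torch.isin(elements, test_elements))  # tensor([False, True, False], device='mps:0')
```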

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124896
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-05-01 23:14:05 +00:00
Nikita Shulga
5944a53555 [MPS] Fix nextafter for negative values (#125029)
By changing the logic on older MacOS to:
```cpp
bits += ((input > 0) ^ (input > other)) ? 1 : -1;
```
And by using the native `nextafter` on MacOS Sonoma (i.e., if Metal 3.1 is available).
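A quick check of the previously broken negative-value case (values are illustrative):
```python
import torch

x = torch.tensor([-1.0], device="mps")
y = torch.tensor([0.0], device="mps")
print(torch.nextafter(x, y))  # the representable float just above -1.0
```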

TODO:
  - Add tests for infs and denorms

Fixes https://github.com/pytorch/pytorch/issues/124985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125029
Approved by: https://github.com/Skylion007
2024-04-27 02:58:05 +00:00
Nikita Shulga
db3a2d751c [MPS][BE] Error-check linear (#124952)
Validate that all arguments are on MPS devices and dtypes are expected

Fixes cryptic messages like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))"
RuntimeError: Placeholder storage has not been allocated on MPS device!
```
And hard crashes like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))"
```

Fixes https://github.com/pytorch/pytorch/issues/123995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952
Approved by: https://github.com/Skylion007
2024-04-25 23:25:20 +00:00
Nikita Shulga
abf3f90781 [MPS] Fix large copy (#124635)
By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2Gb chunks
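A language-agnostic sketch of the slicing idea (the real code drives `copyFromBuffer:...`; `copy_fn` here is a hypothetical stand-in):
```python
CHUNK = 2 * 1024**3  # 2 GiB per copy command

def chunked_copy(copy_fn, nbytes):
    offset = 0
    while offset < nbytes:
        length = min(CHUNK, nbytes - offset)
        copy_fn(offset, length)  # one bounded copy per iteration
        offset += length
```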

Add a regression test, but limit it to machines with 12Gb of RAM or more and MacOS 14+, as on MacOS 13 an attempt to allocate a 4Gb tensor fails with:
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```

Fixes https://github.com/pytorch/pytorch/issues/124335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635
Approved by: https://github.com/kulinseth
2024-04-22 23:43:11 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Joël Tang
a6a3f2e06b [MPS] Fixes GELU, LeakyRELU and MISH on non-contiguous tensors (#123049)
Fixes the GELU, LeakyRELU and MISH activation functions on non-contiguous tensors (for instance, when a transpose operation was applied to the tensors prior to the MPS operator), for both forward and backward passes.

I also extended tests on the 3 activation functions to check: full-precision and half-precision, contiguous and non-contiguous, and several dims of tensors: scalars, 1D, empty, 2D, > 3D.

I had issues with the Mish and GELU activations when asserting the gradients vs. CPU with sum() in some cases, so I reverted to the previous setup by setting a gradient parameter on .backward().
This PR also fixes an issue with LeakyRELU on empty tensors.

Fixes #98212 huggingface/transformers#22468 huggingface/transformers#19353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049
Approved by: https://github.com/kulinseth
2024-04-21 00:12:32 +00:00
Nikita Shulga
5677128cb8 [MPS] Fix crash with binary_cross_entropy is invoked for half dtypes (#124258)
By creating constants using the input tensor's dtype.

One line reproducer:
```
python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))"
```

Before the change
```
loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
After
```
tensor(-33.7812, device='mps:0', dtype=torch.float16)
```

Fixes https://github.com/pytorch/pytorch/issues/124252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258
Approved by: https://github.com/kulinseth
2024-04-18 15:21:01 +00:00
xinan.lin
6fcbeb3489 [ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256)
Add CPU FP16 support for nll_loss and cross_entropy_loss.
Resolve issue #123328.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-04-18 11:44:38 +00:00
Pearu Peterson
d2b0c0a34e Fix index_reduce sampler filter when op_info.variant_test_name is specified (#123375)
As in the title: `index_reduce` sample must correspond to reduction type specified by `variant_test_name`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123375
Approved by: https://github.com/zou3519, https://github.com/peterbell10
2024-04-17 15:31:28 +00:00
FFFrog
acc466751b Add bfloat16 support to binary_cross_entropy for CPU (#123823)
Fixes #123715

As the title states.

But maybe we should pay attention to https://github.com/pytorch/pytorch/pull/33206, which removed half support for CPU about 4 years ago.
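With this change, a call like the following works on CPU (a minimal illustrative example):
```python
import torch

x = torch.rand(3, dtype=torch.bfloat16)
print(torch.nn.functional.binary_cross_entropy(x, x))
```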

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123823
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-04-17 09:44:07 +00:00
Joona Havukainen
05289a278c Fix for MPS regression in #122016 and #123178 (#123234)
Fixes #122016 and #123178. This regression is related to an OS-side change that requires a slight adjustment on the PyTorch side to restore the previous behavior. Additionally, we cleared out pre-MacOS13 workarounds.

Before the fix on MacOS 14.4:

```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 3., 3.], device='mps:0')
```

After the fix:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 1., 3.], device='mps:0')
```

This also fixes complex number initialization and as such makes `nn.functional.rms_norm` pass on MacOS-14+

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-04-03 23:00:57 +00:00
PyTorch MergeBot
feabb645a7 Revert "Handle transposes in second batch of matrices in bmm (#122194)"
This reverts commit 251ad1232b.

Reverted https://github.com/pytorch/pytorch/pull/122194 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/122194#issuecomment-2032806360))
2024-04-02 18:49:28 +00:00
Kulin Seth
251ad1232b Handle transposes in second batch of matrices in bmm (#122194)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122194
Approved by: https://github.com/DenisVieriu97
2024-04-02 17:48:35 +00:00
Nikita Shulga
4c70ab26ef [MPS] Enable index_select for complex types (#122590)
Surprisingly, as of MacOS 14.4, MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them by using the `at::view_as_real` trick.

Fixes https://github.com/pytorch/pytorch/issues/122427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590
Approved by: https://github.com/Skylion007
2024-03-25 16:57:35 +00:00
andrewor14
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
Roger Lam
40acc84aaf Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri
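With the NaN-propagating variants, MPS now matches CPU semantics (a minimal illustrative check):
```python
import torch

x = torch.tensor([0.5, float("nan")], device="mps")
print(x.clamp(0.0, 1.0))  # tensor([0.5000, nan], device='mps:0'): NaN propagates
```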

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-18 19:38:15 +00:00
PyTorch MergeBot
0cc60a05da Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381)"
This reverts commit ca80d07ac7.

Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR. Please take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))
2024-03-15 23:53:05 +00:00
Roger Lam
ca80d07ac7 Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-15 21:54:50 +00:00
Nikita Shulga
5498804ec2 [MPS] Fix naive matmul for BFloat16 (#121731)
Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate

Fixes https://github.com/pytorch/pytorch/issues/121583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731
Approved by: https://github.com/albanD
2024-03-13 14:34:03 +00:00
Nikita Shulga
07330ff7b6 [MPS][BE] Define _compute_tolerances (#121754)
Right now the logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`.
So move tolerance definition logic into a shared `_compute_tolerances` function and
only keep differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.

Also, increase the tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older, and remove the GRAD xfail list entries for those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
2024-03-13 04:08:06 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
Boyuan Feng
35d3adb4b0 Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
# Motivation

In the backward pass of per-parameter-sharding FSDP, each rank performs a reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along dimension 0 and concatenates all slices along dimension 1. Gradient tensors will be padded before concatenation when tensor.size(0) % world_size != 0.

### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):

Input tensors:
```
AAAA   BBB   CC
AAAA   BBB
       BBB
```

Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```

### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):

Input tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB        DD
       BBB        DD
                  DD
```

Reduce-scatter-copy-in first pads:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Then chunk each padded tensor and concatenate the chunks along dim 1 as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```

The performance of reduce-scatter-copy-in is critical to per-parameter-sharding FSDP. However, implementing reduce-scatter-copy-in by composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.

# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:

```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```

This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and a basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with baselines composed of existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.
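A hypothetical call against the signature quoted above, reusing Example 1's shapes (`world_size = num_chunks = 3`):
```python
import torch

A = torch.ones(2, 4)         # "A" block from Example 1
B = torch.full((3, 3), 2.0)  # "B" block
C = torch.full((1, 2), 3.0)  # "C" block
out = torch.ops.aten._chunk_cat([A, B, C], 0, 3)  # dim=0, num_chunks=3
print(out.shape)  # torch.Size([3, 9]): each input chunked (with zero padding), then cat'd along dim 1
```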

## Requirements on input

1. If input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, both negative and non-negative dim are supported.
2. For wrapped_dim, all tensors should have the same size for dimensions 0, ..., wrapped_dim-1. There are no requirements on dimensions from wrapped_dim onward.
3. num_chunks is expected to be positive.
4. The input tensor list is expected to be non-empty, and each input tensor should have at least 1 element.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
2024-03-08 21:48:12 +00:00
Nikita Shulga
9b03a06288 [BE] [MPS] Fix out resize logic in torch.where (#121476)
By deleting `where_mps`  and registering MPS dispatch for `where_kernel`.
As a result of this change, the resizing and type-checking logic is shared between the MPS, CPU and CUDA backends.

Add a test case to `TestMPS.test_where` (which should eventually be removed, when `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
2024-03-08 18:59:37 +00:00
andrewor14
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
Tugsbayasgalan Manlaibaatar
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
Kai
c59b14163b Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2024-02-26 23:04:52 +00:00
Nikita Shulga
53bfae2c06 [MPS] Add torch.fft. support (#119670)
Increase tolerance for `fft` ops; this warrants further investigation, as it grows larger with larger matrix dimensions (see https://github.com/pytorch/pytorch/issues/120237)

When compiling on MacOS13, implement `+[FakeMPSGraphFFTDescriptor descriptor]` as a redispatch to a real thing.
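A minimal smoke test of the new support (illustrative sizes):
```python
import torch

x = torch.rand(8, device="mps")
print(torch.fft.fft(x))                   # forward FFT, now on MPS
print(torch.fft.ifft(torch.fft.fft(x)))   # round-trips back to (approximately) x
```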

Fixes https://github.com/pytorch/pytorch/issues/78044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119670
Approved by: https://github.com/kulinseth, https://github.com/albanD
2024-02-20 18:23:06 +00:00
Nikita Shulga
eb9a3383c2 [MPS] Add naive std_mean implementation (#119777)
By just calling `std_mps` and `mean` in sequence

Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer

def bench_var_mean(
    m, n, k,
    dtype = torch.float32,
    device:str = "cpu",
) -> Measurement:
    setup = f"""
     x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
    """

    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec, and after, 668 and 684 (which probably means that the GPU is not saturated, but the overhead from switching between native and interpreted runtimes is smaller).

Fixes https://github.com/pytorch/pytorch/issues/119663

TODOs:
 - Refactor the codebase and implement proper composite function (that must be faster)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
2024-02-13 21:51:29 +00:00
Nikita Shulga
15ef52a015 [MPS] Enable conj and conj_physical (#119669)
The former is only available on MacOS 14+, but at least on older MacOS versions it now raises an exception rather than returning a non-conjugated tensor.

Preliminary step for enabling FFT ops (without it `ifft` would never work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
2024-02-13 02:27:51 +00:00
Nikita Shulga
8d8fb9783c [MPS][EZ] Fix cfloat->chalf conversion on MacOS13 (#119681)
By using `view_as_real` when type casting between two complex types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119681
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-12 19:09:10 +00:00
Pearu Peterson
2c91e13afc Add lowerings to special functions (#119187)
As in the title.

In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187
Approved by: https://github.com/peterbell10
2024-02-11 16:35:40 +00:00
Nikita Shulga
4ee8aac432 [MPS] Enable bfloat16 support on MacOS 14 (#119641)
Per the [MPSDataType](https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype/mpsdatatypebfloat16?changes=_11&language=objc) documentation, bfloat16 is supported in MacOS Sonoma or later.
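A minimal illustrative check (requires MacOS Sonoma per the documentation above):
```python
import torch

x = torch.rand(4, dtype=torch.bfloat16, device="mps")
print(x + x)  # basic bfloat16 arithmetic on MPS
```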

Added missing `MPSDataTypeBFloat16` and `MTLLanguageVersion3_1` enums to `MPSGraphSonomaOps.h`

TODO: Enable more testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119641
Approved by: https://github.com/Skylion007
2024-02-11 16:25:29 +00:00
Nikita Shulga
1d61011c11 [MPS] Add support for complex scalars (#119318)
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS has yet to support double-precision types, so downcast them to floats.
- Also add `cf`(for complex float)  and `ch`(for complex half) to MPSScalar value union
- Fix complex-scalar-to-view promotion by introducing a `legacy_complex_as_view` helper function that casts non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`

Fixes https://github.com/pytorch/pytorch/issues/119088

Test plan: CI (have quite a lot of tests, see new unexpected successes) +  `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
2024-02-08 18:10:59 +00:00
watarungurunnn
d444a3b443 [MPS] fix float32 error on mps, in linalg.matrix_rank and linalg.pinv (#114771)
Fixes #114285

However, we still get a `NotImplementedError`:
```
NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
2024-02-05 15:36:55 +00:00
lancerts
26a2743162 Fix placeholder tensor is empty for relu in mps (#118965)
Fixes #118845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118965
Approved by: https://github.com/malfet
2024-02-03 23:50:35 +00:00
Nikita Shulga
24dd9f42ce [MPS] Fix use_metal_mm condition (#118830)
One should look not only at stride sizes but at dimensions as well, as the strides of `torch.rand(65536, 1)` are `(1, 1)`.

Extend test to account for this situation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118830
Approved by: https://github.com/huydhn
2024-02-01 17:53:42 +00:00
Yifu Wang
a1280f0cc6 Add an OpInfo test for split_with_sizes_copy (#118512)
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test the [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, we add the `OpInfo` test in a separate PR to establish a healthy baseline.

Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
2024-02-01 07:09:27 +00:00
Sun, Jiayi
2dd4a254a0 add Half support for interpolate operators on CPU (#105648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105648
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 09:07:16 +00:00
Nikita Shulga
1872834247 [MPS] Fix torch.mm correctness for large matrices (#117549)
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K.
Solve it by providing a very naive matrix-multiplication Metal shader and calling it if the stride size is greater than 32768 elements. Slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well, if tiling is done as follows:
```objc
  NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
  for (int64_t i = 0; i < M; i += tile_size) {
    const auto i_end = std::min(i + tile_size, M);
    NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
    for (int64_t j = 0; j < K; j += tile_size) {
      const auto j_end = std::min(j + tile_size, K);
      MPSGraphTensor* tile = nil;
      for (int64_t k = 0; k < N; k += tile_size) {
        const auto k_end = std::min(k + tile_size, N);
        auto selfChunk = [graph sliceTensor:selfTensor
                                     starts:@[ @(i), @(k) ]
                                       ends:@[ @(i_end), @(k_end) ]
                                    strides:@[ @(1), @(1) ]
                                       name:nil];
        auto otherChunk = [graph sliceTensor:otherTensor
                                      starts:@[ @(k), @(j) ]
                                        ends:@[ @(k_end), @(j_end) ]
                                     strides:@[ @(1), @(1) ]
                                        name:nil];
        auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];

        tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
      }
      [row_chunks addObject:tile];
    }
    auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
    [rows addObject:row];
  }
  return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```

One can always use the Metal MM by defining the `PYTORCH_MPS_PREFER_METAL` environment variable.
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
2024-01-17 01:33:08 +00:00
Nikita Shulga
6784030df4 [MPS] Add support for 64-bit index operations (#116942)
But enable it only if `iter.can_use_32bit_indexing()` is False. Add a test for index_select, but enable it only on Sonoma, as all attempts to create a 4Gb+ tensor on Ventura and older fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116942
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940
2024-01-09 16:56:49 +00:00
Nikita Shulga
ff0f79d3c7 [MPS] Mark torch.[all|any] as working with complex on MacOS14 (#116907)
It was enabled by https://github.com/pytorch/pytorch/pull/116457, but at the time that PR landed, Sonoma testing was still not enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116907
Approved by: https://github.com/osalpekar, https://github.com/kit1980
2024-01-06 01:10:11 +00:00
Nikita Shulga
b0393ebe9b [MPS] Make test_mps.py passable on Sonoma (#116764)
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as not yet supporting Complex, as attempts to call those will crash with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix regression re-surfaced in `index_fill` (used to be broken in Monterey, then fixed in Ventura and broken in Sonoma again)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
2024-01-05 00:25:47 +00:00
Gao Tianlin
6793b99107 [BugFix] Fix SegFault when torch.all/any dispatched to mps or other backends (#116457)
The old implementation will result in an infinite recursive loop, leading to a stack overflow and segfault.

If TORCH_SHOW_DISPATCH_TRACE is on, with a debug build of PyTorch, we can see the following endless output in the terminal:
```
[call] op=[aten::quantize_per_tensor], key=[AutogradCPU]
  [redispatch] op=[aten::quantize_per_tensor], key=[CPU]
 [call] op=[aten::any.dims], key=[AutogradCPU]
  [redispatch] op=[aten::any.dims], key=[QuantizedCPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::any.dims_out], key=[QuantizedCPU]
    [call] op=[aten::any.dims], key=[QuantizedCPU]
     [call] op=[aten::empty.memory_format], key=[BackendSelect]
      [redispatch] op=[aten::empty.memory_format], key=[CPU]
     [call] op=[aten::any.dims_out], key=[QuantizedCPU]
      [call] op=[aten::any.dims], key=[QuantizedCPU]
       [call] op=[aten::empty.memory_format], key=[BackendSelect]
        [redispatch] op=[aten::empty.memory_format], key=[CPU]
       [call] op=[aten::any.dims_out], key=[QuantizedCPU]
        [call] op=[aten::any.dims], key=[QuantizedCPU]
         [call] op=[aten::empty.memory_format], key=[BackendSelect]
          [redispatch] op=[aten::empty.memory_format], key=[CPU]
         [call] op=[aten::any.dims_out], key=[QuantizedCPU]
          [call] op=[aten::any.dims], key=[QuantizedCPU]
           [call] op=[aten::empty.memory_format], key=[BackendSelect]
            [redispatch] op=[aten::empty.memory_format], key=[CPU]
           [call] op=[aten::any.dims_out], key=[QuantizedCPU]
            [call] op=[aten::any.dims], key=[QuantizedCPU]
             [call] op=[aten::empty.memory_format], key=[BackendSelect]
              [redispatch] op=[aten::empty.memory_format], key=[CPU]
             [call] op=[aten::any.dims_out], key=[QuantizedCPU]
              [call] op=[aten::any.dims], key=[QuantizedCPU]
               [call] op=[aten::empty.memory_format], key=[BackendSelect]
                [redispatch] op=[aten::empty.memory_format], key=[CPU]
               [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                [call] op=[aten::any.dims], key=[QuantizedCPU]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                  [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                  [call] op=[aten::any.dims], key=[QuantizedCPU]
.....
.....
.....
```

Fixes #116452
Fixes #116451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116457
Approved by: https://github.com/malfet
2024-01-04 17:37:17 +00:00
Aaron Gokaslan
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parenthesis i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
Nikita Shulga
09ee96b69d [MPS] Fix CrossEntropyLoss for float16 (#116597)
Looks like neither [`divisionNoNaNWithPrimaryTensor:`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3675593-divisionnonanwithprimarytensor) nor `oneHotWithIndicesTensor:` works for `MPSDataTypeFloat16`, so provide an explicit cast for the one-hot tensor and an alternative implementation using the formula from the official doc, i.e.
> `resultTensor = select(secondaryTensor, primaryTensor / secondaryTensor, 0)`
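The doc formula corresponds to the following `torch.where` form (a sketch in Python, not the actual MPSGraph code):
```python
import torch

num = torch.tensor([1.0, 2.0], dtype=torch.half)
denom = torch.tensor([2.0, 0.0], dtype=torch.half)
result = torch.where(denom != 0, num / denom, torch.zeros_like(num))
print(result)  # tensor([0.5000, 0.0000], dtype=torch.float16): no inf/NaN where denom == 0
```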

Alas, at the moment  it can not be tested via `test_modules.py` as it runs only `torch.float32` and `torch.float64` tests (and `torch.half` implementation is not available for CPU)

Fixes https://github.com/pytorch/pytorch/issues/116095

TODO: Enable testing via TestModules, but will do in separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116597
Approved by: https://github.com/kulinseth
2024-01-03 05:58:26 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
PyTorch MergeBot
0978482afa Revert "Implement aten::upsample_linear1d on mps (#115031)"
This reverts commit c6969cb8a9.

Reverted https://github.com/pytorch/pytorch/pull/115031 on behalf of https://github.com/malfet due to Broke lint, will fwd fix and re-land ([comment](https://github.com/pytorch/pytorch/pull/115031#issuecomment-1869693081))
2023-12-26 18:01:49 +00:00
Kai
c6969cb8a9 Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2023-12-26 15:44:21 +00:00
Aaron Gokaslan
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
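A representative instance of the substitution (illustrative data):
```python
import operator

pairs = [(2, "b"), (1, "a")]
# before: sorted(pairs, key=lambda p: p[0])
print(sorted(pairs, key=operator.itemgetter(0)))  # [(1, 'a'), (2, 'b')]
```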

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00
Sun, Jiayi
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8 ,8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8 ,8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
Nikita Shulga
9dda4b20a0 [MPS] Enable select/[broad]cast ops for complex dtypes (#115727)
By representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` Metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument, a fully specialized `cast` function, which is a no-op for regular types but special-cased for float->complex and complex->float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
2023-12-19 02:25:28 +00:00
Peter Pham
74dfdc567b [MPS] aten::erfinv bug fix: add storage offset buffers to handle slicing (#105801)
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706

The following test would fail without this bug fix:

```
import torch
def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()

        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()

        print(y)
        print(y2)

        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
2023-12-15 23:14:03 +00:00
Lucas Steuernagel
2e517b20d9 [MPS] Add Conv3D support for MPS (#114183)
Fixes #77818

I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul's work. Quoting the description here:

> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)

Please, let me know if there are any other changes needed and I'll be happy to implement them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
2023-12-15 23:05:01 +00:00
mingfeima
a8acd6c410 Add Half support for AvgPool2d on CPU (#109578)
Add Half support for AvgPool2d (both channels last and channels first) on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-12-12 12:59:47 +00:00
igm503
f017a1af3f [MPS] add complex_out to MPS backend (#110851)
Adds support for at::complex_out to the MPS backend

Implemented in a binary kernel using the view_as_real pattern for handling complex dtypes in the mps backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110851
Approved by: https://github.com/kulinseth
2023-12-11 13:37:55 +00:00
Li-Huai (Allan) Lin
38e1440bae [MPS] Remove redundant topk test and move all pad tests inside a class (#113313)
Summary:
1. The removed `topk` test is essentially very similar to the following test, so I removed it:
```python
def test_topk(self):
    def helper(shape):
        cpu_x = torch.randn(shape, device='cpu', dtype=torch.float, requires_grad=False)
        x = cpu_x.detach().clone().to('mps')
        for largest_val in [True, False]:
            if (type(shape) == tuple):
                for curr_dim in range(0, len(shape)):
                    dim_size = shape[curr_dim]
                    for k in range(1, dim_size + 1):
                        topk_values, topk_indices = torch.topk(x, k, dim=curr_dim, largest=largest_val)
                        topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=curr_dim, largest=largest_val)
                        self.assertEqual(topk_values, topk_values_cpu)
                        self.assertEqual(topk_indices, topk_indices_cpu)
            else:
                for k in range(1, shape):
                    topk_values, topk_indices = torch.topk(x, k, dim=0, largest=largest_val)
                    topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=0, largest=largest_val)
                    self.assertEqual(topk_values, topk_values_cpu)
                    self.assertEqual(topk_indices, topk_indices_cpu)

    helper(2)
    helper((5, 1))
    helper((1, 5))
    helper((5, 9, 7, 4))
    helper((50, 20, 7, 4))
```
297c26bb8e/test/test_mps.py (L8054-L8091)

2. Move all pad tests to one standalone class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113313
Approved by: https://github.com/kulinseth
ghstack dependencies: #113312
2023-12-01 06:52:07 +00:00
Li-Huai (Allan) Lin
88a659e752 [MPS] Move non-nll loss tests outside TestNLLLoss (#113312)
The diff looks messy, but this PR essentially does one thing: move non-NLL loss tests in the `TestNLLLoss` class to the `TestMPS` class. After doing so, it ends up having two stack tests with the same name, `test_stack`; therefore, I renamed one of them to `test_stack_storage_offset`, which is what the test actually does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113312
Approved by: https://github.com/kulinseth
2023-12-01 06:52:07 +00:00
Nikita Shulga
1b27eae65e [MPS] Fix out-of-bounds fill to sliced tensor (#114838)
This fixes a regression introduced by https://github.com/pytorch/pytorch/pull/81951 that caused out-of-bounds access when a sliced tensor is filled with zeros.

Remove bogus `TORCH_INTERNAL_ASSERT(length >= offset)` as [NSMakeRange](https://developer.apple.com/documentation/foundation/1417188-nsmakerange?language=objc) arguments are location and length rather than start and end offset.

In `fill_mps_tensor_`:
- Pass `value` argument to `MPSStream::fill`
- Pass `self.nbytes()` rather than `self.storage().nbytes()` as the length of the buffer to fill, as the latter will always result in an out-of-bounds write if the offset within the storage is non-zero.

Add regression test

Fixes https://github.com/pytorch/pytorch/issues/114692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114838
Approved by: https://github.com/atalman, https://github.com/kulinseth
2023-12-01 06:24:42 +00:00
Khushi Agrawal
cff84871ce [reland][opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#114589)
Previous PR: #113885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114589
Approved by: https://github.com/lezcano
2023-11-27 14:45:44 +00:00
PyTorch MergeBot
150aaf46ca Revert "[opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#113885)"
This reverts commit 4fa1ff8404.

Reverted https://github.com/pytorch/pytorch/pull/113885 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its TestCommonCUDA::test_compare_cpu_nn_functional_conv3d test is failing in trunk 4fa1ff8404 ([comment](https://github.com/pytorch/pytorch/pull/113885#issuecomment-1827268473))
2023-11-27 07:33:00 +00:00
Khushi Agrawal
4fa1ff8404 [opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#113885)
Previous PR: https://github.com/pytorch/pytorch/pull/85202

Also, cc'ing @lezcano @kshitij12345 @zou3519, who reviewed my previous PR. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113885
Approved by: https://github.com/lezcano
2023-11-26 13:44:30 +00:00
Nikita Shulga
324cde59b2 [MPS] Fix test_copy_cast_no_leak (#114313)
When running on MacOS-13.2, the test always fails on the first run but succeeds on the second, as presumably it reserves some memory to cache the f32->f16 graph. Make it resilient against such failures by adding a warmup step in which one conversion is performed before recording driver memory utilization.

Fixes https://github.com/pytorch/pytorch/issues/114305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114313
Approved by: https://github.com/huydhn
2023-11-22 14:48:24 +00:00
Nikita Shulga
b5dd37f23e [MPS] Fix memory leak in copy_from_mps_ (#114197)
By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
Leak was introduced by https://github.com/pytorch/pytorch/pull/84928
Add regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL

======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
    method(*args, **kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
    self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory

To execute this test, run the following from the base repo dir:
     python test/test_mps.py -k test_copy_cast_no_leak

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 1.102s

FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK
...
```

Fixes https://github.com/pytorch/pytorch/issues/114096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114197
Approved by: https://github.com/kit1980
2023-11-21 14:52:55 +00:00
Li-Huai (Allan) Lin
538114db65 [MPS] Fix and refactor unary/binary ops with non-zero offset or non-contiguous output (#97085)
Fixes #100764

This PR fixes the unary ops implementation and refactors the binary ops implementation a bit.

For unary ops:
Previously we didn't take into account unary ops that have a non-contiguous/storage-offset output, causing an incorrect result (because the MPS graph kernel always writes the buffer contiguously). Therefore, this PR creates a temporary output tensor for the graph first and then copies the result back to the original output tensor. We currently do not have a better fix than this, I think.

For binary ops, see https://github.com/pytorch/pytorch/pull/97085#discussion_r1140999125

See the added test for repro.
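
A hedged sketch of the previously-broken pattern (names and shapes are illustrative, not the exact test):

```python
import torch

x = torch.randn(4, device="mps")
out = torch.empty(8, device="mps")[4:]  # output with a non-zero storage offset
torch.exp(x, out=out)                   # graph kernel used to write from offset 0
torch.testing.assert_close(out.cpu(), torch.exp(x.cpu()))
```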
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97085
Approved by: https://github.com/malfet
2023-11-14 22:03:21 +00:00
Nikita Shulga
265d6aac0b [MPS] Fix crashes during Conv backward pass (#113398)
By adding the weights tensor to the MPSGraph cache key.
Add a regression test to validate that the collision no longer happens

Fixes https://github.com/pytorch/pytorch/issues/112998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113398
Approved by: https://github.com/kulinseth
2023-11-10 04:29:33 +00:00
Li-Huai (Allan) Lin
740137df6f [MPS] Add bucketize op (#112830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112830
Approved by: https://github.com/kulinseth, https://github.com/malfet
ghstack dependencies: #112829
2023-11-07 17:22:08 +00:00
Li-Huai (Allan) Lin
c4bb77323d [MPS] Add searchsorted op (#112829)
The Metal kernels implemented closely follow `Bucketization.cu`.

Benchmark:
```
[----------------------------- searchsorted ----------------------------]
                                                         |  cpu   |  mps
1 threads: --------------------------------------------------------------
      Batch size: 8; In features: 64; Sorter: True       |    44  |   530
      Batch size: 8; In features: 64; Sorter: False      |    31  |    12
      Batch size: 8; In features: 256; Sorter: True      |   131  |   520
      Batch size: 8; In features: 256; Sorter: False     |   107  |    12
      Batch size: 8; In features: 1024; Sorter: True     |   499  |   590
      Batch size: 8; In features: 1024; Sorter: False    |   398  |    12
      Batch size: 16; In features: 64; Sorter: True      |    71  |   540
      Batch size: 16; In features: 64; Sorter: False     |    57  |    12
      Batch size: 16; In features: 256; Sorter: True     |   242  |   610
      Batch size: 16; In features: 256; Sorter: False    |   200  |    12
      Batch size: 16; In features: 1024; Sorter: True    |   999  |   720
      Batch size: 16; In features: 1024; Sorter: False   |   842  |    12
      Batch size: 32; In features: 64; Sorter: True      |   124  |   509
      Batch size: 32; In features: 64; Sorter: False     |   103  |    12
      Batch size: 32; In features: 256; Sorter: True     |   477  |   650
      Batch size: 32; In features: 256; Sorter: False    |   407  |    12
      Batch size: 32; In features: 1024; Sorter: True    |  1940  |   833
      Batch size: 32; In features: 1024; Sorter: False   |  1710  |    12
      Batch size: 64; In features: 64; Sorter: True      |   231  |   590
      Batch size: 64; In features: 64; Sorter: False     |   194  |    12
      Batch size: 64; In features: 256; Sorter: True     |   937  |   710
      Batch size: 64; In features: 256; Sorter: False    |   800  |    13
      Batch size: 64; In features: 1024; Sorter: True    |  3980  |  1290
      Batch size: 64; In features: 1024; Sorter: False   |  3330  |    12
      Batch size: 128; In features: 64; Sorter: True     |   448  |   650
      Batch size: 128; In features: 64; Sorter: False    |   390  |    13
      Batch size: 128; In features: 256; Sorter: True    |  1830  |   850
      Batch size: 128; In features: 256; Sorter: False   |  1590  |    12
      Batch size: 128; In features: 1024; Sorter: True   |  7790  |  2850
      Batch size: 128; In features: 1024; Sorter: False  |  6670  |    13
```
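
For reference, a minimal usage sketch of the op being benchmarked (not part of the original benchmark):

```python
import torch

boundaries = torch.tensor([1., 3., 5., 7., 9.], device="mps")
values = torch.tensor([2., 6., 10.], device="mps")
torch.searchsorted(boundaries, values)  # tensor([1, 3, 5])
```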

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112829
Approved by: https://github.com/malfet
2023-11-07 17:22:08 +00:00
CaoE
455241bbd3 Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU (#112138)
Add Half support for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112138
Approved by: https://github.com/cpuhrsch
2023-11-06 06:01:29 +00:00
CaoE
26b5e27ace Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU (#112132)
Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112132
Approved by: https://github.com/cpuhrsch
2023-11-05 12:31:38 +00:00
Li-Huai (Allan) Lin
30237aaeec [MPS] Fix bug when value is of complex (#111937)
When the value passed to `fill` is complex, the line `value.toDouble() == 0.0` will error out, saying that converting complex to double will cause overflow. So we should first handle the complex value and only then enter this condition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
2023-10-31 17:50:56 +00:00
CaoE
a310cc8968 Add Half support for kthvalue, cross, hist, and logit on CPU (#112135)
Add Half support for kthvalue, cross, hist, and logit on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112135
Approved by: https://github.com/cpuhrsch
2023-10-31 09:12:47 +00:00
Peter Bell
bbd5b935e4 Use pytree.tree_leaves everywhere (#112324)
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.
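
A before/after sketch of the pattern (using the private `torch.utils._pytree` module):

```python
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": (3,)}
leaves, _ = pytree.tree_flatten(tree)  # before
leaves = pytree.tree_leaves(tree)      # after: [1, 2, 3]
```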

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327, #112323
2023-10-30 03:39:04 +00:00
Cao E
1c89ea7f72 Add Half support for softmax and log_softmax on CPU (#103315)
Add Half support for softmax and log_softmax on CPU.
Note: This introduces a correctness issue with MPS https://github.com/pytorch/pytorch/issues/111416 and https://github.com/pytorch/pytorch/issues/111479.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103315
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/malfet
2023-10-26 08:38:54 +00:00
Peter Bell
46e80ce58a [ATen] Support multi dim any and all reductions (#110310)
This adds a new overload to `all` and `any` with support for multiple reduction dims.
```
all.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
any.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
```
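
A usage sketch of the new overload:

```python
import torch

x = torch.randn(2, 3, 4) > 0
x.any(dim=(0, 2))                # reduces over dims 0 and 2 -> shape (3,)
x.all(dim=(0, 2), keepdim=True)  # shape (1, 3, 1)
```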
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110310
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/justinchuby
2023-10-24 21:33:53 +00:00
Li-Huai (Allan) Lin
4b804dac33 [MPS] Add complex support for fill (#111885)
Fixes #110537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111885
Approved by: https://github.com/malfet
2023-10-24 06:41:10 +00:00
CaoE
4b324a8717 Add Half support for aminmax on CPU (#106853)
Add Half support for aminmax on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106853
Approved by: https://github.com/cpuhrsch
2023-10-23 17:43:47 +00:00
CaoE
d1afb7d43d add Half support for multinomial on CPU (#104178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5, https://github.com/kulinseth, https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
CaoE
2a40b7efcb Add Half support for addcmul, addcdiv, cumsum, and topk on CPU (#103319)
Add Half support for addcmul, addcdiv, cumsum, and topk on CPU.
Note: This PR will introduce the issue  https://github.com/pytorch/pytorch/issues/111454.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103319
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-10-19 17:47:45 +00:00
CaoE
8713a1a363 add Half support for bernoulli on CPU (#104176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00
Kurt Mohler
5292a92e03 Add torch.unravel_index (#110580)
Fixes #35674
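
A usage sketch of the new op:

```python
import torch

flat = torch.tensor([1, 5, 7])
rows, cols = torch.unravel_index(flat, (2, 4))
# rows: tensor([0, 1, 1]); cols: tensor([1, 1, 3])
```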

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110580
Approved by: https://github.com/lezcano, https://github.com/kulinseth
2023-10-12 00:55:51 +00:00
igm503
95ff51d8ed [MPS] Add support for Softshrink to MPS Backend (#110814)
Adds the softshrink activation function to the mps backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110814
Approved by: https://github.com/kulinseth
2023-10-11 07:55:39 +00:00
igm503
4b881b0da3 [MPS] add support for sgn to MPS backend (#110829)
Fixes #86805

Adds support for sgn to MPS backend.

Notes:

1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.

2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet's recent contributions, however, @Berzeg's implementation works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for mps so that the backward function works.
3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the PR (he seems to be AWOL since last working on this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
2023-10-09 16:53:25 +00:00
vfdev-5
d2a2a67fa4 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-09 10:55:56 +00:00
igm503
a389181f2e [MPS] add support for aten::nextafter (#109685)
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1722515591

Adds support for aten::nextafter to the MPS backend. Supports float and half types.

Notes:
- I've added nextafter to the output_grad_check XFAILLIST since neither this nor the CPU implementation has a grad function
- Metal Shading Language 3.1 seems to have a native nextafter() function, so once that's available, this kernel can just call that.
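
A minimal usage sketch (illustrative):

```python
import torch

a = torch.tensor([1.0], device="mps")
b = torch.tensor([2.0], device="mps")
torch.nextafter(a, b)  # smallest representable float greater than 1.0
```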
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109685
Approved by: https://github.com/kulinseth
2023-10-03 19:20:22 +00:00
PyTorch MergeBot
df3ab70dde Revert "Added new test sample to interpolate op in OpInfo (#104181)"
This reverts commit 87f8bc65f8.

Reverted https://github.com/pytorch/pytorch/pull/104181 on behalf of https://github.com/peterbell10 due to Causing OOM in slow-gradcheck ([comment](https://github.com/pytorch/pytorch/pull/104181#issuecomment-1745472323))
2023-10-03 18:07:02 +00:00
vfdev-5
87f8bc65f8 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-02 15:35:48 +00:00
CaoE
9399e0b1ff add fp16 support for gemm (#99498)
### Testing

Native matmul vs. mkldnn matmul  on SPR (with avx512_fp16 support)

single core:

Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401

56 cores:
Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 |  18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849  | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 |  7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128  | 2032.523 | 0.705 | 2882.23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99498
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-09-28 01:03:50 +00:00
Li-Huai (Allan) Lin
ac1e85161e [MPS] Fix nll_loss with default ignore_index (#109574)
`-100` should be a valid `ignore_index` as indicated in the linked issue. This PR also cleans up some unnecessary MPSTensor copies.
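
A hedged sketch of the previously-failing case:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5, device="mps")
target = torch.tensor([1, -100, 2], device="mps")  # -100 entries are ignored
F.nll_loss(F.log_softmax(logits, dim=1), target, ignore_index=-100)
```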

Fixes #108148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109574
Approved by: https://github.com/kulinseth
ghstack dependencies: #109557
2023-09-26 04:13:09 +00:00
Li-Huai (Allan) Lin
0087118997 [MPS] Fix mps to cpu copy with storage offset (#109557)
Fix #108978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109557
Approved by: https://github.com/DenisVieriu97
2023-09-26 04:13:08 +00:00
CaoE
7c9052165a add fp16 support for native conv and deconv on CPU (#99497)
### Testing

Native conv vs. mkldnn conv on SPR (with avx512_fp16 support)

Single core:

Input | Naïve impl   / us | oneDNN /   us | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 34676789 | 524199.8 | 66.15185
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 33454125 | 349844.4 | 95.62573
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 317650.1 | 2317.677 | 137.0554
IC: 128, OC: 256, kernel: 3, stride: 1,   N: 1, L: 64 | 15334.68 | 167.264 | 91.67952

56 cores:
Input | Naïve impl   / us | oneDNN /   us | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 1032064 | 11073.58 | 93.20061
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 1000097 | 16371.19 | 61.08883
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 981813.4 | 9008.908 | 108.9825
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 1082606 | 10150.47 | 106.6558
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 319980.6 | 181.598 | 1762.027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99497
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-09-25 01:31:26 +00:00
igm503
255d1a776a [MPS] Add support for Mish to MPS backend (#109786)
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1712894444

Adds the mish activation function to the mps backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109786
Approved by: https://github.com/kulinseth
2023-09-21 21:01:20 +00:00
igm503
0317626df5 [MPS] adding weight_norm_interface support for mps (#108008)
Fixes #104513

Adds support for aten::_weight_norm_interface to the mps backend.

Also adds a consistency test for the output and the grad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108008
Approved by: https://github.com/kulinseth
2023-09-20 02:18:28 +00:00
CaoE
54c28c564f add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/mingfeima
2023-09-19 10:43:33 +00:00
PyTorch MergeBot
be9f73f031 Revert "Add meta and OpInfo for _embedding_bag_dense_backward (#109211)"
This reverts commit fe14e43d14.

Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing 492a93d185 https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918))
2023-09-14 22:29:12 +00:00
Edward Z. Yang
fe14e43d14 Add meta and OpInfo for _embedding_bag_dense_backward (#109211)
The sample inputs are a bit involved because there are a lot of
shenanigans in the derivative formula. Check the comments.

This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-09-14 18:49:32 +00:00
PyTorch MergeBot
b226373d16 Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit b6a1d3fb97.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to I'm very sorry but it looks like #106543 was not fixed, I still see it failing on main b6a1d3fb97 https://github.com/pytorch/pytorch/actions/runs/6185704949/job/16793975677 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1719747065))
2023-09-14 16:13:34 +00:00
CaoE
b6a1d3fb97 add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-14 12:23:59 +00:00
PyTorch MergeBot
04a765f95d Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit 6065e7a97c.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to sorry it looks like this is causing an unexpected success for `test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_nn_functional_batch_norm_cpu_float16` 6065e7a97c https://github.com/pytorch/pytorch/actions/runs/6178069462/job/16770849782 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1718402208))
2023-09-13 22:38:42 +00:00
Nikita Shulga
916183a012 [MPS] Fix crash if nonzero is called concurrently (#108996)
Surrounds the `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a no-op for a single-threaded program but serializes synchronize calls across threads using the same stream.

Prevents the non-recoverable `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` exception, which is triggered every time one uses PyCharm to inspect tensors on an MPS device

Fixes https://github.com/pytorch/pytorch/issues/100285
### <samp>🤖 Generated by Copilot at 1662ce2</samp>

> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
2023-09-13 19:28:47 +00:00
CaoE
6065e7a97c add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-13 17:30:16 +00:00
igm503
1b9b3a2d15 [MPS] Adding lgamma, digamma, and polygamma implementations (#106292)
Fixes issue mentioned in #77764

e.g. https://github.com/pytorch/pytorch/issues/77764#issuecomment-1654111744

Adds MPS support for the following ops:

- lgamma
- mvlgamma
- digamma
- polygamma

The lgamma function did not previously have an MPS backend implementation. I've added one using a custom Metal kernel (following John D. Cook's C++ implementation of the log gamma function: https://www.johndcook.com/blog/cpp_gamma/). For the backward pass op, I've added a digamma kernel that follows the cpu+cuda digamma implementation, and for the backward pass of the digamma op, I've added a polygamma + trigamma kernel following, again, the cpu+cuda implementations.

NOTE:

The cpu implementation of the polygamma function incorrectly (as far as I can tell) outputs a finite number for order = 1 and x at the negative integers. The mps implementation correctly outputs infinity. (see https://github.com/pytorch/pytorch/issues/106692)

The polygamma tests currently don't pass because of the error in the cpu+cuda kernels, but also because there are smallish discrepancies near the negative integers between the cpu+cuda and the mps polygamma and trigamma kernels. I'm not sure exactly why this is, but let me know if the discrepancies are too big.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106292
Approved by: https://github.com/kulinseth
2023-09-12 16:43:37 +00:00
Li-Huai (Allan) Lin
293d3b89d8 Add Opinfos for the Tensor overload of linspace/logspace (#107958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107958
Approved by: https://github.com/zou3519
2023-09-11 22:30:19 +00:00
Nikita Shulga
9b12a28d89 [MPS] Implement mul operation for complex types (#108395)
Using existing BinaryKernel template

Add `mul` as well as `kron` and `outer` to list of MPS ops that support complex types

This should add all the missing ops mentioned in https://github.com/pytorch/pytorch/issues/105665
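
A quick sketch exercising the newly supported ops:

```python
import torch

a = torch.tensor([1 + 2j, 3 - 1j], device="mps")
b = torch.tensor([2 - 1j, 1 + 1j], device="mps")
a * b              # complex elementwise mul
torch.outer(a, b)  # also newly listed as supported
```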
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108395
Approved by: https://github.com/albanD
ghstack dependencies: #108393, #108394
2023-09-10 05:39:12 +00:00
Nikita Shulga
c7bb842d35 [MPS] Add complex add/sub (#108394)
Using `view_as_real` and running elementwise ops in resulted tensors
Add `add` and `sub` to list of complex ops that should work on MPS
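
The `view_as_real` trick, as a sketch of the approach (not the actual kernel code):

```python
import torch

a = torch.randn(3, dtype=torch.complex64)
b = torch.randn(3, dtype=torch.complex64)
# complex add expressed through real views: real and imaginary parts add independently
out = torch.view_as_complex(torch.view_as_real(a) + torch.view_as_real(b))
torch.testing.assert_close(out, a + b)
```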
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108394
Approved by: https://github.com/albanD
ghstack dependencies: #108393
2023-09-10 05:39:12 +00:00
Nikita Shulga
53a4ca4b58 [MPS][BE] Add dispatch_sync_with_rethrow (#108393)
And enable testing for match_output for complex types.
Most of them should throw an "unsupported XYZ" error, rather than crash.
This fixed several crashes when linalg ops were invoked with complex inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108393
Approved by: https://github.com/kit1980, https://github.com/kulinseth
2023-09-10 02:07:12 +00:00
alexdremov
b60273b88a [MPS] Pixel shuffle unshuffle support (#99306)
Fixes #83196

Now the MPS implementation is blazingly fast.

Though I have several questions on improving this PR:

1. I copied code from `test_nn.py`. Is there a better way to test this?
2. I decided to use `usepixelshuffleorder:YES`. Am I right performance-wise? According to the docs:
```
`usePixelShuffleOrder` can be
used to control how the data within spatial blocks is ordered in the
`depthAxis` dimension: with `usePixelShuffleOrder=YES` the values within the
spatial blocks are stored contiguosly within the `depthAxis` dimension whereas
otherwise they are stored interleaved with existing values in the `depthAxis` dimension.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99306
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-09-06 09:11:39 +00:00
CaoE
42f94d7e9f add Half support for maxpool on CPU (#98819)
### Testing
Single socket (28 cores):

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 4.12895 | 6.9669 | 5.30297 | 0.55775 | 1.98917 | 0.72233
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 0.85093 | 1.88813 | 1.38063 | 5.5742 | 36.5086 | 10.58552
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 22.37212 | 37.90383 | 30.94482 | 6.85868 | 10.6116 | 3.9993
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 5.41658 | 4.71098 | 4.66578 | 6.69875 | 14.7171 | 5.1167
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 10.69831 | 18.0468 | 13.71657 | 2.61192 | 4.96172 | 1.68635
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.52637 | 2.0096 | 2.0055 | 2.60314 | 7.2093 | 2.49843
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.47605 | 0.88398 | 0.65326 | 0.06525 | 0.115489 | 0.0674
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.10902 | 0.25293 | 0.157475 | 0.11386 | 0.53319 | 0.17836

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 90.9809 | 163.473 | 126.1276 | 6.57721 | 41.40833 | 11.82505
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 9.88405 | 38.39137 | 29.62069 | 7.10636 | 36.97535 | 11.0525
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 476.782 | 855.4769 | 648.2248 | 46.6488 | 219.2586 | 67.10599
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 80.29271 | 91.33854 | 87.80345 | 48.81692 | 203.9974 | 63.39004
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 235.2113 | 419.0799 | 315.4284 | 20.6049 | 107.1524 | 32.39169
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 29.47653 | 33.54905 | 32.82823 | 22.59674 | 98.5586 | 30.05763
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 7.90684 | 13.9208 | 10.03272 | 0.23725 | 1.35269 | 0.41728
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.33638 | 3.36894 | 2.64635 | 0.26535 | 1.244 | 0.38895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98819
Approved by: https://github.com/mingfeima, https://github.com/mikaylagawarecki
2023-09-05 18:23:41 +00:00
Nikita Shulga
bae409388c [MPS] Fix .item() for multi-dim scalar (#107913)
By refactoring `_local_scalar_dense_mps` to use `_empty_like` to allocate the CPU tensor.
Also, print a more reasonable error message when the dst dim is less than the src in `mps_copy_`

This fixes regression introduced by https://github.com/pytorch/pytorch/pull/105617 and adds regression test.

### <samp>🤖 Generated by Copilot at abd06e6</samp>

> _Sing, O Muse, of the valiant deeds of the PyTorch developers_
> _Who strive to improve the performance and usability of tensors_
> _And who, with skill and wisdom, fixed a bug in the MPS backend_
> _That caused confusion and dismay to many a user of `item()`_

Fixes https://github.com/pytorch/pytorch/issues/107867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107913
Approved by: https://github.com/albanD
2023-08-31 21:08:29 +00:00
vfdev
b7624fc91e Cleaned up test_mps.py::test_output*_match (#108092)
Description:
- cleaned up test_mps.py::test_output_match and test_mps.py::test_output_grad_match tests
  - removed unused variables and useless brackets
  - simplified atol/rtol setup if/else code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108092
Approved by: https://github.com/kulinseth
2023-08-29 10:46:02 +00:00
Nikita Shulga
6e85a68829 [MPS] Implement polar via metal shader (#107324)
Use `view_as_real` to cast complex into a pair of floats and then it becomes just another binary operator.

Enable `polar` and `view_as_complex` consistency tests, but skip `test_output_grad_match_polar_cpu` as the `mul` operator is not yet supported.

Remove redundant `#ifdef __OBJC__` and re-throw exceptions captured during the `createCacheBlock` block.
Fixes https://github.com/pytorch/pytorch/issues/78503

TODOs(in followup PRs):
  - Implement backwards (requires complex mul and sgn)
  - Measure the perf impact of computing the strides on the fly rather than ahead of time (unrelated to this PR)

Partially addresses https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107324
Approved by: https://github.com/albanD
2023-08-25 03:16:23 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
arunppsg
4bfc55ba8b [MPS] Enable forward test for renorm (#106666)
Enabled forward test for renorm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106666
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-08-17 16:46:06 +00:00
Jason Lu
bc88028e8e Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743)
Summary:
Original commit changeset: 81319beb97f3

Original Phabricator Diff: D47961182

Test Plan: revert to maintain backward compat with legacy ads_dper3 production package. Read details in: S357822

Reviewed By: atuljangra

Differential Revision: D48131623

@diff-train-skip-merge
(D48131623 landed internally)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106743
Approved by: https://github.com/malfet
2023-08-08 15:27:34 +00:00
Ramin Azarmehr
cdfd0ea162 [MPS] Introduce torch.mps.Event() APIs (#102121)
- Implement `MPSEventPool` to recycle events.
- Implement Python bindings with the `torch.mps.Event` class using the MPSEventPool backend. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()`.
- Add API to measure elapsed time between two event recordings.
- Added documentation for Event class to `mps.rst`.
- Added test case to `test_mps.py`.
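
A usage sketch of the new API (details assumed from the description above, not a definitive reference):

```python
import torch

start = torch.mps.Event(enable_timing=True)
end = torch.mps.Event(enable_timing=True)

start.record()
x = torch.randn(1024, 1024, device="mps") @ torch.randn(1024, 1024, device="mps")
end.record()
torch.mps.synchronize()
print(start.elapsed_time(end))  # elapsed time between the two recordings
```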

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-08-08 03:45:45 +00:00
Li-Huai (Allan) Lin
d4d086ce7b [MPS] Fix Clamp with strided outputs/inputs (#97858)
Fixes #94396
Fixes #87348

1. If output is strided, we don't gather input tensors.
2. If output is not strided but min_t or max_t is strided, we make min_t or max_t contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97858
Approved by: https://github.com/kulinseth
2023-08-04 09:32:12 +00:00
Peter Stefek
c9c2b14c53 Fix copy_ broadcast behavior on mps (#105617)
Fixes #105277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105617
Approved by: https://github.com/malfet
2023-08-03 04:03:32 +00:00
PyTorch MergeBot
d83b887f2a Revert "Add error checking for padding modules (#106147)"
This reverts commit 0547b6279d.

Reverted https://github.com/pytorch/pytorch/pull/106147 on behalf of https://github.com/jeanschmidt due to sadly it is breaking internal builds, and I can't coordinate a FF due to timezone differences ([comment](https://github.com/pytorch/pytorch/pull/106147#issuecomment-1661870970))
2023-08-02 09:37:40 +00:00
Denis Vieriu
d1a2aa1909 [MPS] Fix MPS clamp issue with different dtypes between input and min/max tensors (#105747)
- Fix the FP16 clamp issue (FP32 and FP16 are not broadcast compatible)
- Fix clamp (cached graph nodes were previously replaced with the cast version)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105747
Approved by: https://github.com/kulinseth
2023-08-02 02:51:34 +00:00
Peter Stefek
97e5055a69 Add cumprod support for device mps (#104688)
Related to #77764

Add support for the cumprod operation (which in turn allows its gradient). This also allows us to compute the gradient of prod since it was blocked behind cumprod in the case where exactly one element of the tensor was 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104688
Approved by: https://github.com/kulinseth
2023-08-01 21:51:20 +00:00
Mikayla Gawarecki
0547b6279d Add error checking for padding modules (#106147)
Fixes https://github.com/pytorch/pytorch/issues/105627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106147
Approved by: https://github.com/albanD
ghstack dependencies: #106325
2023-08-01 12:49:58 +00:00
Mikayla Gawarecki
d8e5f2aa6d Reland "Make adding buffers more like adding parameters (#104069)" (#106224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106224
Approved by: https://github.com/atalman, https://github.com/albanD
2023-07-31 17:18:56 +00:00
cyy
b8eb827d93 use UBSAN on some tests (#103655)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103655
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-07-24 14:24:49 +00:00
Peter Pham
bba06ad751 [MPS] aten::erfinv metal kernel ops (#101507)
I've added the implementation of erfinv using the algorithm from 4154c8ea15/aten/src/ATen/native/Math.h (L152) in order for the MPS-based algorithm to match the CPU automatic test. This PR uses the new Metal API calls from https://github.com/pytorch/pytorch/pull/100661

Testing shows MPS has a decent speedup (270x) compared to CPU on a tensor of 100 million elements.
```
import torch
x = torch.arange(-1, 1, 1e-8) # default cpu tensor
#measure CPU compute time by calling torch.erfinv
time = %timeit -o -q -r 5 torch.erfinv(x)
cpu_time = time.average
print("CPU torch.erfinv time: ", cpu_time)
x = x.to("mps")
# measure MPS compute time
time = %timeit -o -q -r 5 torch.erfinv(x)
mps_time = time.average
print("MPS torch.erfinv time: ", mps_time)
print(f"MPS torch.erfinv is {cpu_time/mps_time*100} percent faster than CPU torch.erfinv")

# compute MSE between MPS and CPU torch.erfinv
x = x.to("cpu")
y_cpu = torch.erfinv(x)
x = x.to("mps")
y_mps = torch.erfinv(x)
y_mps = y_mps.to("cpu")
mask = torch.isfinite(y_cpu) & torch.isfinite(y_mps.to("cpu"))
y_mps = y_mps[mask]
y_cpu = y_cpu[mask]
x = x[mask]
print(f"length of y_mps: {len(y_mps)}, length of y_cpu: {len(y_cpu)}, length of x: {len(x)}")
mse = torch.square(y_cpu - y_mps).mean()
print("MSE between MPS and CPU torch.erfinv: ", mse)
diff = torch.abs(y_cpu - y_mps)
print("Largest difference")
print(f"x:  {x[torch.argmax(diff)]}, y_cpu: {y_cpu[torch.argmax(diff)]}, y_mps: {y_mps[torch.argmax(diff)]} , diff = {y_cpu[torch.argmax(diff)] - y_mps[torch.argmax(diff)]}")
```
CPU torch.erfinv time:  2.654937833400254
MPS torch.erfinv time:  0.009831255332002912
MPS torch.erfinv is 27005.07456822776 percent faster than CPU torch.erfinv
length of y_mps: 199999992, length of y_cpu: 199999992, length of x: 199999992
MSE between MPS and CPU torch.erfinv:  tensor(4.2339e-14)
Largest difference
x:  -0.9999980330467224, y_cpu: -3.363569736480713, y_mps: -3.3635685443878174 , diff = -1.1920928955078125e-06

Fixes https://github.com/pytorch/pytorch/issues/86808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101507
Approved by: https://github.com/kulinseth
2023-07-23 01:36:43 +00:00
Jane Xu
803d42e457 add lerp cpu support for half (#105607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105607
Approved by: https://github.com/albanD
2023-07-21 20:29:05 +00:00
Andrey Talman
c6653b65d8 Back out "Make adding buffers more like adding parameters (#104069)" (#105581)
Summary:
D47537831 is breaking pyper tests: https://fb.workplace.com/groups/802176577445480/posts/1018902842439518/

with `TypeError: register_buffer() takes 3 positional arguments but 4 were given`

Original commit changeset: d4b4069fbd38

Original Phabricator Diff: D47537831

Test Plan:
```
buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_inline_cvr_infer_pyper_pyper__canary_offline_training-launcher -- --run-harness-in-tupperware --build-fbpkg ads_dper3 --build-fbpkg training_platform
```

Reviewed By: atalman

Differential Revision: D47600140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105581
Approved by: https://github.com/mikaylagawarecki
2023-07-20 03:39:53 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Peter Stefek
d2c24eca8a Fix mps unary op issue on non densely stored tensors (#105512)
This PR fixes a bug where non-densely-stored tensors were not converted to dense tensors of the correct scalar type in the MPS `unary_op` helper function

Fixes https://github.com/pytorch/pytorch/issues/105284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105512
Approved by: https://github.com/malfet
2023-07-19 03:56:38 +00:00
Nikita Shulga
8cd94e1eab [MPS] Add lerp implementation (#105470)
lerp.Scalar fits very well into the binary op template
Add a very naive implementation for `lerp.Tensor` as `add_out(self, weights.mul(end.sub(self)))`
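
The naive `lerp.Tensor` decomposition, written out as a checkable identity (sketch):

```python
import torch

start = torch.randn(4, device="mps")
end = torch.randn(4, device="mps")
w = torch.rand(4, device="mps")
ref = start + w * (end - start)  # i.e. add_out(self, weights.mul(end.sub(self)))
torch.testing.assert_close(torch.lerp(start, end, w), ref)
```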

Enable `lerp` testing in `test_mps`

Fixes https://github.com/pytorch/pytorch/issues/105382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105470
Approved by: https://github.com/albanD
2023-07-18 20:01:04 +00:00
ekamiti
32d422f335 Make adding buffers more like adding parameters (#104069)
Add semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, as it just leads to `register_buffer` being called. This new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.

Fixes #35735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:59:05 +00:00
David Radley
17250976f3 correct empty tensor mps all operation (#105218)
Fixes #104694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105218
Approved by: https://github.com/ezyang, https://github.com/kulinseth
2023-07-14 17:42:54 +00:00
albanD
08cbfb2a58 Avoid tensor creation and use scalar overload (#104264)
I would expect this preserves the behavior but there might be weird edge cases?
@mruberry might know?

The aim is to fix https://github.com/pytorch/pytorch/pull/104254 (and make `1 ** t` capturable via cudagraph)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104264
Approved by: https://github.com/zou3519
2023-07-12 18:11:27 +00:00
Nikita Shulga
5e4ee15e85 [MPS] Fix unique flatten logic (#104938)
Tensor must be flattened if dim is None before checking whether or not the dim dimension is already None

Fixes https://github.com/pytorch/pytorch/issues/104879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104938
Approved by: https://github.com/albanD
2023-07-11 19:55:56 +00:00
soulitzer
91dcc3b272 Fix activation checkpoint for mps (#104787)
Fixes https://github.com/pytorch/pytorch/issues/104478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104787
Approved by: https://github.com/albanD
2023-07-08 14:57:05 +00:00
Jerry Zhang
611febf6cf [quant] Support integer implementations for max_pool2d (#104225)
Summary:
This is needed for representing a quantized model in the pt2 export quantization flow

Test Plan:
tested by opinfo, python test/test_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104225
Approved by: https://github.com/kimishpatel
2023-07-05 23:54:07 +00:00
Nikita Shulga
01e6d64dd2 [MPS] Fix unary ops over sparse-mapped tensors (#100765)
If the input tensor is backed by a sparse view, create a dense copy before running the unary op; otherwise the op will be applied to the wrong elements.
Introduce `is_dense_in_storage`, which returns true if the tensor/view is mapped to a dense area in the tensor storage.
Add unit test to validate the fix.
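
A hedged illustration of a view that is not dense in storage:

```python
import torch

base = torch.randn(4, 4, device="mps")
view = base[:, ::2]    # strided view: maps to a non-dense area of the storage
out = torch.exp(view)  # previously the op was applied to the wrong elements
torch.testing.assert_close(out.cpu(), torch.exp(view.cpu()))
```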

Fixes https://github.com/pytorch/pytorch/issues/98074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100765
Approved by: https://github.com/albanD
2023-07-05 23:17:43 +00:00
Denis Vieriu
28720ad585 Fix argmax and argmin clamp value on MPS (#104374)
Replace the `LLONG_MAX` clamp value with the largest integer value that can be stored in a double. `constantWithScalar` takes a `double` value as input, and `LLONG_MAX` does not fit in a double, resulting in failures on x86.
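
An illustration of why `LLONG_MAX` does not survive the round trip through `double`:

```python
LLONG_MAX = 2**63 - 1
print(float(LLONG_MAX) == LLONG_MAX)  # False: rounds up to 2.0**63
print(int(float(LLONG_MAX)))          # 9223372036854775808, one past the int64 max
```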

Fixes https://github.com/pytorch/pytorch/issues/98191, https://github.com/pytorch/pytorch/issues/92311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104374
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-06-30 18:11:49 +00:00
cyy
54cb61f7d9 enable ASAN on some tests (#103647)
Enable more tests on ASAN; meanwhile we disable float-divide-by-zero and float-cast-overflow. Both are disabled because they are also disabled by default in the latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
 but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
2023-06-28 02:17:14 +00:00
magic-akari
e56cdfd74b [MPS] Handle deserialization more permissively (#98834)
MPS deserialization should handle `mps:0`.
It can be generated by code like the following

```python
torch.rand(size=(3, 4)).to("mps")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98834
Approved by: https://github.com/kulinseth, https://github.com/kit1980, https://github.com/malfet
2023-06-15 15:51:03 +00:00
Pearu Peterson
45401ef745 Enable float16 and complex32 support for sparse CSR elementwise multiplication operation. (#100394)
As in the title. In addition, the PR adds float16 addcmul support for CPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100394
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-06-14 14:42:39 +00:00
Li-Huai (Allan) Lin
cce58a43c9 [MPS] Fix softplus with f16 input (#101948)
Fixes #101946
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101948
Approved by: https://github.com/malfet
2023-05-31 00:40:10 +00:00
ecao
3f4fee735a add Half support for logsigmoid, threshold, elu, gelu, hardtanh, hardsigmoid, hardswish, hardshrink, softshrink, leakyrelu, softplus, glu, silu, mish, and prelu on CPU (#98745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98745
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/ngimel
2023-05-27 16:20:21 +00:00
Li-Huai (Allan) Lin
0db704d240 [OpInfo] Add multi_head_attention_forward (#100153)
### <samp>🤖 Generated by Copilot at 8f8d620</samp>

This pull request improves the testing of the `nn.functional.multi_head_attention_forward` function by adding it to the `OpInfo` framework, adjusting the tolerance and skipping criteria for some test cases, and restricting the dtype for the `MetaProgrammingSystem` tests. These changes aim to address the randomness and numerical precision issues of the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100153
Approved by: https://github.com/drisspg
2023-05-26 01:58:17 +00:00
Denis Vieriu
de7ec2ddd7 [MPS] Allow saved models to be loaded directly to MPS through torch.jit.load (#102204)
### <samp>🤖 Generated by Copilot at 94eed69</samp>

This pull request adds support for serializing and deserializing tensors on the `mps` device using JIT. It includes a test case in `test/test_mps.py` and device handling logic in `torch/csrc/jit/serialization/unpickler.cpp`.

Fixes https://github.com/pytorch/pytorch/issues/88820, https://github.com/pytorch/pytorch/issues/87504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102204
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-25 23:32:29 +00:00
Li-Huai (Allan) Lin
02a7318a5b [MPS] Add aminmax op (#101691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101691
Approved by: https://github.com/malfet
2023-05-23 18:01:34 +00:00
Li-Huai (Allan) Lin
330c907301 [MPS] Fix embedding cache key (#101857)
Fixes #101198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101857
Approved by: https://github.com/kulinseth
2023-05-21 06:11:25 +00:00
Aaron Gokaslan
3e2ea32dab [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
Removes useless try statements and unreachable code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874
Approved by: https://github.com/malfet
2023-05-19 17:30:52 +00:00
Khushi
1aaf0396eb [reland][opinfo] empty_strided (#101782)
Follows #100223

Previous PR: #100890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101782
Approved by: https://github.com/ezyang
2023-05-19 03:06:29 +00:00
PyTorch MergeBot
dfac4364c4 Revert "[opinfo] empty_strided (#100890)"
This reverts commit 01c7106580.

Reverted https://github.com/pytorch/pytorch/pull/100890 on behalf of https://github.com/PaliC due to broke test_ops.py slow test ([comment](https://github.com/pytorch/pytorch/pull/100890#issuecomment-1551903975))
2023-05-17 19:00:15 +00:00
Li-Huai (Allan) Lin
bb3558961f [MPS] Add histogram ops (#96652)
Adds `torch.histc`, `torch.histogram`, `torch.histogramdd`
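
Usage sketch of the added ops:

```python
import torch

x = torch.randn(1000, device="mps")
torch.histc(x, bins=10, min=-3, max=3)       # counts only
counts, edges = torch.histogram(x, bins=10)  # counts plus bin edges
```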

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96652
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-17 01:25:43 +00:00
Khushi
01c7106580 [opinfo] empty_strided (#100890)
Follows: #100223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100890
Approved by: https://github.com/ezyang
2023-05-15 23:39:39 +00:00
Nikita Shulga
9e089db32e [MPS] Enable arange for int8 and uint8 dtypes (#101303)
Not sure why it was not enabled previously.
Sort types in `AT_DISPATCH_MPS_TYPES` by group (floats first, then integers) and size.
Tested implicitly in `test_bernoulli`.

### <samp>🤖 Generated by Copilot at 80c7ed7</samp>

> _`Char` and `Byte` types_
> _MPS can dispatch them now_
> _Winter of tensors_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101303
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/kulinseth
2023-05-13 01:19:08 +00:00
Ramin Azarmehr
0be53d83fc [MPS] Add support for MPSProfiler Python bindings (#101002)
- Added torch.mps.profiler.[start() and stop()] APIs with RST documentation
- Added test case in test_mps
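
A usage sketch (argument names assumed; see the RST docs added in this PR):

```python
import torch

torch.mps.profiler.start(mode="interval", wait_until_completed=False)
y = (torch.randn(1024, device="mps") ** 2).sum()
torch.mps.profiler.stop()
```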
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101002
Approved by: https://github.com/malfet
2023-05-12 21:55:34 +00:00
Sun, Jiayi
d56e1b2f67 add Half support for unary ops on CPU (#98493)
Add Half support for log_sigmoid and some unary ops on CPU, including sinc, acosh, asinh, atanh, digamma, trigamma, rsqrt, acos, asin, atan, ceil, cos, erf, erfc, erfinv, exp, expm1, floor, log, log10, log1p, log2, i0, round, sin, sqrt, tan, tanh, trunc, lgamma.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98493
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/ngimel
2023-05-12 04:52:34 +00:00
Nikita Shulga
b7bf953bbc [MPS] Fix bernoulli for int types (#100946)
### <samp>🤖 Generated by Copilot at 069fd23</samp>

This pull request enhances the MPS implementation of random operations in `Distributions.mm` and adds more dtype tests for the bernoulli distribution in `test_mps.py`. This improves the performance, correctness, and usability of the MPS backend for PyTorch.

Fixes https://github.com/pytorch/pytorch/issues/100717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100946
Approved by: https://github.com/kulinseth
2023-05-11 23:52:38 +00:00
Nikita Shulga
87084643e5 [CI][MPS] Actually make grid_sampler_2d available (#101108)
In CI an older MacOS SDK can be used to compile the binary, so add a guard for the availability of the `MPSGraphResizeNearestRoundingModeRoundToEven` enum value.
MPS feature availability checks are deliberately done at runtime (by using `is_macos_13_or_newer` and forward-declaring methods in `MPSGraphVenturaOps.h`) rather than at compile time (by using `#ifdef`s).

Modify the error message and XFAIL condition in `test_mps.py` so the test fails due to the missing conditional on macOS-13.2 or newer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101108
Approved by: https://github.com/kulinseth
2023-05-11 10:35:09 +00:00
Khushi
51fe53e619 [opinfo] item (#100313)
Follows #100223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100313
Approved by: https://github.com/ezyang
2023-05-10 11:32:45 +00:00
Ramin Azarmehr
cecfcf1e17 [MPS] Handle MPS failures of test_modules.py in common_modules.py (#95334)
- Also cleaned up the skipMPS code in `test_modules.py`.
- Added `skipMPS` for unsupported or failing tests on the MPS backend in common_modules.py.
   (We'll remove `skipMPS` from those tests once a fix is available for them.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95334
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-05-09 03:55:16 +00:00
Li-Huai (Allan) Lin
3b6a7f4d51 [MPS] Fix index_put with deterministic algorithm enabled (#97660)
Prevent the use of parallel computation when the deterministic algorithms setting is enabled.

Fixes #97574

Benchmark:
```
[--------------- index_put_ Deterministic Algorithm Enabled ---------------]
                                                              |  cpu  |  mps
1 threads: -----------------------------------------------------------------
      Dtype: torch.float32 Features: 1024; Num Indices: 512   |   37  |   49
      Dtype: torch.float32 Features: 1024; Num Indices: 1024  |   54  |   50
      Dtype: torch.float32 Features: 1024; Num Indices: 2048  |   86  |   50
      Dtype: torch.float32 Features: 1024; Num Indices: 4096  |  150  |   49

Times are in microseconds (us).

[-------------- index_put_ Deterministic Algorithm Disabled ---------------]
                                                              |  cpu  |  mps
1 threads: -----------------------------------------------------------------
      DType: torch.float32 Features: 1024; Num Indices: 512   |   37  |   49
      DType: torch.float32 Features: 1024; Num Indices: 1024  |   53  |   49
      DType: torch.float32 Features: 1024; Num Indices: 2048  |   86  |   49
      DType: torch.float32 Features: 1024; Num Indices: 4096  |  147  |   50

Times are in microseconds (us).
```

### <samp>🤖 Generated by Copilot at ebf2ff3</samp>

Added a deterministic version of `index_put` for MPS tensors that runs on a single thread and can be enabled by a global context flag. Refactored the existing `index_put` function and the kernel selection logic to support both parallel and serial modes. Added a test function to verify the deterministic behavior of `index_put` under different conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97660
Approved by: https://github.com/kulinseth
2023-05-08 00:57:29 +00:00
Kulin Seth
e20c94bda9 [MPS] Add the test for 5D in test_mps which is skipped. (#99271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99271
Approved by: https://github.com/DenisVieriu97
2023-05-05 22:57:06 +00:00
Li-Huai (Allan) Lin
13da6585b6 [MPS] Skip all empty ops tests (#100368)
Fixes #100175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100368
Approved by: https://github.com/kulinseth
2023-05-02 00:43:58 +00:00
Li-Huai (Allan) Lin
a50fb50c51 [MPS] Fix exception regex not compared (#100367)
Previously, when using `self.assertRaisesRegex` to test a raised exception and its regex, the regex wasn't actually compared because mps was not in the `NATIVE_DEVICES`. This PR fixes that by enabling exception regex comparisons for the mps device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100367
Approved by: https://github.com/albanD
2023-05-02 00:43:58 +00:00
Nikita Shulga
2442858f52 [MPS] Fix layer_norm_backward_mps key (#100295)
Followup after https://github.com/pytorch/pytorch/pull/98794
See report in https://github.com/pytorch/pytorch/issues/98602#issuecomment-1527312211 and reproducer in https://github.com/pytorch/pytorch/issues/98602#issuecomment-1528214175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100295
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb
2023-04-29 03:37:35 +00:00
Li-Huai (Allan) Lin
81978120ec [MPS] Fix trace exceptions not raised for error inputs (#99239)
Also rename `trace_mps_out` to `trace_mps` as it is not an out version.

Remove `index_add` from the XFAILLIST as it seems to work as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99239
Approved by: https://github.com/kulinseth
2023-04-26 14:41:50 +00:00
Li-Huai (Allan) Lin
f4a37c9a5d [MPS] Fix max_pool2d exceptions not raised for error inputs (#99238)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99238
Approved by: https://github.com/kulinseth
2023-04-26 14:41:50 +00:00
Li-Huai (Allan) Lin
f4cf744380 [MPS] Fix gelu exceptions not raised for error inputs (#99237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99237
Approved by: https://github.com/kulinseth
2023-04-26 14:41:46 +00:00
Li-Huai (Allan) Lin
1fcf40da63 [MPS] Add linear inputs check (#99228)
Fixes #98211

https://github.com/pytorch/pytorch/issues/98211#issuecomment-1496005668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99228
Approved by: https://github.com/kit1980
2023-04-26 04:44:23 +00:00
Denis Vieriu
89baa1a74c [MPS] Add support for linalg.vector_norm (#99811)
Summary of changes:

- Add support for linalg.vector_norm
- Fix zero norm, correct formula is: sum(x != 0)
- Add additional tests in test_mps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99811
Approved by: https://github.com/kulinseth
2023-04-26 01:34:29 +00:00
Justin Chu
79c9e82e27 Fix flake8 lint errors reported by ruff - take 2 (#99798)
Replaces #99784. This PR is pure autofix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99798
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-04-23 23:09:51 +00:00
BJ Hargrave
dc52ba2906 Fix test_mps for macos 13.3 (#98739)
The expected dtype is changed from torch.int64 to torch.int32 for macOS versions
prior to 13.3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98739
Approved by: https://github.com/kulinseth
2023-04-12 19:23:08 +00:00
Li-Huai (Allan) Lin
be8a4eb8e3 [MPS] Add index_fill op (#98694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98694
Approved by: https://github.com/kulinseth
2023-04-12 18:13:33 +00:00
Li-Huai (Allan) Lin
71aea7f56e [MPS] Add error inputs check (#98167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98167
Approved by: https://github.com/kulinseth
2023-04-12 17:19:13 +00:00
Nikita Shulga
583193e1d9 [MPS] Fix batch_norm_backwards key (#98794)
One needs different graphs for batch_norm_backwards depending whether or
not gradients are required for some of the params

Fixes https://github.com/pytorch/pytorch/issues/98602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98794
Approved by: https://github.com/kulinseth
2023-04-11 17:23:36 +00:00
Guang Yang
c377a8590b Add nonzero_static() op to pytorch to unblock export (#97417)
Summary: Add new experimental python op (`torch.nonzero_static`) for export. There is NO cuda impl included in this PR

Example:

Say input tensor is `x = torch.tensor([[1, 0], [3, 2]])`

Calling regular `nonzero()` on x will give you the tensor `tensor([[0, 0], [1, 0], [1, 1]])`.
Calling `nonzero_static(x, size=4)` will give you `tensor([[0, 0], [1, 0], [1, 1], [fill_value, fill_value]])` (padded).
Calling `nonzero_static(x, size=2)` will give you `tensor([[0, 0], [1, 0]])` (truncated).
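
The examples above as a runnable sketch (the `fill_value` default of `-1` is assumed):

```python
import torch

x = torch.tensor([[1, 0], [3, 2]])
print(torch.nonzero(x))                 # tensor([[0, 0], [1, 0], [1, 1]])
print(torch.nonzero_static(x, size=4))  # one extra row padded with fill_value
print(torch.nonzero_static(x, size=2))  # truncated to the first two rows
```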

Test Plan:
**Unit Tests**
```
buck test @mode/dev-nosan //caffe2/test:test_dynamo -- 'caffe2/test:test_dynamo - test_export.py::ExportTests::test_export_with_nonzero_static' -- 'caffe2/test:test_dynamo - test_misc.py::MiscTests::test_nonzero_static'
```

**PT2 Export with `nonzero_static()`**
Example of `GraphModule` in the exported graph
```
def forward(self, x):
    arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
    nonzero_static_default = torch.ops.aten.nonzero_static.default(arg0, size = 4);  arg0 = None
    return pytree.tree_unflatten([nonzero_static_default], self._out_spec)
```

Differential Revision: D44324808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97417
Approved by: https://github.com/ezyang
2023-04-11 05:13:36 +00:00
Nikita Shulga
29cde00701 [MPS] Add random_ overload (#98333)
That simply calls `torch.random_(from=0, to=None)`

Also, fix the optional upper-bound calculation for all `dtypes` but int64:
as one can see from https://pytorch.org/docs/stable/generated/torch.Tensor.random_.html,
the `from` boundary is inclusive, but `to` is exclusive, i.e. if `to` is
omitted, it should be set to `128` for the `torch.int8` dtype and to `2`
for `torch.bool`.
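
A small sketch of those bounds in practice (a hedged illustration, assuming an `mps` device is available):

```python
import torch

t = torch.empty(1000, dtype=torch.int8, device="mps")
t.random_()  # implicit `from` is 0, implicit `to` is 128 (exclusive)
assert int(t.min()) >= 0 and int(t.max()) <= 127

b = torch.empty(1000, dtype=torch.bool, device="mps")
b.random_()  # implicit `to` is 2, i.e. values in {False, True}
```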

Add test for `torch.random_`

Fixes https://github.com/pytorch/pytorch/issues/98118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98333
Approved by: https://github.com/kulinseth
2023-04-05 21:24:45 +00:00
Li-Huai (Allan) Lin
db8abde9b6 [MPS] Enable conditional indexing tests (#97871)
The tests seem to be working now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97871
Approved by: https://github.com/kulinseth
2023-04-01 16:15:08 +00:00
Li-Huai (Allan) Lin
7776653a0c Add linear gradgrad (#97151)
Fixes #92206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97151
Approved by: https://github.com/albanD
2023-03-30 07:25:02 +00:00
Philip Meier
2f6c18d1a2 improve memory footprint of torch.testing.assert_close (#96131)
Redo of #90172 out of stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96131
Approved by: https://github.com/pearu, https://github.com/mruberry
2023-03-29 23:49:56 +00:00
Li-Huai (Allan) Lin
4afef85dda [MPS] Fix index_select_scalar test (#97773)
#96408 introduced a check that prevents the index into a scalar from being non-singleton.

Fixes #94162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97773
Approved by: https://github.com/kulinseth
2023-03-28 19:23:59 +00:00
Li-Huai (Allan) Lin
100641aadf [MPS] Fix torch.eye unsupported bool constant on macOS 12 (#97027)
Fixes #91620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97027
Approved by: https://github.com/kulinseth
2023-03-20 18:08:36 +00:00
Ramin Azarmehr
50beab2978 [MPS] Fix the failure with ReplicatePad3D (#96988)
- Only ReflectPad needs the torch checks for input arguments and not the ReplicatePad
- Added a test case
- The failure was originally found in test_modules with test `test_forward_nn_ReplicationPad3d_mps_float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96988
Approved by: https://github.com/DenisVieriu97
2023-03-17 01:41:12 +00:00
alexdremov
62eb7a2e97 [MPS] LSTM grad_y missing fix (#96601)
Fixes #96416
Added tests that, similarly to the issue, do not use the LSTM output.

Seems like this fix once again introduces backward incompatibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96601
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-03-16 15:53:56 +00:00
Li-Huai (Allan) Lin
c95bcb6694 [MPS] Fix flip where no dims need to be flipped (#96605)
Fixes #96558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96605
Approved by: https://github.com/kulinseth
2023-03-14 00:34:30 +00:00
Li-Huai (Allan) Lin
a87f3f612e [MPS] Fall back multi-layer LSTM on macOS 12 (#90909)
The native implementation of LSTM has been fixed on macOS 13.

On macOS 12, the multi-layer LSTM still has a numerical correctness issue that cannot be resolved on OS's side.

Thus, we fall back the multi-layer LSTM on macOS 12 to LSTMCell iteration. It might have performance impact but will make LSTM on macOS 12 fully usable.

Fixes: #90421
Issues related: #80306, #83144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90909
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-03-10 03:10:49 +00:00
Nikita Shulga
075a49442d [MPS] Allow float16 input to float32 LayerNorm (#96430)
Only for forward pass

Subset of https://github.com/pytorch/pytorch/pull/96208

Create constant with scalar using `input_mps_dtype` and use
`reciprocalWithTensor` instead of `divisionWithPrimaryTensor:1.0
secondaryTensor:`

Fixes https://github.com/pytorch/pytorch/issues/96113

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96430
Approved by: https://github.com/kulinseth
2023-03-09 22:09:10 +00:00
Kulin Seth
2bb022e902 [MPS] Adding xfaillist with all categories of failures. (#96176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96176
Approved by: https://github.com/malfet
2023-03-08 08:41:21 +00:00
Catherine Lee
eea0733045 Reduce pytest blocklist (#96016)
`TestCase = object` or variations of it get switched to `TestCase = NoTest`.

unittest collects tests based on subclassing unittest.TestCase, so setting TestCase = object removes a class from unittest test collection.  pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions), as in the sketch below.
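
A minimal sketch of the pattern (the exact `NoTest` definition is assumed; the `__test__ = False` attribute is what pytest honors):

```python
import unittest

class NoTest:
    __test__ = False  # tells pytest not to collect this class or its subclasses

# Before: `TestCase = object` hides the class from unittest's
# subclass-based collection, but pytest still collects it by name.
# After: hidden from both collectors.
TestCase = NoTest
```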
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-07 18:30:27 +00:00
Li-Huai (Allan) Lin
2f66b57a7a [MPS] Fix in-place add and sub with alpha == 0.0 (#96184)
Apart from fixing the below issue, this PR integrates the test for `sub` into the test for `add` as they are implemented using the same template.

Fixes #96065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96184
Approved by: https://github.com/kulinseth
2023-03-07 17:17:53 +00:00
Nikita Shulga
769cc8a614 [MPS] Add type promotion to torch.addcmul (#96164)
Fixes crash while running something like `python -c "import torch;x=torch.rand(3, 3, dtype=torch.float16, device='mps');y=x.addcmul(torch.ones(3, device='mps'), torch.ones(3, device='mps'));print(y)"`

Modify `castMPSTensor` to become a no-op if cast is not needed

Define `common_dtype` as `c10::promoteTypes` between self, tensor1 and
tensor2. Cast the result to whatever output type is requested.

Add mixed-types test to `TestMPS.test_addcmul`, though it does not cover
all the permutations

Discovered while looking at https://github.com/pytorch/pytorch/issues/96113

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96164
Approved by: https://github.com/kulinseth
2023-03-07 04:19:30 +00:00
alexdremov
78da315afd [MPS] Fix bidirectional LSTM & small one-direction LSTM fix (#95563)
Fixes #94754

With this PR I hope to finish my breathtaking journey of fixing MPS LSTM.

Here, I enable `bidirectional` on MPS. Also, I've noticed that the cache key did not account for all parameters, so there could have been problems with a one-directional LSTM created first without bias or dropout and then with one of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95563
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/malfet
2023-03-05 00:19:54 +00:00
Nikita Shulga
436993d52b [MPS] Error on unsupported types (#95982)
I.e. attempt to create tensors of all possible types and make sure that
a structured error is raised for types unsupported on MPS

Also, rename `test_resize_as_all_dtypes_and_devices` to `test_resize_as_mps_dtypes` and `test_resize_all_dtypes_and_devices` to `test_resize_mps_dtypes`, and run both tests for all MPS dtypes (rather than just bool, float16 and bfloat16 as they were running before)

Fixes https://github.com/pytorch/pytorch/issues/95976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95982
Approved by: https://github.com/kulinseth
2023-03-04 01:29:07 +00:00
Denis Vieriu
304a95435d [MPS] Disallow reshape in slice (#95905)
Disallow reshapes for arrayViews.
The current code allows a base shape of `[2, 4, 256]` to be sliced into `[4, 1, 256]` (the view's shape), which is not possible. Slicing a smaller dimension into a bigger one will always error out.

Fixes https://github.com/pytorch/pytorch/issues/95883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95905
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-03-03 08:08:34 +00:00
Denis Vieriu
d0dd898943 [MPS] Remove remaining casts from 13.3 (#95870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95870
Approved by: https://github.com/kulinseth
2023-03-02 12:44:59 +00:00
Denis Vieriu
4d3352ed90 [MPS] Remove casts from reduction/cumsum/sort ops starting with macOS 13.3 (#95817)
MPS in macOS13.3 has added support for int64 in reduction ops / cumsum / sort / argsort. This change removes the hard-coded casts and error messages prior macOS 13.3, allowing the op to run natively with int64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95817
Approved by: https://github.com/kulinseth
2023-03-02 00:26:24 +00:00
Kulin Seth
5d9d8c6154 [MPS] Add fixes for div with floor and raise error for div_trunc (#95769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95769
Approved by: https://github.com/DenisVieriu97
2023-03-01 20:52:28 +00:00
Denis Vieriu
e5a959a2d4 [MPS] Fix views with 3 or more sliced dimensions (#95762)
Fixes https://github.com/pytorch/pytorch/issues/95482
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95762
Approved by: https://github.com/razarmehr
2023-03-01 16:16:49 +00:00
Denis Vieriu
ed1957dc19 [MPS] Add support for masked_scatter (#95743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95743
Approved by: https://github.com/kulinseth
2023-03-01 01:36:36 +00:00
Li-Huai (Allan) Lin
f33180fb7f [MPS] Add pow.Scalar (#95201)
1. Adds `pow.Scalar`.
2. Modifies testing `atol` and `rtol` to get pow output match tests pass.
3. Xfails numerically incorrect dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95201
Approved by: https://github.com/kulinseth
2023-02-28 16:11:15 +00:00
Li-Huai (Allan) Lin
9e16f1281f [MPS] Add copysign op. (#95552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95552
Approved by: https://github.com/kulinseth
2023-02-28 06:49:46 +00:00
Li-Huai (Allan) Lin
b7c2a65139 [MPS] Fix type casting copy with storage offset (#95573)
This PR handles the case where the `dst` tensor of a type-casting copy has a storage offset, by creating a temporary buffer to store the results and then copying them back to the dst with the offset applied.
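
A hedged sketch of the case being fixed (dtypes and sizes assumed):

```python
import torch

src = torch.arange(4, dtype=torch.float32, device="mps")
dst = torch.zeros(8, dtype=torch.float16, device="mps")
dst[4:].copy_(src)  # dst[4:] has storage_offset=4 and the copy also casts
print(dst)          # the second half now holds [0., 1., 2., 3.] in fp16
```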

Fixes #95417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95573
Approved by: https://github.com/kulinseth
2023-02-28 05:24:31 +00:00
Li-Huai (Allan) Lin
4930ae7f82 [MPS] Add roll op (#95168)
Reuse the cpu implementation here as currently there is no native roll implementation from the MPS api (if any, please let me know).

Compared to falling back to cpu using `PYTORCH_ENABLE_MPS_FALLBACK=1`, this way we keep tensors on MPS.

Did a small benchmark:

```python
import time

import torch

for num in [10, 100, 1000, 10000]:
    for shft in [1, 5]:
        sz = num * num
        x = torch.arange(sz, device="cpu").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        cpu_e = time.time() - s
        x = torch.arange(sz, device="mps").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        mps_e = time.time() - s
        print(f"size: ({num}, {num}) shft: {shft} cpu: {cpu_e} mps: {mps_e}")
```

```
size: (10, 10) shft: 1 cpu: 0.00015163421630859375 mps: 0.003078937530517578
size: (10, 10) shft: 5 cpu: 6.794929504394531e-05 mps: 0.0014979839324951172
size: (100, 100) shft: 1 cpu: 0.0001621246337890625 mps: 0.0016200542449951172
size: (100, 100) shft: 5 cpu: 0.00016379356384277344 mps: 0.00154876708984375
size: (1000, 1000) shft: 1 cpu: 0.0022068023681640625 mps: 0.0017690658569335938
size: (1000, 1000) shft: 5 cpu: 0.009071111679077148 mps: 0.0020020008087158203
size: (10000, 10000) shft: 1 cpu: 0.16785407066345215 mps: 0.011695146560668945
size: (10000, 10000) shft: 5 cpu: 0.1160881519317627 mps: 0.011452913284301758
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95168
Approved by: https://github.com/albanD
2023-02-27 18:31:17 +00:00
Nikita Shulga
fd8367a7b1 [MPS][BE] Introduce xfail (#95045)
Add an `mps_ops_modifier` function that adds `unittest.expectedFailure` decorators to the operators that are supposed to fail on MPS.

This allows one to know whether or not an operation will fail, rather than skipping it.
For example:
```
% python test_mps.py -v -k test_output_match_dot
test_output_match_dot_cpu_float32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int16 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int64 (__main__.TestConsistencyCPU) ... expected failure
test_output_match_dot_cpu_uint8 (__main__.TestConsistencyCPU) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.175s

OK (expected failures=1)
```
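
Conceptually this boils down to the standard `unittest` xfail mechanism; a standalone sketch (op and dtype taken from the example output above, the test body is assumed):

```python
import unittest

import torch

class XFailSketch(unittest.TestCase):
    @unittest.expectedFailure  # the test still runs; an unexpected pass is reported
    def test_dot_int64(self):
        a = torch.ones(3, dtype=torch.int64, device="mps")
        torch.dot(a, a)

if __name__ == "__main__":
    unittest.main()
```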

Moved a few functions from the blocklist to xfail, and found out that some of the functions in the list actually work, for example `torch.long`.

Also, allow `None` to be used in `ALLOWLIST` instead of specifying all types explicitly (which aligns with the `DecorateInfo` semantics)

Eventually, we should get rid of `ALLOWLIST` (i.e. all ops are allowed), keep a small `BLOCKLIST` and move the rest to `XFAILLIST`

Add step to print HW/SW info before running MPS tests.

Fix type promotion in `trace_mps_out`

Introduce `MACOS_12_X_XFAILLIST` and skip almost every function for `torch.uint8`,  although some of those doesn't make much sense and feels like a regression from PyTorch-1.13

Re-enabled MPS testing on MacOS 12, as runners seem to be available again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95045
Approved by: https://github.com/albanD
2023-02-27 15:01:01 +00:00
Li-Huai (Allan) Lin
4dca9bde05 [MPS] Add fmax fmin op (#95191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95191
Approved by: https://github.com/kulinseth
2023-02-25 07:21:48 +00:00
Li-Huai (Allan) Lin
5cad542e43 [MPS] Add log_sigmoid op (#95280)
1. Add log_sigmoid.
2. Make log1p a common function. Operators that use log1p: mish, softplus, log_sigmoid (maybe more).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95280
Approved by: https://github.com/kulinseth
2023-02-24 01:38:30 +00:00
alexdremov
b9e95158d5 [MPS] Fix LSTM backward and forward pass (#95137)
Fixes #91694
Fixes #92615

Several transpositions were missing in the backward graph in the case of `batch_first=True`. #91694 does not reproduce with `batch_first=False`.

After fixing transpose issue, I finally thought that now I can use LSTM freely in my project. And then I got horrific results on train. Seems related to #92615.

After that I decided to fix LSTM's backward step completely. I collected all my findings in this thread — seems like I succeeded

Funny enough, backward tests were completely disabled before and were not passing:
```python
    @unittest.skipIf(True, "Backward of lstm returns wrong result")
    def test_lstm_2(self, device="mps", dtype=torch.float32):
```

UPD: the forward pass of the multi-layer version was also wrong due to incorrect `initState, initCell` slices. Tests were passing because the states were initialized with zeros. *Accidentally* fixed this too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95137
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/soulitzer
2023-02-23 17:32:42 +00:00
Denis Vieriu
86efa104f5 [MPS] Fix view op slicing for 2nd dim in case of 0 offset (#95381)
* Fix view op slicing for 2nd dim in case of 0 offset

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95381
Approved by: https://github.com/razarmehr
2023-02-23 17:26:10 +00:00
XiaobingSuper
5730cabdd0 using float type to do the computation of norm reduce for cpu half and bfloat16 dtype (#95166)
As the title says, we should use a higher-precision dtype to compute norm reductions for the half and bfloat16 dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95166
Approved by: https://github.com/peterbell10, https://github.com/jgong5, https://github.com/ngimel, https://github.com/lezcano
2023-02-23 05:00:25 +00:00
Li-Huai (Allan) Lin
69c76ff05e [MPS] Add xlogy op (#95213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95213
Approved by: https://github.com/kulinseth, https://github.com/soulitzer
2023-02-22 19:43:12 +00:00
Denis Vieriu
5e47571a13 [MPS] Convolution cleanup; remove unnecessary contiguous calls (#95078)
- Fixes convolution crashes in backward with weights
- Removes unnecessary contiguous calls
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95078
Approved by: https://github.com/kulinseth
2023-02-22 18:04:12 +00:00
Kulin Seth
02a6d4334b [MPS] Handle broadcasting by expanding src tensor in Copy.mm (#95272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95272
Approved by: https://github.com/DenisVieriu97
2023-02-22 18:02:42 +00:00
Denis Vieriu
8475af7761 [MPS] Cast int64 to int32 for reduction ops (#95231)
- warn when converting int64 for reduction ops
- use a cast tensor for the reduction sum on trace
- unblock trace from running
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95231
Approved by: https://github.com/razarmehr
2023-02-22 17:23:25 +00:00
Li-Huai (Allan) Lin
f70a3430aa [MPS] Add hypot op (#95196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95196
Approved by: https://github.com/kulinseth
2023-02-21 22:40:20 +00:00
Li-Huai (Allan) Lin
e0a0329a67 [MPS] Add hardsigmoid op (#95164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95164
Approved by: https://github.com/kulinseth
2023-02-21 07:06:37 +00:00
Li-Huai (Allan) Lin
d96aac8d2a [MPS] Add logit op (#95162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95162
Approved by: https://github.com/kulinseth
2023-02-21 07:02:45 +00:00
alexdremov
a17a7ccc92 [MPS] LogSoftmax numerical stability (#95091)
Fixes #94043

Calculations are now consistent with the numerically stable formula and with the CPU:

$\mathrm{LogSoftmax}(X, \text{dim}) = X - \max(X, \text{dim}) - \log\big(\sum \exp(X - \max(X, \text{dim}))\big)$
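
A plain-PyTorch reference of that formula (a sketch, not the MPS kernel itself):

```python
import torch

def stable_log_softmax(x: torch.Tensor, dim: int) -> torch.Tensor:
    m = x.max(dim=dim, keepdim=True).values
    shifted = x - m  # subtracting the max keeps exp() from overflowing
    return shifted - shifted.exp().sum(dim=dim, keepdim=True).log()

x = torch.randn(4, 8) * 100  # magnitudes where the naive formula overflows
torch.testing.assert_close(stable_log_softmax(x, -1), torch.log_softmax(x, -1))
```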

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95091
Approved by: https://github.com/malfet, https://github.com/kulinseth
2023-02-18 18:26:29 +00:00
Ramin Azarmehr
9511b9fad2 [MPS] Fix copy_cast_mps() on tensors with storage offset (#95093)
- The copy_cast path requires storage_offset to be applied before casting
- This should fix some correctness issues in transformer models

Fixes #94980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95093
Approved by: https://github.com/kulinseth
2023-02-18 16:29:01 +00:00
Li-Huai (Allan) Lin
25ee6dd335 [MPS] Fix fill_ where input tensor has a storage offset (#95113)
Fixes #94390

Apart from fixing the issue above, this PR also fixes a bug where, when an input tensor can be sliced, a sliced array view is created. This array view seems to be either not writable or to have different storage from the original tensor, causing incorrect results with the in-place `fill`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95113
Approved by: https://github.com/kulinseth
2023-02-18 16:19:15 +00:00
Li-Huai (Allan) Lin
0a9c608461 [MPS] Fix tensor with non-zero storage offset graph gathering (#91071)
Previously, the "can slice" flag in Placeholder constructor in `OperationUtils.mm` is conditioned on whether the numbers of dimensions of base shape and view shape are the same. This doesn't consider the situation that a view tensor could be the base tensor's sliced and then unsqueezed version, resulting in different num of dims.

For example, if we want to stack `y_mps` and `x_mps` on the last dim:
```
t_mps = torch.tensor([1, 2, 3, 4], device="mps")
x_mps = t_mps[2:]  # [3, 4]
y_mps = t_mps[:2]  # [1, 2]

res_mps = torch.stack((y_mps, x_mps), dim=-1)
```

the kernel will unsqueeze both of them on the last dim and then concatenate them, which is equivalent to:

```
res_mps = torch.cat((y_mps.unsqueeze(-1), x_mps.unsqueeze(-1)), dim=-1)
```

`x_mps.unsqueeze(-1)` is an unsqueezed and contiguous tensor with a storage offset; tensors of this kind should be sliceable without cloning their storage.

Fixes #87856
Fixes #91065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91071
Approved by: https://github.com/kulinseth
2023-02-17 18:44:20 +00:00
Denis Vieriu
a2afc657da [MPS] Fix upsample for NHWC output (#94963)
Fixes https://github.com/huggingface/diffusers/issues/941

**Before**:
<img width="1144" alt="Screenshot 2023-02-15 at 8 11 53 PM" src="https://user-images.githubusercontent.com/104024078/219266709-6a77636a-2fc0-4802-b130-85069b95953f.png">

**After**:
<img width="1144" alt="Screenshot 2023-02-15 at 8 12 02 PM" src="https://user-images.githubusercontent.com/104024078/219266694-ea743c02-fb55-44f1-b7d6-5946106527c3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94963
Approved by: https://github.com/razarmehr
2023-02-17 05:07:22 +00:00
Denis Vieriu
5d1e9fd214 [MPS] Fix prelu backward pass (#94933)
Allocate the correct shape for the weights gradient
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94933
Approved by: https://github.com/razarmehr
2023-02-17 03:45:12 +00:00
Denis Vieriu
bc361fdfdf [MPS] Fix bilinear backward pass (#94892)
Fixes backward pass for bilinear.

Summary of changes:
- the bilinear op is able to produce **contiguous, non-view** tensors with a storage offset, such as: shape=`[1, 1, 1, 1]`, `storage_offset=12`. This seems like a weird case, but it is valid, and for these types of tensors we wouldn't be able to gather/scatter since we look at the view flag (which is not set here). This change looks at `storage_offset` only rather than the is_view flag, which is not being set
- **reduction sum** must return a zeroed-out output if passed an input with 0 elements (e.g. a shape of (0, 5)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94892
Approved by: https://github.com/kulinseth
2023-02-16 00:30:29 +00:00
Kulin Seth
54ebf255ab [MPS] Fixes for LSTM. (#94889)
- The backward pass has to supply an explicit bias tensor of zeros if none is passed to the op, or the bias gradient will not be calculated.
- Fixed the bias tensor mistakenly getting overwritten to zeros
- Fixes a crash when the lstm op is called with has_biases set to false. The change takes into account the changed shape of the input params TensorList depending on the bias flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94889
Approved by: https://github.com/DenisVieriu97
2023-02-15 16:10:40 +00:00
Denis Vieriu
71ec2617d2 [MPS] Block uint8 data type for unary and binary ops on macOS 12 (#94876)
Blocks uint8 data type for unary and binary ops on macOS 12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94876
Approved by: https://github.com/kulinseth
2023-02-15 06:09:56 +00:00
Kulin Seth
94f0808629 [MPS] Add fmod op. (#94722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94722
Approved by: https://github.com/DenisVieriu97
2023-02-14 14:55:26 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Denis Vieriu
1f06a71797 [MPS] Error out for square int64 input (#94766)
- add checks for whether macOS is greater than 13.2
- remove square from block list
- throw an error message if pow with int64 is called before macOS 13.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94766
Approved by: https://github.com/kulinseth
2023-02-14 04:45:41 +00:00
Denis Vieriu
cedb7e3d77 [MPS] Fix remainder op for integral dtypes (#94757)
Map remainder op to the same template as div (integral dtypes will be cast to float)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94757
Approved by: https://github.com/kulinseth
2023-02-14 01:06:49 +00:00
Denis Vieriu
4acdc446b2 [MPS] Fix batch norm for NHWC (#94760)
Fixes `test_modules.py` batch norm NHWC testcases:
- `test_memory_format_nn_BatchNorm2d_eval_mode_mps_float32`
- `test_memory_format_nn_BatchNorm2d_eval_mode_mps_float32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94760
Approved by: https://github.com/kulinseth
2023-02-13 23:31:10 +00:00
OwenPendrighElliott
840fb74ec8 86990 range mps support (#91075)
Fixes #86990

- Added range_mps_out to RangeFactories.mm
- Updated native_functions.yaml
- Added tests in test_mps.py

I did observe that despite [the documentation for torch.range](https://pytorch.org/docs/stable/generated/torch.range.html), the existing implementations do not adjust their return type based on the arguments passed to them. The MPS implementation provided here behaves the same way as the existing CPU and CUDA implementations in this regard, hence the conversion to float32 in the test cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91075
Approved by: https://github.com/kulinseth, https://github.com/DenisVieriu97
2023-02-13 23:19:10 +00:00
Ramin Azarmehr
b57e6fdb50 [MPS] Enable Memory Leak Detection for test_mps.py (#94646)
- To check for Memory Leaks in `test_mps.py`, set the env-variable `PYTORCH_TEST_MPS_MEM_LEAK_CHECK=1` when running test_mps.py (used CUDA code as reference).
- Added support for the following new python interfaces in MPS module:
`torch.mps.[empty_cache(), set_per_process_memory_fraction(), current_allocated_memory(), driver_allocated_memory()]`
- Renamed `_is_mps_on_macos_13_or_newer()` to `_mps_is_on_macos_13_or_newer()`, and `_is_mps_available()` to `_mps_is_available()` to be consistent in naming with prefix `_mps`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94646
Approved by: https://github.com/malfet
2023-02-13 17:56:24 +00:00
Kulin Seth
18587cb31f [MPS] Add sort and argSort Op. (#94697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94697
Approved by: https://github.com/DenisVieriu97
2023-02-13 01:03:22 +00:00
Xuehai Pan
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
Ramin Azarmehr
bdd8f518d7 [MPS] Add Python Module Bindings for the MPS backend (#94417)
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
  `torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`

Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
2023-02-12 21:22:30 +00:00
Henry Cheng
fe0c7fbcf8 [MPS] Add repeat_interleave to MPS (#88649)
Fixes #87219

Implements the new ``repeat_interleave`` function in ``aten/src/ATen/native/mps/operations/Repeat.mm``
Adds it to ``aten/src/ATen/native/native_functions.yaml``
Adds a new test ``test_repeat_interleave`` to ``test/test_mps.py``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88649
Approved by: https://github.com/kulinseth
2023-02-12 08:43:55 +00:00
Denis Vieriu
b794fd19c5 [MPS] Add scatter gather kernels (support up to 5 dimensions) (#94663)
Add scatter gather kernels (support up to 5 dimensions)
- Fixes int64 issues for `mH`, `mT`, `T`, `H` on Monterey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94663
Approved by: https://github.com/kulinseth
2023-02-12 08:17:26 +00:00
Kulin Seth
54c0f37646 [MPS] Add support for TopK k>16 (#94639)
Fixes: https://github.com/pytorch/pytorch/issues/78915

* Add the topk>16 support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94639
Approved by: https://github.com/DenisVieriu97
2023-02-12 00:57:53 +00:00
Denis Vieriu
4a762cb622 [MPS] Fix channels last copies in ELU,ReLU and Hardswish (#94664)
Fixes test_modules.py tests:
```
test_memory_format_nn_Hardswish_mps_float32
test_non_contiguous_tensors_nn_Hardswish_mps_float32
test_memory_format_nn_ReLU_mps_float32
```
Fixes elu when ran with `ChannelsLast` memory format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94664
Approved by: https://github.com/kulinseth
2023-02-11 22:05:21 +00:00
Kulin Seth
c74f438c01 [MPS] Fix the cat op for NHWC case (#94662)
* add unit test cat with non-contiguous

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94662
Approved by: https://github.com/DenisVieriu97
2023-02-11 19:43:33 +00:00
PyTorch MergeBot
4fe365774a Revert "[MPS] Add Python Module Bindings for the MPS backend (#94417)"
This reverts commit beb4f5bf39.

Reverted https://github.com/pytorch/pytorch/pull/94417 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to break MacOS test in trunk bae397ec63
2023-02-11 05:24:45 +00:00
Ramin Azarmehr
030209088f [MPS] Fix the regression with test_index_select_scalar() (#94645)
PR #94347 caused a regression in test_mps, which this patch fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94645
Approved by: https://github.com/DenisVieriu97
2023-02-11 01:36:51 +00:00
Denis Vieriu
7ce785b50b [MPS] Fix gelu forward and backward ops (#94529)
Forward pass:
```
fix gelu_out_mps key
add calculation for gelu with tanh
remove gelu from blocklist
```
Backward pass:
```
fix gelu_backward_out_mps key
uniform format
add calculation for the tanh approximate backward pass
unblock grad test from blocklist
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94529
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-02-11 00:24:30 +00:00
Denis Vieriu
507b8c3423 [MPS] Native implementation for addr (#94538)
```
addr_out_mps to perform res = beta * input + alpha * outer(vec1, vec2)
move addr f16 to low precision list
move addr non-float to unsupported list
add test_addr tests
```
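
A quick reference check of those semantics (CPU tensors, values assumed):

```python
import torch

inp = torch.randn(3, 2)
vec1, vec2 = torch.randn(3), torch.randn(2)
beta, alpha = 0.5, 2.0
res = torch.addr(inp, vec1, vec2, beta=beta, alpha=alpha)
torch.testing.assert_close(res, beta * inp + alpha * torch.outer(vec1, vec2))
```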
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94538
Approved by: https://github.com/razarmehr
2023-02-11 00:16:50 +00:00
Denis Vieriu
0b31ebf9e4 [MPS] Added zero check to inverse & fix for any op to avoid segfault issue (#94551)
Fixes the empty placeholder error in the inverse op. The change to the any op should also resolve previously seen segfaults
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94551
Approved by: https://github.com/kulinseth
2023-02-10 23:39:12 +00:00
Ramin Azarmehr
beb4f5bf39 [MPS] Add Python Module Bindings for the MPS backend (#94417)
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
  `torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`

Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
2023-02-10 23:18:41 +00:00
Denis Vieriu
728dfeee48 [MPS] Fix ops with bool issues in macOS Monterey (#94464)
Summary:
- Remove redundant bool casts from scatter/gather
- Make the workarounds for scatter/gather (for bool/uint8 data types) OS specific - use them only in macOS Monterey, ignore them starting with macOS Ventura
- Make all tensors ranked in scatter

Fixes following tests:
```
test_output_match_slice_scatter_cpu_bool
test_output_match_select_scatter_cpu_bool
test_output_match_diagonal_scatter_cpu_bool
test_output_match_repeat_cpu_bool
test_output_match_rot90_cpu_bool
etc..
```

Still failing on macOS Monterey (needs additional investigation):
```
test_output_match_scatter_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94464
Approved by: https://github.com/kulinseth
2023-02-10 21:36:25 +00:00
Ramin Azarmehr
7c4acdad4a [MPS] Fix the crash in huberloss with Float16 (#94567)
- Also fix FP16 correctness issues in several other ops by lowering their FP16 precision requirements via the new list `FP16_LOW_PRECISION_LIST`.
- Add atol/rtol to the `AssertEqual()` of Gradient tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94567
Approved by: https://github.com/kulinseth
2023-02-10 19:20:29 +00:00
Denis Vieriu
92d8c4b37c [MPS] Fix cumsum for integral data types (#94530)
- Make intermediate type for cumsum ScalarType::Int: fixes https://github.com/pytorch/pytorch/issues/90635
- Add support for negative dimensions in cumsum: fixes https://github.com/pytorch/pytorch/issues/92329
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94530
Approved by: https://github.com/kulinseth
2023-02-10 17:40:29 +00:00
Kulin Seth
1d3980656c [MPS] Fix min/max_reduction_with_dim ops (#94386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94386
Approved by: https://github.com/DenisVieriu97, https://github.com/razarmehr
2023-02-10 15:23:47 +00:00
Kulin Seth
0fe11589df [MPS] Add im2col and col2im to Fallback (#94491)
These are not in the hot path  as they are mostly used in Preprocessing layers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94491
Approved by: https://github.com/razarmehr
2023-02-10 15:22:59 +00:00
PyTorch MergeBot
f152a79be9 Revert "update aten op overload to not use from to avoid compile errors (#89797)"
This reverts commit 021d267694.

Reverted https://github.com/pytorch/pytorch/pull/89797 on behalf of https://github.com/jeanschmidt due to breaking internal builds - more details on https://fburl.com/sandcastle/bz8mgkil
2023-02-10 11:32:25 +00:00
Denis Vieriu
a1f15fb987 [MPS] Fix batchnorm forward and backward pass (#94351)
Fixes batchnorm forward/backward pass and layer_norm:

Batchnorm Forward pass:
```
- fix batch_norm_mps_out key
- return 1/sqrt(var+epsilon) instead of var
- return empty tensor for mean and var if train is not enabled
- remove native_batch_norm from block list
```

Batchnorm Backward pass:
```
- add the reverse calculation for save_var used in the backward path
- add backward test for native_batch_norm and _native_batch_norm_legit
```

Layer norm:
```
- remove the duplicate calculation from layer_norm_mps
- enable native_layer_norm backward test
- raise atol rtol for native_layer_norm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94351
Approved by: https://github.com/razarmehr
2023-02-10 05:53:36 +00:00
Denis Vieriu
016f0b2f62 [MPS] Calculate nonzero count inside nonzero op (#94442)
Calculate nonzero count directly in the nonzero op.
Additionally, synchronize before entering nonzero op to make sure all previous operations finished (output shape is allocated based on the count_nonzero count)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94442
Approved by: https://github.com/kulinseth
2023-02-10 00:53:52 +00:00
Denis Vieriu
336d9354d6 [MPS] Enable index add for TestConsistency (#94356)
Enable index_add TestConsistency TestCase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94356
Approved by: https://github.com/kulinseth
2023-02-10 00:21:11 +00:00
Kulin Seth
299ada9cff [MPS] Add the floor_divide fixes. (#94488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94488
Approved by: https://github.com/razarmehr
2023-02-10 00:10:08 +00:00
Kulin Seth
f35f12320a [MPS] Fixes for arange_mps for empty tensor. (#94485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94485
Approved by: https://github.com/razarmehr
2023-02-09 19:30:17 +00:00
Kulin Seth
105f7205bd [MPS] Fix and unblock TestConsistency for median (#94489)
- fix num_output_dims calculation
- fix median_out_mps key
- cast tensor sent to sortWithTensor and argSortWithTensor
- note down same issue for unique
- unblock median from blocklist
- add a test_median_int16 test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94489
Approved by: https://github.com/razarmehr
2023-02-09 19:29:07 +00:00
Ramin Azarmehr
4f691d2e2f [MPS] Fix correctness issue with fill_scalar_mps() (#94479)
- The `self` tensor was not contiguous and in-place filling produced wrong results
- Added a test case for the issue

Fixes the zeros_like() issue reported in #94190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94479
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth
2023-02-09 19:07:13 +00:00
jinsu kim
a5b052259b Add MPS support for aten::remainder.Tensor_out (#92139)
Fixes #86806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92139
Approved by: https://github.com/kulinseth, https://github.com/DenisVieriu97
2023-02-09 15:32:30 +00:00
Soof Golan
e4fe11eecb [MPS] Fix torch.topk for empty tensors and k=0 on mps (#91884)
Fixes #91878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91884
Approved by: https://github.com/kulinseth
2023-02-09 10:42:52 +00:00
Soof Golan
19264b50bb [MPS] Add support for nansum on mps (#93845)
* Add `nansum_out_mps` and `nansum_mps` functions
* Moved `get_dtype_from_self` into ReduceOpsUtils.h

Fixes #86809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93845
Approved by: https://github.com/malfet
2023-02-09 10:30:55 +00:00
Kulin Seth
02ca2253cc [MPS] Fixes for Binary ops with casting issues from FP to uint8 (#94382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94382
Approved by: https://github.com/razarmehr
2023-02-09 09:44:02 +00:00
Denis Vieriu
5b8e485a34 [MPS] Add 2d grid sampler (#94273)
Add support for MPS grid sampler
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94273
Approved by: https://github.com/razarmehr
2023-02-09 02:25:46 +00:00
Ramin Azarmehr
6c80d0a5a5 [MPS] Fix correctness issues with Pool2D ops (#94348)
- Fix wrong results in AvgPool2D when `count_include_pad=True`
- Fix issues with adaptive average and max pool2d
- Remove the redundant blocking copies from `AdaptiveMaxPool2d`
- Add `divisor` to cached string key to avoid conflicts
- Add test case when both `ceil_mode` and `count_include_pad` are True (previously failed).
- Clean up redundant code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94348
Approved by: https://github.com/kulinseth
2023-02-09 02:06:40 +00:00
Elias Ellison
021d267694 update aten op overload to not use from to avoid compile errors (#89797)
Fix for https://github.com/pytorch/pytorch/issues/93591 by changing `random_.from` to `random_.from_int`.

The previous signature would fail when printed in an fx graph, because `from` is a reserved python keyword. This change affects serialization but I have added an adapter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89797
Approved by: https://github.com/tugsbayasgalan
2023-02-08 22:04:59 +00:00
Denis Vieriu
22e1698cf7 [MPS] Add triangular solve op through MPSMatrixSolveTriangular (#94345)
Add triangular solve op support through MPS `MPSMatrixSolveTriangular` kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94345
Approved by: https://github.com/razarmehr
2023-02-08 21:48:12 +00:00
Denis Vieriu
5d48392abb [MPS] Skip gather/blit calls in case of strided output (#94260)
Skip gather/blit calls in case of strided output - this prevents:

- allocating additional memory for the output
- additional transpose for both the input and output
Fixes:
```
x = torch.rand((256,10), device='mps')
x = x.permute(1,0)
x.exp()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94260
Approved by: https://github.com/razarmehr
2023-02-07 16:25:03 +00:00
Denis Vieriu
86ae14deaa [MPS] Fix MPSGraph casting issue to MPSDataTypeBool in masked_fill op (#94263)
Fixes TestConsistency masked_fill for bool data type.

Casting a tensor > 1 to MPSDataTypeBool will result in 0 instead of 1. This change manually casts the scalar to a value of 0 or 1 when casting a non-boolean tensor to a boolean tensor:
```
(inputDataType == MPSDataTypeBool) ? !!value.to<double>() : value.to<double>()
```
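
A Python analogue of the `!!` clamp above (illustration only):

```python
def clamp_to_bool(value: float) -> float:
    # double negation maps any nonzero scalar to exactly 1, zero stays 0
    return float(bool(value))

assert clamp_to_bool(3.0) == 1.0  # previously a scalar > 1 became 0
assert clamp_to_bool(0.0) == 0.0
```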
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94263
Approved by: https://github.com/razarmehr
2023-02-07 16:20:55 +00:00
Denis Vieriu
e3ac109618 [MPS] Fallback on gather code to solve view tensors when a slice is followed by a reshape (#94278)
There are cases when the arrayViewTensor API cannot be used to solve the view operations, such as when a view dimension is bigger than the base dimension of the tensor, e.g:
```
base shape: [1, 768, 512, 2] // we cannot slice the base shape in any way to result in first dimension `2`
view shape: [2, 384, 512, 1]
```
On such cases, we need to fallback on the gather code (that detects this is a slice followed by a reshape) to solve this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94278
Approved by: https://github.com/razarmehr
2023-02-07 16:20:08 +00:00
Kulin Seth
4cd086b14c [MPS] Raise error for int64 inputs of dot operator. (#94270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94270
Approved by: https://github.com/razarmehr
2023-02-07 16:12:17 +00:00
Ramin Azarmehr
b654d1494b [MPS] Fix the argument error for tensor_split() test (#94234)
The second tensor argument `tensor_indices_or_sections` of tensor_split() must be on CPU when testing it in TestConsistency. Otherwise it will error out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94234
Approved by: https://github.com/kulinseth
2023-02-07 15:56:49 +00:00
Ramin Azarmehr
36062dd2b4 [MPS] Fix the crash in View ops when slicing wrong lengths (#94259)
The offset + length of the destination tensor must not be larger than the source's length when slicing

Fixes #94190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94259
Approved by: https://github.com/malfet
2023-02-07 15:51:26 +00:00
Ramin Azarmehr
bc8a378333 [MPS] Unregister put_() op due to lack of implementation (#94231)
Currently, `put_()` is not implemented on the MPS backend, so this patch unregisters it and inserts it into the blocklist of TestConsistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94231
Approved by: https://github.com/kulinseth
2023-02-07 06:54:15 +00:00
Kulin Seth
ca74105377 [MPS] Add scalar params to the softplus key. (#94256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94256
Approved by: https://github.com/razarmehr, https://github.com/malfet
2023-02-07 03:04:53 +00:00
Denis Vieriu
9358726a06 [MPS] Handle empty input in layer norm (#94212)
Handle empty input in layer norm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94212
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-02-07 02:55:48 +00:00
Ramin Azarmehr
368e364c19 [MPS] Fix gradient issues with NLL and Smooth_L1 loss ops (#94226)
- Fix correctness issues with nll_loss_backward(), smooth_l1_loss_backward() and cross_entropy_backward() by taking grad_output into account when computing those loss ops
- Add numel()==0 check to prevent crashes
- Clean up and formatting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94226
Approved by: https://github.com/kulinseth
2023-02-07 01:54:18 +00:00
Nikita Shulga
10a1efb49f [MPS] Fix cumsum for negative indexes (#94119)
Use `wrap_dim` to get the dim in range, or raise an IndexError otherwise

Add a test for that
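
A sketch of the two behaviors covered (assumed reproduction):

```python
import torch

x = torch.ones(2, 3, device="mps")
print(torch.cumsum(x, dim=-1))  # negative dim wraps to dim=1
try:
    torch.cumsum(x, dim=5)      # out of range: raises IndexError
except IndexError as e:
    print(e)
```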

Addresses feedback raised in https://github.com/pytorch/pytorch/pull/88319#issuecomment-1403541180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94119
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2023-02-05 18:21:29 +00:00
Nikita Shulga
8a88852d5f [MPS] Fix index_select for empty input (#94117)
Also add test for this case to `test_index_select`
Fixes https://github.com/pytorch/pytorch/issues/93877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94117
Approved by: https://github.com/orionr
2023-02-05 05:45:57 +00:00
Jane Xu
b90496eef5 [nn] zero_grad() set_to_none default True (#92731)
Attempts to fix #92656

BC-breaking! This changes the default of zero_grad in optim and in nn to default set grads to None instead of zero tensors. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (will probably have to flesh out this note more).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
2023-01-26 01:04:28 +00:00
Li-Huai (Allan) Lin
ccbdf49582 [MPS] Fix index_select scalar input with multiple indices (#91064)
Support operations like this:

```
device="mps"
arr = torch.tensor(10, device=device)
indices = torch.tensor([0, 0], device=device)  # multiple indices
torch.index_select(arr, 0, indices)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91064
Approved by: https://github.com/kulinseth
2023-01-19 14:08:02 +00:00
lezcano
46a81c8db7 Deprecate .mT,.T,.mH,.H on 0D tensors (#92143)
As discussed with @ngimel, this is not only not documented,
but also an unnecessary edge case. See https://github.com/pytorch/pytorch/pull/90463#discussion_r1064807197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92143
Approved by: https://github.com/ngimel
2023-01-17 16:54:35 +00:00
Denis Vieriu
0a677f2335 [MPS] Add testcase for copying cpu tensors into strided mps tensors (#91784)
Fixes https://github.com/pytorch/pytorch/issues/86975

If the destination is a strided MPS tensor and the source is a CPU tensor, we cannot perform a blit directly to copy the memory from the CPU tensor into the MPS tensor. We need to scatter the data into the right indices.
```
        a1 = torch.Tensor([[1,2],[3,4], [5,6]]).to(torch.device("mps"))
        b1 = torch.Tensor([-1, -1])
        a1[1:,1] = b1  # strided MPS destination / contiguous CPU source
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91784
Approved by: https://github.com/kulinseth
2023-01-10 22:45:48 +00:00
Denis Vieriu
e0b82d7d1f [MPS] Fix convolution `Source and weight input channels mismatch' crash (#91822)
Fixes crashes in conv input/weight backward passes due to NCHW / NHWC formats.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91822
Approved by: https://github.com/razarmehr
2023-01-10 18:30:18 +00:00
Denis Vieriu
0ec3c5bc72 [MPS] Reduce ops multi axes support (#91734)
Currently, most of the reduction ops are flattening the input tensor to 1D to perform the operation.
This change removes the flattening of the tensors / the unranked placeholders and adds support for multiple axes in all the reduction ops.

- Fixes reduction ops with correctness and shape issues.
- Fixes masked.argmax / masked.argmin. In case of passing inf to argmax / argmin, MPS will return nan as index for these numbers. Casting this nan to Long will make it -1. This change avoids negative values by clamping them to 0 (matching CPU results).

TestConsistency issues fixed:
```
std
var
amax
amin
sum
prod
mean
count_nonzero
masked.amax
masked.amin
masked.mean
masked.prod
masked.std
masked.sum
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91734
Approved by: https://github.com/kulinseth
2023-01-09 10:55:11 +00:00
Denis Vieriu
53ef96faae [MPS] Add support for randperm (#91708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91708
Approved by: https://github.com/kulinseth
2023-01-06 22:49:06 +00:00
Ramin Azarmehr
87164ace51 [MPS] Fix the ChannelsLast memory format in cat_out_mps() (#91786)
- Fixed the memory leak with the `malloc()`
- Introduced shortened data type strings (optional) to avoid getting extra long cached graph string keys with ops such as cat_out()
- Fixed data type issues in Monterey
- Removed the unused `use_scalar_value` argument from `getTensorsStringKey()`
- Clean up and refactoring

Fixes #89353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91786
Approved by: https://github.com/kulinseth
2023-01-06 17:28:49 +00:00
Ramin Azarmehr
2f0e4839ee [MPS] Fix correctness issues with Pooling ops (#91519)
- Workaround for MaxPool when ceilMode=true
- Workaround for ChannelsLast memory format
- Workaround for divisor_override in AvgPool ops
- Enabled count_include_pad parameter for AvgPool
- Refactoring and clean up of duplicate code
- Enable MaxPool tests in TestConsistency
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91519
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-01-06 01:35:46 +00:00
Denis Vieriu
1a0738f599 [MPS] Add support for torch.linalg.cross (#91642)
* Add support for torch.linalg.cross
* Make use of `metal::cross` for float and half. For the other dtypes implement cross manually

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91642
Approved by: https://github.com/razarmehr, https://github.com/malfet
2023-01-05 14:48:34 +00:00
Ramin Azarmehr
229f12bf6a [MPS] Implement nan_to_num() for MPS backend (#91110)
Added a test case, and also enabled it in TestConsistency

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91110
Approved by: https://github.com/malfet, https://github.com/kulinseth
2023-01-05 02:17:48 +00:00
Ramin Azarmehr
b44d46702a [MPS] Fix correctness issues with Upsample 1D and 2D (#91669)
- Implemented following new ops: upsample_nearest1d_backward
upsample_nearest_exact1d
upsample_nearest_exact1d_backward
- Moved Upsample code from Shape.mm to Upsample.mm
- Fallback to CPU for nearest mode on Monterey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91669
Approved by: https://github.com/malfet
2023-01-05 00:48:54 +00:00
Ramin Azarmehr
7dd28e9e83 [MPS] Fix data type and shape issues in Scatter and Gather ops (#91514)
- Clean up redundant code and headers
- Move scatter/gather ops from block list to allow list in TestConsistency
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91514
Approved by: https://github.com/kulinseth
2023-01-04 23:20:01 +00:00
Kulin Seth
fc59664ef4 [MPS] Add Unique and unique_consecutive ops. (#88532)
Add a check for macOS 13.0
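A minimal usage sketch (assumes macOS 13.0 or newer, per the check above):
```
import torch

x = torch.tensor([1, 1, 2, 2, 3, 1], device="mps")
torch.unique(x)              # tensor([1, 2, 3]) (sorted by default)
torch.unique_consecutive(x)  # tensor([1, 2, 3, 1])
```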

Fixes #88487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88532
Approved by: https://github.com/malfet
2023-01-04 22:15:13 +00:00
Ramin Azarmehr
13de5a0150 [MPS] Fix the right padding bug in Monterey (#91522)
- Workaround for the bool type bug in padding (needed for both Monterey and Ventura)
- Move the recently fixed padding tests of TestConsistency to AllowList

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91522
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth, https://github.com/malfet
2023-01-04 22:00:37 +00:00
Denis Vieriu
80394bb734 [MPS] Register norm_dtype_out_mps and cdist (#91643)
Add support for `norm_dtype_out` and `cdist` ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91643
Approved by: https://github.com/razarmehr
2023-01-04 02:20:53 +00:00
Denis Vieriu
38de981e16 [MPS] Add nonzero mps support (#91616)
Adds nonzero support for mps:

**Pseudocode**:
```
// inputTensor   = [1,  0,  0,  3]
// inputNonZero  = [1,  0,  0,  1] (input != 0)
// scan          = [1,  1,  1,  2] (prefix sum)
// maskedIndices = [0, -1, -1,  1] (select)
// coordinates   = [0,  1,  2,  3] (coordinateAlongAxis)
// scatterResult = [0,  3]         (scatter)
```
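A 1D Python sketch of the same steps (illustrative eager code, not the MPSGraph implementation):
```
import torch

def nonzero_1d(inp):
    mask = inp != 0                                # inputNonZero
    scan = torch.cumsum(mask.to(torch.int64), 0)   # prefix sum
    coords = torch.arange(inp.numel())             # coordinateAlongAxis
    out = torch.empty(int(mask.sum()), dtype=torch.int64)
    out.scatter_(0, scan[mask] - 1, coords[mask])  # scatter selected coordinates
    return out

nonzero_1d(torch.tensor([1, 0, 0, 3]))             # tensor([0, 3])
```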
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91616
Approved by: https://github.com/razarmehr
2023-01-04 00:02:24 +00:00
Ramin Azarmehr
688e351970 [MPS] Implement MPSGenerator to enable manual random seeding (#91348)
This patch adds support for creating torch.Generator for MPS device, and enables its functions such as manual_seed, get_state, and set_state.
Fixes #84288 and #84516
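A usage sketch of the new generator functions (illustrative values):
```
import torch

g = torch.Generator(device="mps")
g.manual_seed(42)
a = torch.randn(3, device="mps", generator=g)
state = g.get_state()                          # snapshot the RNG state
b = torch.randn(3, device="mps", generator=g)
g.set_state(state)                             # rewind
c = torch.randn(3, device="mps", generator=g)  # c matches b
```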
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91348
Approved by: https://github.com/malfet, https://github.com/albanD
2023-01-03 16:01:19 +00:00
Denis Vieriu
f7939b21e1 [MPS] Add bincount support for mps (#91267)
Add support for bincount on MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91267
Approved by: https://github.com/razarmehr
2023-01-03 06:01:07 +00:00
Denis Vieriu
dbf96164be [MPS] Add suport for casting updatesTensor directly in scatter (#91197)
Fixes copies into slices where the input data type is different than the output dtype.

This change removes the cast done before scatter, so we don't have to allocate additional memory to perform the casting. Scatter handles the casting directly now.

device = "mps"
shape = (4, 4)
tensor = torch.randint(10, shape, device=device)
tensor_before = tensor.clone()
res = torch.empty(shape[0], shape[1] * 2, device=device)[:, ::2].copy_(tensor)
torch.testing.assert_close(tensor, tensor_before)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91197
Approved by: https://github.com/razarmehr
2023-01-02 16:31:27 +00:00
Denis Vieriu
bdbf188c80 [MPS] Exclude int64 dtype from reduction ops (#91272)
Reduction ops don't support the int64 data type. This PR adds an assertion when int64 is used for min / max reduction ops.
All other integer dtypes are cast to int32.
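An illustrative sketch of the resulting behavior (not code from the PR):
```
import torch

x32 = torch.arange(8, dtype=torch.int32, device="mps")
x32.amax()      # ok; smaller integer dtypes are cast to int32 internally
x64 = torch.arange(8, dtype=torch.int64, device="mps")
# x64.amax()    # now asserts instead of silently computing a wrong result
```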
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91272
Approved by: https://github.com/razarmehr, https://github.com/malfet
2022-12-23 17:30:42 +00:00
Ramin Azarmehr
6485d2609a [MPS] Fix data type issues in Binary Ops (#91151)
- Cast to unsigned type when comparing signed vs. unsigned integers
- Refactor and cleanup logaddexp() ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91151
Approved by: https://github.com/malfet
2022-12-23 17:11:55 +00:00
Denis Vieriu
4477a5b691 [MPS] Register unfold key for MPS (#91266)
Register unfold key for MPS (uses generic implementation that's already existent).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91266
Approved by: https://github.com/razarmehr
2022-12-22 21:21:04 +00:00
Nikita Shulga
fd3a7264ae [MPS] Add group_norm[fwd+backward] and mean_var (take 2) (#91190)
Use Prims to implement group_norm, group_norm_backward and mean_var

Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages in order to make them importable from `torch/backends/mps/__init__.py`, since this alias, defined in
15af4b1cee/torch/__init__.py (L1095),
is executed last during the init process.

Add `__all__` to `torch/backends/mps/__init__.py` as well as alias all imports as private

Add `TestNNMPS.test_group_norm_backward` that validates no NaNs are generated during the backward pass

Fixes https://github.com/pytorch/pytorch/issues/88331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
2022-12-22 08:54:37 +00:00
Denis Vieriu
81a9a0ac07 [MPS] Fix gather for uint8 dtype in index_select (#91047)
Use int8 instead of uint8 for MPS Gather/Scatter (uint8 is broken in macOS Monterey)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91047
Approved by: https://github.com/razarmehr
2022-12-21 19:48:46 +00:00
Li-Huai (Allan) Lin
b7f35e4104 [MPS] Fix index_add with non-f32 inputs (#88542)
The `multiplicationWithPrimaryTensor` and/or `scatterWithDataTensor` APIs have issues handling two f16 tensor inputs, resulting in all-zero outputs. There are issues with int16 or int64 inputs as well.

This PR conditionally casts inputs to f32 if they aren't already, then casts the output back to the source's datatype.

Fixes #82645.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88542
Approved by: https://github.com/kulinseth
2022-12-21 05:31:03 +00:00
Ramin Azarmehr
a274b5b99e [MPS] Fix data type issues in Unary ops (#91120)
Refactored sigmoid() and log1p()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91120
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth
2022-12-21 02:42:59 +00:00
Nikita Shulga
dd735b96df [MPS] Fix torch.std/torch.var default/correction handling (#91203)
If `torch.std` or `torch.var` is invoked without any arguments, it should be assumed that `unbiased` is `True`.

Also, if the `correction` parameter is specified, it should be used in the correction computation.
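A quick property check of both rules (runs on any device; values illustrative):
```
import torch

x = torch.randn(100)
# No arguments: unbiased, i.e. correction=1, is assumed.
assert torch.allclose(torch.std(x), torch.std(x, correction=1))
# An explicit correction must actually change the result.
assert not torch.allclose(torch.std(x, correction=0), torch.std(x, correction=1))
```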

Test by adding `std` and `var` to consistency tests

Fixes https://github.com/pytorch/pytorch/issues/91198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91203
Approved by: https://github.com/kit1980
2022-12-21 02:23:50 +00:00
Ramin Azarmehr
b63f0311a5 [MPS] Add floor_divide() op and its test case (#91126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91126
Approved by: https://github.com/malfet
2022-12-20 17:02:29 +00:00
Kulin Seth
8ecb49b8fb [MPS] Add Inverse op. (#90428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90428
Approved by: https://github.com/DenisVieriu97, https://github.com/malfet
2022-12-19 22:00:12 +00:00
Nikita Shulga
3859aace20 [MPS] Skip tests broken on Ventura (#90843)
Also add `torch.backends.mps.is_macos13_or_newer`
See https://github.com/pytorch/pytorch/issues/85758
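A usage sketch for the new helper:
```
import torch

if torch.backends.mps.is_available() and torch.backends.mps.is_macos13_or_newer():
    x = torch.randn(4, device="mps")  # safe to use Ventura-only ops here
```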

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90843
Approved by: https://github.com/kulinseth, https://github.com/albanD
2022-12-14 19:51:00 +00:00
Li-Huai (Allan) Lin
544756ae5e Fix mps constant pad (#89864)
Support arbitrary dimensions for constant padding on MPS
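A minimal sketch of padding beyond the last two dims (shapes illustrative):
```
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4, 5, device="mps")
y = F.pad(x, (1, 1, 2, 2, 3, 3), mode="constant", value=0.0)  # pads last 3 dims
assert y.shape == (2, 9, 8, 7)
```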

Fixes #89624
Fixes #87277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89864
Approved by: https://github.com/kulinseth, https://github.com/malfet
2022-12-13 17:28:54 +00:00
Denis Vieriu
b71c710db1 Add additional tests for view slice tensors (#86282)
Fixes https://github.com/pytorch/pytorch/issues/83995 and https://github.com/pytorch/pytorch/issues/84489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86282
Approved by: https://github.com/kulinseth
2022-12-08 17:59:55 +00:00
PyTorch MergeBot
cba96366a2 Revert "remove torch.equal usages (#89527)"
This reverts commit 4095ef8b80.

Reverted https://github.com/pytorch/pytorch/pull/89527 on behalf of https://github.com/clee2000 due to broke periodic multigpu tests 4095ef8b80 https://github.com/pytorch/pytorch/actions/runs/3592806602/jobs/6049368502
2022-12-02 21:36:13 +00:00
Philip Meier
4095ef8b80 remove torch.equal usages (#89527)
Preparation for the next PR in this stack: #89559.

I replaced

- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).

There were a few instances where the result of `torch.equal` is used directly. In those cases I've replaced it with `(... == ...).all().item()`, sometimes also dropping the `.item()` depending on the context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
2022-12-01 11:22:52 +00:00
Sergii Dymchenko
09f2373ec0 Fix TODOs related to #38095 in test_mps.py (#89815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89815
Approved by: https://github.com/weiwangmeta, https://github.com/kulinseth
2022-11-30 17:00:36 +00:00
Thomas
4935b597ac Added implementation and tests for MPS Hardswish (#87952)
## What?
Fixes issue #86807 by adding MPS backend support for aten::hardswish.

## How?
Registered mps hardswish functions in native_functions.yaml, and added the code implementation to Activations.mm.

Added functions:
- hardswish_mps
- hardswish_mps_
- hardswish_backward_mps
- hardswish_out_mps
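A usage sketch exercising the new kernels (illustrative; assumes an MPS device):
```
import torch
import torch.nn.functional as F

x = torch.randn(4, device="mps", requires_grad=True)
y = F.hardswish(x)      # forward: hardswish_mps / hardswish_out_mps
y.sum().backward()      # backward: hardswish_backward_mps
```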

## Testing
Added test in test/test_mps.py and tested code using the command `python3 test/test_mps.py -k test_hardswish`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87952
Approved by: https://github.com/kulinseth, https://github.com/kit1980
2022-11-23 02:18:03 +00:00
Edward Z. Yang
dbeacf1182 Fix cat striding in PrimTorch (#89332)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89332
Approved by: https://github.com/ngimel
2022-11-20 04:05:33 +00:00
PumeTu
fc1c0cd3ef Add support trace on MPS backend (#87910)
Fixes [#87221](https://github.com/pytorch/pytorch/issues/87221)
`trace` now supported on MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87910
Approved by: https://github.com/kulinseth, https://github.com/malfet
2022-11-18 07:24:33 +00:00
Raman kumar
fd0efb01a7 [MPS] Support for median with dim (#88807)
## Summary 

**Aim**: Add support for aten::median for MPS backend (Fixes #87220)

This is fresh clean PR from the previous [PR](https://github.com/pytorch/pytorch/pull/88554)

- Implementing the new median function in aten/src/ATen/native/mps/operations/ReduceOps.mm
- Adding it to aten/src/ATen/native/native_functions.yaml
- Adding it to existing test_median

### **How it works** 🪶
median of the entire input tensor on MPS
`torch.median(mps_inputTensor)`
median along a dim
`torch.median(mps_inputTensor, dim=[int], keepdim=[Bool])`
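A concrete sketch of both call forms (shapes illustrative):
```
import torch

x = torch.randn(4, 5, device="mps")
m = torch.median(x)                                     # whole-tensor median
values, indices = torch.median(x, dim=1, keepdim=True)  # median along a dim
```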
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88807
Approved by: https://github.com/kulinseth
2022-11-18 02:53:42 +00:00
Lukas Hoenig
81a8fdc40d [MPS] Add binary operations dtype precedence test case (#87545)
See https://github.com/pytorch/pytorch/pull/84742 and https://github.com/pytorch/pytorch/pull/78319.

The test case tests that
- for the binary operations (add, sub, mul, div),
- for all data types (dtypes),
- for a range of representative values and their combinations,
- for various shapes and ways of creating the test tensors,

the contents and dtype of the result tensor are identical for the MPS and CPU backends.

It adds about 15-18s runtime to `test_mps.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87545
Approved by: https://github.com/kit1980
2022-11-17 04:54:27 +00:00
Nikita Shulga
62ef15e320 [MPS] Fix test_embedding_dense_backward (#88847)
By copying the randomly initialized weight distribution from the MPS `nn.Embedding` to the `cpu` one

Test plan: `python test_mps.py -k test_embedding_dense_backward --repeat 150`

Fixes https://github.com/pytorch/pytorch/issues/88679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88847
Approved by: https://github.com/seemethere
2022-11-10 23:52:27 +00:00
Li-Huai (Allan) Lin
7c353eb395 [MPS] Fix softplus (#88555)
1. Fixes #87780
2. Fixes mps graph cache issue
3. Adds proper tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88555
Approved by: https://github.com/kulinseth
2022-11-10 09:40:08 +00:00
Nikita Shulga
078c25df13 [MPS][BE] Code cleanup (#88529)
Various code cleanup in MPS operations:
 - Per @kulinseth's suggestion, move `mpsSupportsCumsum` to `MPSDevice.h` and rename it to `is_macos_13_or_newer()`
 - Move Ventura MPSGraph new operators to `MPSGraphVenturaOps.h` header
 - Use `LookupAs` and `CreateCachedGraphAs` to make code more compact
 - Formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88529
Approved by: https://github.com/kulinseth
2022-11-08 21:10:07 +00:00
Li-Huai (Allan) Lin
15e54293ef [MPS] Fix embedding backward with scalar index (#82809)
### Description
Previously, embedding backward always expanded a `-1` dim onto the indices, resulting in the following error when the indices tensor is a scalar:

```
 error: Rank of data array must equal number of outer dimensions in indices array + rank of slice to update, 2 != 1 + 0
-:8:10: note: see current operation: %5 = "mps.scatter_nd"(%0, %arg1, %4) {batch_dims = 0 : ui32, mode = 0 : i32} : (tensor<10x5xf16>,
```

Now the expansion is conditional.

Reproducer:

```python
def repro():
    w = torch.tensor([[-2.6465,  2.5859,  0.4688,  1.7949,  3.2676],
        [-3.1641,  8.9375,  5.7578, -2.9453, -6.5469],
        [ 2.0469,  1.3516, -8.7344,  6.0000,  1.3906],
        [ 6.5781,  7.8438,  6.9766,  3.2891, -5.1172],
        [-7.9414,  7.7344,  4.1875,  2.8574,  2.9531],
        [-0.4844, -5.6328, -6.8359, -4.5156,  3.7891],
        [ 4.9375,  6.6094,  6.7031,  0.6719, -6.4219],
        [ 7.0469,  8.2031,  4.4453,  1.7129, -2.4688],
        [ 1.2207, -3.3750, -2.4531,  7.4062, -6.0469],
        [-8.9688,  2.2656,  2.4160, -1.0176,  8.4531]], dtype=torch.float32, requires_grad=True)
    x = torch.tensor(5)
    out = torch.nn.functional.embedding(x, w)
    out.sum().backward()

    w_mps = w.detach().clone().to("mps").requires_grad_()
    x_mps = x.to("mps")
    out = torch.nn.functional.embedding(x_mps, w_mps)
    out.sum().backward() # error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82809
Approved by: https://github.com/malfet
2022-11-04 19:43:56 +00:00
Nikita Shulga
657f2e12f0 [MPS] Add native cumsum implementation (#88319)
Using https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/4057333-cumulativesumwithtensor?language=objc

Fall back to CPU if running on older macOS versions.
In `unary_op`, add the output tensor's dims/dtype to the graph key (since even in the default op we check the output graph type).
Also, upcast int16 to int32, as the MPS cumsum op on Ventura returns incorrect results for the int16 type (and doing so makes total sense for int8 as well, since the chances of overflow are very high).
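A minimal usage sketch of the int16 path (illustrative):
```
import torch

x = torch.tensor([1, 2, 3], dtype=torch.int16, device="mps")
y = torch.cumsum(x, dim=0)  # upcast internally so Ventura returns correct sums
```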
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88319
Approved by: https://github.com/kulinseth
2022-11-04 01:22:41 +00:00
Philip Meier
bc73affdad prepare removal of deprecated functionality in torch.testing (#87969)
_Redo of #86586 with all BC breaking changes granularly placed into separate commits._

---

Per title. Deprecation happened on Feb 25, 2022 in c6f1bbc0ac, which made it into the 1.12 release. Since it is now 245 days later and the next release will be 1.14, the removals later in the stack comply with the [BC policy](https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#minimizing-the-disruption-of-bc-breaking-changes).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87969
Approved by: https://github.com/mruberry
2022-11-02 14:04:48 +00:00
arnaudstiegler
16e35bd179 Adding expm1 to MPS (#87147)
Fixes #86744

- Implementing the new `expm1_out_mps` function in `aten/src/ATen/native/mps/operations/UnaryOps.mm`
- Adding it to `aten/src/ATen/native/native_functions.yaml`
- Adding it to existing `test.test_mps.TestNLLLoss.test_unary_ops`
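A minimal usage sketch; `expm1(x) = exp(x) - 1`, computed accurately for small `x`:
```
import torch

x = torch.tensor([1e-8, 0.5], device="mps")
y = torch.expm1(x)  # avoids the precision loss of computing exp(x) - 1 near zero
```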

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87147
Approved by: https://github.com/kulinseth
2022-10-26 17:45:46 +00:00
Daniel Falbel
e818574e78 Support signbit in MPS. (#87214)
Implements the signbit operator for MPS. Links to #77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87214
Approved by: https://github.com/kulinseth, https://github.com/kit1980
2022-10-25 07:12:31 +00:00
Alex
620dbc43d8 Slowly introduce ops to be tested by test_numpy_ref on MPS backend (#87342)
Enable a test that would have caught https://github.com/pytorch/pytorch/issues/86239

Prior to the fix for that bug, this test fails with

```
_____________________________ TestCommonMPS.test_numpy_ref_mps_where_mps_float32 _____________________________
Traceback (most recent call last):
  File "/Users/alex/git/pytorch/test/test_ops.py", line 197, in test_numpy_ref_mps
    self.compare_with_reference(
  File "/Users/alex/git/pytorch/torch/testing/_internal/common_utils.py", line 2366, in compare_with_reference
    actual = torch_fn(t_inp, *t_args, **t_kwargs)
  File "/Users/alex/git/pytorch/torch/testing/_internal/opinfo/core.py", line 1068, in __call__
    return self.op(*args, **kwargs)
  File "/Users/alex/git/pytorch/torch/testing/_internal/common_methods_invocations.py", line 15167, in <lambda>
    op=lambda self, condition, other: torch.where(condition, self, other),
RuntimeError: 0'th index 3 of x tensor does not match the other tensors
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87342
Approved by: https://github.com/albanD
2022-10-21 19:03:00 +00:00
Nikita Shulga
ae62cf7c02 [MPS] Revamp copy_to_mps_ implementation (#86956)
Tensor's view in linear storage is represented by the following parameters: `.shape`, `.stride()` and `.storage_offset()`.

Only tensors that are representable as 1d-views can be copied from host to device (and vice versa) using a single [`copy(from:sourceOffset:to:destinationOffset:size:)`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/1400767-copyfrombuffer?language=objc) call.

Modify the `copy_to_mps_` function to do the following steps:
- Cast `src` tensor to dst data type if needed
- Expand `src` tensor to `dst` tensor shape
- Clone `src` tensor if it is not stride contiguous (i.e. cannot be represented by `src.view(src.numel())`)
- Create an empty tensor if `dst` is not stride-contiguous or if its strides differ from the (potentially cloned) `src` strides
- Do a 1d copy from `src` to the (potentially temp) `dst`
- Finally, do re-striding/copy on MPS if needed

Add tests to cover the cases where a stride-contiguous permuted tensor is copied to MPS, a non-stride-contiguous tensor is copied to MPS, and a permuted CPU tensor is copied to a differently permuted MPS tensor (see the sketch below).
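A sketch of what "stride contiguous" means here (illustrative shapes, not the test code):
```
import torch

a = torch.randn(1, 3, 4).permute(1, 0, 2)  # permuted, yet still flattenable
a.view(a.numel())                          # ok -> eligible for a single 1d blit
b = torch.randn(3, 4).t()                  # transposed: not stride contiguous
# b.view(b.numel())                        # would raise; the copy path clones first
b_mps = b.to("mps")                        # handled by the multi-step path above
```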

Fixes https://github.com/pytorch/pytorch/issues/86954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86956
Approved by: https://github.com/kulinseth
2022-10-21 14:10:05 +00:00
Nikita Karetnikov
1b8af28fe8 [primTorch] Add refs for softmax, softmin, log_softmax (#84956)
cc @ezyang @mruberry @ngimel @Lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84956
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-20 12:29:04 +00:00
Nikita Shulga
13cff2ee8e [MPS] Copy from CPU always add storageOffset (#86958)
Because why wouldn't it?
Fixes https://github.com/pytorch/pytorch/issues/86052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86958
Approved by: https://github.com/kulinseth
2022-10-14 17:35:18 +00:00
Nikita Shulga
692b525b71 [MPS] Extend unary ops to int64 (#86615)
Most of them are already supported for `int64` except for:
 - rounding operations (`floor`, `ceil` and `round`), which are no-ops for integral types anyway
 - the sign operation, which can be emulated by clamping the tensor to the [-1, 1] range (see the sketch below)
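A quick property check of the identity used (illustrative):
```
import torch

x = torch.tensor([-7, -1, 0, 3], dtype=torch.int64)
assert torch.equal(torch.sign(x), torch.clamp(x, -1, 1))  # holds for integers
```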

Tested the new types via the MPS test suite

Fixes https://github.com/pytorch/pytorch/issues/86319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86615
Approved by: https://github.com/DenisVieriu97, https://github.com/huydhn
2022-10-12 00:32:53 +00:00
Nikita Shulga
b7b5bd47ae [MPS] Implement frac operator (#86625)
Implemented as the combination `self - trunc(self)`
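A quick property check of the identity (illustrative values):
```
import torch

x = torch.tensor([-1.5, -0.2, 0.7, 2.25])
assert torch.equal(torch.frac(x), x - torch.trunc(x))
```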

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86625
Approved by: https://github.com/kulinseth, https://github.com/albanD
2022-10-10 20:36:22 +00:00
Alex
ca69ddb4f7 Fix broadcasting to implicit leading dimensions in torch.where on MPS (#86240)
Fixes #86239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86240
Approved by: https://github.com/kulinseth
2022-10-07 01:38:57 +00:00
Nikita Shulga
fa799132d8 [MPS] Better error message for slow_conv2d_forward (#86303)
The error `Could not run 'aten::_slow_conv2d_forward' with arguments from the 'MPS' backend.` is very misleading, as usually this method is only invoked when the input is on the CPU but the weights are on the MPS device.
Raise a more user-friendly error in this case.

Add test to `test_invalid_conv2d` to check for those conditions.

Fixes https://github.com/pytorch/pytorch/issues/77931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86303
Approved by: https://github.com/kulinseth
2022-10-06 15:38:57 +00:00
Nikita Shulga
97d2e1df55 [MPS] Fix GELU for torch.half (#86218)
Also, make sure it raises catchable errors if invoked with integral types

Otherwise, it used to fail with the following fatal error when invoked for `torch.half` (and with similar errors for integral types):
```
loc("mps_multiply"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/4883e71d-37bd-11ed-b0ef-b25c5e9b9057/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<2xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```

Modified `test_gelu_simple` to check both fwd and backward gradients for gelu
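A minimal repro-style sketch of the now-working path (illustrative):
```
import torch
import torch.nn.functional as F

x = torch.randn(8, device="mps", dtype=torch.half, requires_grad=True)
F.gelu(x).sum().backward()  # previously hit the MPSGraph broadcast error
```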
2022-10-05 09:09:17 -07:00
Kulin Seth
6a842e33c6 MPS: Add multinomial op (#80760)
Add multinomial with replacement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80760
Approved by: https://github.com/razarmehr, https://github.com/malfet
2022-10-03 21:05:30 +00:00
Abhishek Pathak
8860e48994 [MPS] Handle compatible inputs to where (#85946)
Inputs with different numbers of dimensions but compatible shapes were being rejected.

e.g. x.shape = [10,1,10]
y.shape = [10,10]
cond.shape = [10,10,1]
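These shapes broadcast to a common [10, 10, 10] result; a minimal check (illustrative):
```
import torch

cond = torch.zeros(10, 10, 1, dtype=torch.bool, device="mps")
x = torch.randn(10, 1, 10, device="mps")
y = torch.randn(10, 10, device="mps")
out = torch.where(cond, x, y)
assert out.shape == (10, 10, 10)
```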
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85946
Approved by: https://github.com/malfet
2022-10-03 18:12:48 +00:00
Nikita Shulga
b9b24c31fd [MPS] Fix non-contig to contig tensor copy (#86056)
This handles a rare case when MPS tensor is constructed from non-contiguous CPU tensor.
Fixes https://github.com/pytorch/pytorch/issues/85967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86056
Approved by: https://github.com/janeyx99
2022-10-02 20:13:05 +00:00