Commit Graph

743 Commits

Author SHA1 Message Date
PyTorch MergeBot
38645e8a3e Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 8aedc649bd.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))
2024-10-29 04:54:37 +00:00
Nikita Shulga
652a2ab93e [BE] Skip print(foo) tests (#139009)
Skipped `test_exponential` and `test_multinomial` because simply printing the result of an operator does not constitute a test. The testing framework does not attempt to interpret the output.
Modify `test_print_non_contiguous` to get the tensor's string representation, which is an equivalent operation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139009
Approved by: https://github.com/Skylion007
2024-10-27 18:04:03 +00:00
Scott Wolchok
a3de067975 [PyTorch] Use 128-bit vectors for ARM64 (#137426)
The correct vector length for ARM64 is 128 bits (16
bytes). We were previously using double this, apparently just because
that would be the same length as AVX2.

Differential Revision: [D63984039](https://our.internmc.facebook.com/intern/diff/D63984039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137426
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655, #138716, #138744
2024-10-26 00:20:35 +00:00
Nikita Shulga
1b31248933 [EZ] Fix typo in test_mps.py (#138738)
s/emedding_weight/embedding_weight/

Stolen from 074766d9b4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138738
Approved by: https://github.com/atalman
2024-10-23 22:15:35 +00:00
Tom Ritchford
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
Tom Ritchford
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
Nikita Shulga
de16159e56 [MPS] Fix sliced cast (#138314)
This fixes an internal crash caused by an invalid buffer size computation when the sliced API is used

Not sure what the purpose was of
```c++
IntArrayRef baseShape;
if (src.is_view()) {
  baseShape = src._base().sizes();
} else {
  baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
  flattenedShaped *= baseShape[i];
}
```
As `flattenedShaped` could be computed much more easily as `[srcBuf length]/src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do.

When someone allocated a buffer to hold, say, uint8 and then view-casted it
to float16, the attempt to compute `baseShape` returned the sizes of the original
tensor in its data type, rather than the sizes in the new dtype
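
The difference between the two element counts can be illustrated with a minimal sketch in plain Python. This is not the MPS code: the helper names `count_from_base_shape` and `count_from_buffer` are hypothetical, and element sizes are taken from the standard `struct` module rather than from a tensor.

```python
import struct
from math import prod

# Element sizes in bytes ("B" = uint8, "e" = float16).
UINT8_SIZE = struct.calcsize("B")
FLOAT16_SIZE = struct.calcsize("e")


def count_from_base_shape(base_shape):
    """Old approach: multiply the sizes of the base tensor, ignoring the view's dtype."""
    return prod(base_shape)


def count_from_buffer(buffer_nbytes, element_size):
    """New approach: divide the buffer length in bytes by the view's element size."""
    return buffer_nbytes // element_size


# A buffer allocated for 16 uint8 values, then view-casted to float16.
base_shape = (16,)
buffer_nbytes = prod(base_shape) * UINT8_SIZE

stale = count_from_base_shape(base_shape)                   # counts uint8 elements
correct = count_from_buffer(buffer_nbytes, FLOAT16_SIZE)    # counts float16 elements
print(stale, correct)
```

The first count describes the buffer in the original dtype, so it overstates the number of float16 elements by a factor of two in this example.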

Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
2024-10-19 05:17:09 +00:00
PyTorch MergeBot
7b39fb5712 Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 9f81270d75.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))
2024-10-18 20:09:40 +00:00
Tom Ritchford
9f81270d75 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-17 21:27:35 +00:00
PyTorch MergeBot
4b3035f2fe Revert "Add decomposition for permute_copy (#130944)"
This reverts commit e7a4ad3b40.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))
2024-10-16 23:18:53 +00:00
Tom Ritchford
e7a4ad3b40 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-15 13:51:20 +00:00
Nikita Shulga
0786b37260 [MPS] Add i0 op (#137849)
More-or-less verbatim copy of 47c8aa8090/aten/src/ATen/native/Math.h (L101)
plus a bit of MPS boilerplate code

Update test_mps.py to mark kaiser_window and i0 as passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137849
Approved by: https://github.com/Skylion007
2024-10-14 22:50:01 +00:00
Nikita Shulga
ad38bad766 [MPS] Add tri[lu]_indices (#137648)
Requested in https://github.com/pytorch/pytorch/issues/77764#issuecomment-2402365980
Copy-n-paste kernel implementation from 13cf8360d8/aten/src/ATen/native/cuda/TensorFactories.cu (L92)

though use `float` instead of `double` for square root computation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137648
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #137601, #137647
2024-10-10 23:41:06 +00:00
Nikita Shulga
13cf8360d8 [MPS] Fix testing for generator operators (#137601)
Before this change, tests for operators like `eye` or `triu_indices` were essentially a test that the respective CPU operators are stable, as cpu_sample and mps_sample were the same

Moved the logic to `transform_opinfo_sample_to_mps`, which in addition to copying tensors also tweaks `kwargs`

Discovered that:
 - `torch.randn` and `torch.randint` fall into the same undefined category
 - `torch.logspace` is not implemented for MPS
 - Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
 - `torch.triu_indices` is not implemented (PR is coming, this is how I've discovered this problem)
 - `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
2024-10-09 23:17:11 +00:00
Nikita Shulga
3d0cb81594 [MPS] Enable bfloat16 testing (#136987)
By even further reducing the precision of imprecise FP16 ops, introducing a new BF16_LOW_PRECISION_OPS category, and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the reason for the low-precision results is that MPSGraph, unlike the rest of PyTorch, does not accumulate in fp32 for reduction operations
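
The effect of the accumulator type on a reduction can be sketched in plain Python. This is an illustration only, not MPSGraph behavior: `to_bf16` is a hypothetical helper that emulates bfloat16 rounding by keeping the top 16 bits of a float32, and a Python float stands in for the wider accumulator.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sum_in_bf16(values):
    """Accumulate in bfloat16: every partial sum is rounded back to bfloat16."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


def sum_in_wider_type(values):
    """Accumulate in a wider type and round only the final result."""
    acc = 0.0
    for v in values:
        acc += to_bf16(v)
    return to_bf16(acc)


ones = [1.0] * 4096
print(sum_in_bf16(ones), sum_in_wider_type(ones))
```

With a bfloat16 accumulator the running sum stops growing once adding 1.0 no longer changes the rounded value, while the wider accumulator reaches the exact total.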

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
2024-10-01 17:10:07 +00:00
Tom Ritchford
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
Nikita Shulga
4af03e54b7 [MPS][BE] Use None as alias for all types (#137004)
Tests like `new_*` and `empty_*` fail with the current implementation, see
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137004
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986, #137003
2024-09-30 19:06:13 +00:00
Nikita Shulga
c610aa80dc Testing: Unblock new_* testing on MPS (#137003)
By changing `other_dtype` to `torch.half` rather than `double` in
`sample_inputs_new_fns` if MPS is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986
2024-09-30 19:06:12 +00:00
Nikita Shulga
283bda01aa [MPS] Error checking/bf16 support for torch.normal (#136863)
Before this change, an attempt to run something like
```
% python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))"
```
resulted in a hard error
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
After the change, it raises a nice type error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821, #136822
2024-09-27 21:11:59 +00:00
Nikita Shulga
9d72f7481b [MPS] Fix AvgPool2d for float16 (#136822)
This was a stupid cast error that caused MPSGraph to crash with the following exception
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821
2024-09-27 15:32:18 +00:00
Nikita Shulga
2b6f4e9e24 [BE][MPS] Delete MacOS12 low-precision ops (#136821)
`norm` and `masked.normalize` still have to stay in the list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136821
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755
2024-09-27 15:32:18 +00:00
Nikita Shulga
69bd13d12e [EZ][BE] Add torch.complex to MPS_DTYPES (#136755)
As the minimal supported OS has been raised to MacOS 13, some basic complex operations should be supported, and the rest could be `xfailed`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136755
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754
2024-09-27 05:01:40 +00:00
Roy Hvaara
5789f8d5dc [MPS] Add regression test for large inputs to F.linear (#136084)
This PR adds a regression test for the issue reported in #122045. I was not able to reproduce on macOS > 13.

~Expect the first iteration of the tests to fail for macOS 13, but pass for 14 and 15.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136084
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-26 20:46:14 +00:00
Nikita Shulga
68579ef665 [EZ][MPS] Extend arange to bfloat16 (#136754)
RangeFactories class is the only one that uses `AT_DISPATCH_MPS_TYPES`

Fixes https://github.com/pytorch/pytorch/issues/136624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136754
Approved by: https://github.com/Skylion007
2024-09-26 15:33:45 +00:00
Nikita Shulga
73ec76ed50 [MPS] Implement isposinf and isneginf (#136689)
Not sure why `isinf` is a composite op, but these need to be implemented by hand.

Implementation is a trivial call to
```objc
[mpsGraph equalWithPrimaryTensor:input
                 secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity()
                                                     dataType:input.dataType]]
```
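
The same idea, an equality comparison against an infinity constant, can be sketched in plain Python. The function names mirror the op names, but this is a scalar illustration rather than the MPS implementation, and comparing against negative infinity for `isneginf` is an assumption, since the snippet above shows only the positive case.

```python
import math


def isposinf(x: float) -> bool:
    """True only for +inf; NaN compares unequal to everything, so it yields False."""
    return x == math.inf


def isneginf(x: float) -> bool:
    """True only for -inf (assumed to mirror the positive case)."""
    return x == -math.inf


samples = [math.inf, -math.inf, math.nan, 0.0, 1e308]
print([isposinf(v) for v in samples])
print([isneginf(v) for v in samples])
```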
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689
Approved by: https://github.com/Skylion007
2024-09-26 15:33:20 +00:00
Nikita Shulga
c6192f32f1 [MPS] Add upsample_bicubic2d as Metal op (#136123)
More or less literal copy-n-paste of c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)
and
c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)
Missing `uint8` implementation mimics CUDA behavior
Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk
Later refinements:
 - Switch from 2D dispatch to 1D one (to match CUDA behavior)
 - Added batch + channel loops
 - Fixed scale computation to match align corners behavior
 - Added backward implementation

The backward implementation again mimics CUDA, so it has precision issues for `torch.half` as well as a somewhat slow simulation of atomic adds using atomic compare-and-exchange of a pair of adjacent values, i.e.
```metal
template <typename T>
static inline void atomic_add_helper(
    device atomic<int>* data,
    long offset,
    float value) {
  auto ptr = data + (offset >> 1);
  auto old = atomic_load_explicit(ptr, memory_order_relaxed);
  union {
    int i;
    T t[2];
  } val;
  do {
    val.i = old;
    val.t[offset & 1] += static_cast<T>(value);
  } while (!atomic_compare_exchange_weak_explicit(
      ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed));
}
```
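
The compare-and-exchange loop above can be modelled in plain Python to show how two 16-bit values share one 32-bit cell. This is a single-threaded sketch under stated assumptions: `AtomicInt32` is a hypothetical stand-in for a Metal atomic, and the `struct` format `"e"` is used for half-precision packing.

```python
import struct


class AtomicInt32:
    """Single-threaded stand-in for a 32-bit atomic cell (hypothetical helper)."""

    def __init__(self, value=0):
        self.value = value

    def load(self):
        return self.value

    def compare_exchange(self, expected, desired):
        """Store `desired` if the cell still holds `expected`; return (ok, observed)."""
        observed = self.value
        if observed == expected:
            self.value = desired
            return True, observed
        return False, observed


def atomic_add_half(cells, offset, value):
    """Add `value` to the half-precision element at `offset`, two elements per cell."""
    cell = cells[offset >> 1]
    old = cell.load()
    while True:
        # Unpack both halves, update the addressed one, and repack.
        halves = list(struct.unpack("<2e", struct.pack("<I", old)))
        halves[offset & 1] += value
        new = struct.unpack("<I", struct.pack("<2e", *halves))[0]
        ok, old = cell.compare_exchange(old, new)
        if ok:
            return


def read_halves(cell):
    """Return the two half-precision values stored in one cell."""
    return struct.unpack("<2e", struct.pack("<I", cell.load()))


cells = [AtomicInt32(), AtomicInt32()]
atomic_add_half(cells, 0, 1.5)
atomic_add_half(cells, 1, 2.0)
atomic_add_half(cells, 0, 0.25)
print(read_halves(cells[0]))
```

In a real concurrent setting the loop retries whenever another thread changed the cell between the load and the exchange, which is why the simulation is comparatively slow.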
Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123
Approved by: https://github.com/albanD
2024-09-24 18:58:11 +00:00
Nikita Shulga
f6f1504d39 [MPS] Fix 5D+ reductions over negative dimentions (#136198)
This fixes a bug introduced by https://github.com/pytorch/pytorch/pull/99856, which attempts to speed up reductions for 5D+ tensors if trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped dimensions

Added regression test case to `TestMPS.test_sum`
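
Dimension wrapping, and why it matters before inspecting trailing dimensions, can be sketched in plain Python. Both helpers are hypothetical illustrations and not the code changed by this commit; `trailing_dims_are_ones` only assumes that the fast path looks at the sizes after the reduced dimension.

```python
def wrap_dim(dim: int, ndim: int) -> int:
    """Map a possibly negative dimension index into the range [0, ndim)."""
    if not -ndim <= dim < ndim:
        raise IndexError(f"dimension {dim} out of range for {ndim}-D tensor")
    return dim + ndim if dim < 0 else dim


def trailing_dims_are_ones(shape, dim):
    """Check the sizes after `dim`, wrapping the index first."""
    d = wrap_dim(dim, len(shape))
    return all(n == 1 for n in shape[d + 1:])


shape = (2, 3, 4, 1, 1)
print(wrap_dim(-1, len(shape)), trailing_dims_are_ones(shape, -3))
```

Without the wrap, `dim = -1` would make `shape[dim + 1:]` cover the whole shape instead of the empty tail, which is the kind of off-by-one the wrapping avoids.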

Fixes https://github.com/pytorch/pytorch/issues/136132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198
Approved by: https://github.com/albanD
2024-09-17 21:53:31 +00:00
PyTorch MergeBot
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd3.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
PyTorch MergeBot
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
Tom Ritchford
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
Tom Ritchford
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
CaoE
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
PyTorch MergeBot
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c5206133.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
CaoE
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU (see https://github.com/pytorch/pytorch/pull/109281).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
Tom Ritchford
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
Roy Hvaara
09287e3af4 [MPS] Add regression test for fft.fftfreq (#135440)
The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it.

Fixes #135223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440
Approved by: https://github.com/ezyang
2024-09-09 17:12:36 +00:00
Kulin Seth
144fde4fd2 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Need to run inductor/test_cpu_select_algorithm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Roy Hvaara <roy@lightyear.no>
2024-09-05 23:23:17 +00:00
Tobias Ringwald
758d787901 Added complex support for torch.logsumexp (#133187)
Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`.

Fixes #133047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-09-03 17:28:36 +00:00
Joona Havukainen
92f282ca52 Enable batch matmul for result sizes > 2**32 the tensor can be split along batch axis (#133430)
Fixes #131865. Addresses the issue seen when running llama v3.1 8B parameter model on MPS backend where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert.

Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:

```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```

Notably, the current change only works as long as each individual output matrix in the bmm does not exceed 2**32 elements. This lets us split up the computation along the batch axis to avoid going over the limit.

Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large to handle for this op until a more general workaround tiling the matmuls is available.
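
The batch-axis split described above can be sketched as a small planning function in plain Python. This is an illustration, not the MPS implementation: `batch_chunks` is a hypothetical helper, and treating 2**32 as an inclusive limit on the output elements per chunk is an assumption.

```python
INDEX_LIMIT = 2**32  # assumed element limit per matmul output


def batch_chunks(batch, rows, cols, limit=INDEX_LIMIT):
    """Return (start, end) batch ranges whose outputs each stay within `limit` elements."""
    per_matrix = rows * cols
    if per_matrix > limit:
        # Mirrors the idea of the TORCH_CHECK: a single matrix is already too large.
        raise ValueError("individual output matrix exceeds the element limit")
    if per_matrix == 0:
        return [(0, batch)]
    max_batches = limit // per_matrix
    return [
        (start, min(start + max_batches, batch))
        for start in range(0, batch, max_batches)
    ]


# Output of the reproducer above has shape [32, 20064, 20064].
print(batch_chunks(32, 20064, 20064))
```

Each range can then be handed to a separate matmul call and the results written into the matching slice of the output.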
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-30 14:08:43 +00:00
Li-Huai (Allan) Lin
e7711d6c7d [MPS] Fix SDP training (#134719)
Check whether the input tensors require grad. If required, then we don't get into the fast path and fall back to composite implicit.

Fixes #134678
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134719
Approved by: https://github.com/malfet
2024-08-29 01:28:53 +00:00
Nikita Shulga
8de0d7690c Use newer toAccumulateType signature in Normalization.cpp (#134540)
This fixes BatchNorm behavior when called with empty tensors on the MPS backend. Removed `expectedFailureMPS` in test_nn.py, deleted the expected failure in `test_mps.py`, and changed `skipIfMPS` to `expectedFailureMPS` in the BatchNorm2d OpInfo decorator, restricting it to the memory format tests only

Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`

Fixes https://github.com/pytorch/pytorch/issues/134423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-27 18:09:20 +00:00
Roy Hvaara
43f78bf37a [MPS] Gather sliced inputs to batch norm (#133610)
This PR removes the `executeGatherOp` flag from batch norm in favor of relying on the logic in 4aa66f68a8/aten/src/ATen/native/mps/OperationUtils.mm (L372) to decide if gathering is necessary.

It's not the most efficient way to solve this issue, but it assures correctness for sliced inputs.

### Performance impact

#### With fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 282 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 448 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 705 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 1.11 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.16 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 11.7 msec per loop
```

#### Without fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 284 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 265 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 715 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 675 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.19 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 7.13 msec per loop
```

Please feel free to push back or request changes.

Fixes #133520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133610
Approved by: https://github.com/malfet
2024-08-20 18:24:48 +00:00
Denis Vieriu
861bdf96f4 [MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393)
Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors.

Summary of changes (starting with macOS 15):
- Add support for **MPS strided API** (strides/storage offsets etc):
   - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc)
   - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc)
   - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc)
   - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc)
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer)

OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets.
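
What a gather has to do for a strided view can be sketched in plain Python: walk the logical indices, map each to a flat position through the strides and storage offset, and copy the element into a contiguous result. `gather_strided` is a hypothetical helper for illustration, not the MPS gather kernel.

```python
from itertools import product


def gather_strided(buffer, shape, strides, storage_offset=0):
    """Copy a strided view of a flat buffer into a contiguous (row-major) list."""
    out = []
    for index in product(*(range(n) for n in shape)):
        flat = storage_offset + sum(i * s for i, s in zip(index, strides))
        out.append(buffer[flat])
    return out


# A 3x4 row-major buffer holding 0..11.
buffer = list(range(12))

transposed = gather_strided(buffer, shape=(4, 3), strides=(1, 4))
sliced = gather_strided(buffer, shape=(2, 4), strides=(4, 1), storage_offset=4)
print(transposed)
print(sliced)
```

A native strided API lets the kernel read through the strides directly, so this extra copy is not needed.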

---

A couple of performance stats collected from torchbench comparing macOS 15 vs macOS 14:
```
- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393
Approved by: https://github.com/albanD

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
2024-08-16 21:07:50 +00:00
Sun, Jiayi
7be77658e9 [Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155)
This PR supports masked vectorization for the tail_loop for torch.uint8 and torch.int8 datatype to improve performance.
BTW, I fixed the UT of `byte` by setting the range of the sample inputs  to [0, 255] since the range of `torch.uint8` is [0, 255].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130724
2024-08-13 01:12:05 +00:00
Li-Huai (Allan) Lin
cc1cc71c46 [MPS] Fix relu for 0-element input case (#133191)
Fixes #133182

Should already be tested by `test/test_mps.py::MPSReluTest::testNumbersGPU`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133191
Approved by: https://github.com/albanD
2024-08-12 19:24:17 +00:00
PyTorch MergeBot
2764bee942 Revert "[MPS] Add support for autocast in MPS (#99272)"
This reverts commit 6919e8baab.

Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/clee2000 due to Broke test/inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_quantized_linear_amx_batch_size_3_in_features_128_out_features_64_bias_False_cpu on sm86 jobs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10252979157/job/28367091621) [HUD commit link](6919e8baab) Not caught on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2269808857))
2024-08-05 19:59:04 +00:00
Kulin Seth
6919e8baab [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet
2024-08-05 17:02:30 +00:00
Oguz Ulgen
221350e3a4 Add None return type to init -- tests (#132352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352
Approved by: https://github.com/ezyang
ghstack dependencies: #132335, #132351
2024-08-01 15:44:51 +00:00
ekamiti
9e473fd868 Make adding Buffers more like adding Parameters (#125971)
Add semantics for creating a buffer object that mirror those for creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the register_buffer method has not been changed. The persistent parameter of the Buffer type indicates whether a buffer object should be persistent. Other non-test changes get the new Buffer type recognized by inductor and dynamo. The remaining changes are test changes that make sure the Buffer type can be used as a drop-in replacement for register_buffer, since it just leads to register_buffer being called. Normal tensors can still be used as buffers, so these changes are intended to be backwards compatible.

Fixes #35735

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
2024-07-31 10:32:40 +00:00
Li-Huai (Allan) Lin
964f97539f [MPS] Correct nonzero warning and fix the test (#132127)
#125355 lifted the natively supported macOS version to 14.

Fixes #132110
Probably fixes this flaky test disabling issue: #126492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132127
Approved by: https://github.com/malfet
2024-07-30 19:46:25 +00:00
Li-Huai (Allan) Lin
a147fa577b [MPS] Fix masked_fill_ in non_contiguous cases (#131957)
fixes #131285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131957
Approved by: https://github.com/DenisVieriu97
2024-07-30 01:34:48 +00:00
Tom Ritchford
bdf5a6dca9 Add decomposition for unsqueeze_copy (#130942)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942
Approved by: https://github.com/peterbell10
2024-07-29 21:13:37 +00:00
Joona Havukainen
082d0b80ca Min and max NaN propagation fix in MPS backend (#130445)
Partial fix to issue #130295

Moves min and max ops to use the NaN-propagating API in MPS to align with the PyTorch convention. Adds a regression test to validate that the fix achieves parity with the CPU backend.
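
To illustrate what NaN propagation means here, the following pure-Python sketch contrasts a comparison-based maximum, which can silently drop a NaN, with a propagating one. The helper names `naive_max` and `nan_propagating_max` are hypothetical; this is not the MPS implementation, only a model of the two behaviors.

```python
import math


def naive_max(values):
    """Comparison-based maximum: a NaN never wins a `>` comparison."""
    result = values[0]
    for v in values[1:]:
        if v > result:
            result = v
    return result


def nan_propagating_max(values):
    """Maximum that returns NaN as soon as any input is NaN."""
    result = values[0]
    for v in values[1:]:
        if math.isnan(result):
            break
        if math.isnan(v) or v > result:
            result = v
    return result


data = [1.0, float("nan"), 3.0]
print(naive_max(data))            # the NaN is skipped
print(nan_propagating_max(data))  # the NaN is returned
```

A reduction that behaves like `naive_max` hides invalid inputs, which is why parity with the CPU backend requires the propagating variant.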
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130445
Approved by: https://github.com/malfet
2024-07-29 20:09:15 +00:00
Tom Ritchford
962f248437 Add decomposition for expand_copy (#130940)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940
Approved by: https://github.com/peterbell10
2024-07-29 16:23:56 +00:00
Manuel Candales
d6115439be [MPS] Add SDPA implementation (#131362)
This work is based off @malfet's #119200

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131362
Approved by: https://github.com/kimishpatel
2024-07-25 03:24:37 +00:00
Tom Ritchford
16247987a1 Add decomposition for t_copy (#130939)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939
Approved by: https://github.com/peterbell10
2024-07-23 08:29:19 +00:00
Joona Havukainen
102d8e5a63 MPS LSTM backward kernel workaround on MacOS 14.4+ (#130038)
The bug causing the correctness problem will be fixed in a future OS release. The root cause is a bug in an optimization of the MPSGraph reshape operation in macOS 14.4 that results in a correctness issue with the shapes the LSTM gradient operation has when num_layers > 2.

Addresses the silent nature of issue #125803.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038
Approved by: https://github.com/malfet
2024-07-23 06:30:40 +00:00
Tom Ritchford
500cbb5b90 Add decomposition for view_copy (#130938)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938
Approved by: https://github.com/peterbell10
ghstack dependencies: #130937
2024-07-21 20:39:24 +00:00
Xuehai Pan
d2bd9acabd [BE] bump optree version to 0.12.1 (#130139)
0.12.0 Major Updates:

- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support

0.12.1 Updates:

- Fix warning regression during import when launched with strict warning filters

Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
ghstack dependencies: #130895
2024-07-20 02:41:10 +00:00
Li-Huai (Allan) Lin
8ea03372a1 [MPS] Store philox counter as part of the RNG state (#130662)
Fixes #130613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130662
Approved by: https://github.com/malfet
2024-07-18 15:57:28 +00:00
PyTorch MergeBot
074a5c0c9b Revert "[BE] bump optree version to 0.12.1 (#130139)"
This reverts commit 8fcb156e8b.

Reverted https://github.com/pytorch/pytorch/pull/130139 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_codegen_dynamic_shapes.py and test_sympy_utils.py 8fcb156e8b ([comment](https://github.com/pytorch/pytorch/pull/130139#issuecomment-2229248447))
2024-07-15 19:42:11 +00:00
Xuehai Pan
8fcb156e8b [BE] bump optree version to 0.12.1 (#130139)
0.12.0 Major Updates:

- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support

0.12.1 Updates:

- Fix warning regression during import when launched with strict warning filters

Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
2024-07-15 17:27:07 +00:00
Nikita Shulga
febadda107 [MPS] Fix torch.[all|any] for 5+D tensors (#130542)
Work around a bug in `reductionAndWithTensor:` that kills the app with the
following assert if a 5+D tensor is passed as input
```
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76.
```
by reshaping the tensor to a 2D/3D one before running the reduction.

Refactored common code into `all_any_common_impl_mps`, as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue.
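
As a rough illustration of the reshaping idea, the sketch below collapses an N-D shape into three dimensions around a single reduction axis, so the reduction never sees more than four axes. The helper `collapse_for_reduction` is hypothetical; the actual logic lives in `all_any_common_impl_mps` and may differ.

```python
import math


def collapse_for_reduction(shape, dim):
    """Collapse `shape` to (outer, reduced, inner) around axis `dim`.

    Returns the collapsed shape and the axis to reduce over in it.
    Requires a shape with at least one dimension.
    """
    dim = dim % len(shape)              # support negative axes
    outer = math.prod(shape[:dim])      # product of dims before `dim`
    inner = math.prod(shape[dim + 1:])  # product of dims after `dim`
    return (outer, shape[dim], inner), 1


new_shape, axis = collapse_for_reduction((2, 3, 4, 5, 6), dim=2)
print(new_shape, axis)
```

The element count is unchanged, so reducing the collapsed tensor over `axis` and reshaping the result back gives the same values as reducing the original tensor over `dim`.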

Enabled `test_reduction_ops_5D` and added a regression test to it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #130541
2024-07-12 15:06:22 +00:00
PyTorch MergeBot
d97d962082 Revert "Add decompositions for copy variants of view ops (#128416)"
This reverts commit 68751799b8.

Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))
2024-07-11 22:09:23 +00:00
PyTorch MergeBot
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb21098.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).
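
The same comparison can be reproduced from within Python. The sketch below uses the standard `dis` module to list instruction names and shows the shadowing hazard in an isolated namespace; the helper `opnames` is a hypothetical name, and the exact instructions emitted for the call vary between Python versions.

```python
import collections
import dis


def opnames(source):
    """Return the bytecode instruction names for a single expression."""
    code = compile(source, "<example>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]


literal_ops = opnames("{}")
call_ops = opnames("dict()")
print(literal_ops)
print(call_ops)

# Shadowing `dict` changes what the factory call builds, but not the literal.
namespace = {"dict": collections.OrderedDict}
shadowed_call = eval("dict()", namespace)
shadowed_literal = eval("{}", namespace)
print(type(shadowed_call).__name__, type(shadowed_literal).__name__)
```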

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Jiang, Yanbing
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being regenerated on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled to platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a weight generated on one ISA or platform on another, because the compute format differs when loading the weight onto devices.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) for different devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)
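
To make the `[n][k] int32` to `[n][k/2] uint8` change concrete, here is a minimal pure-Python sketch of packing two 4-bit values into each byte. The helpers `pack_int4_rows` and `unpack_int4_rows` are hypothetical, and the nibble order (even-indexed element in the high nibble) is an assumption of this sketch rather than something stated in the PR description.

```python
def pack_int4_rows(weight):
    """Pack an [n][k] matrix of 4-bit values (0..15) into [n][k // 2] bytes.

    Assumption for this sketch: the even-indexed element goes into the
    high nibble and the odd-indexed element into the low nibble.
    """
    packed = []
    for row in weight:
        if len(row) % 2 != 0:
            raise ValueError("k must be even")
        if any(not 0 <= v <= 15 for v in row):
            raise ValueError("values must fit in 4 bits")
        packed.append(
            bytes((row[i] << 4) | row[i + 1] for i in range(0, len(row), 2))
        )
    return packed


def unpack_int4_rows(packed):
    """Inverse of pack_int4_rows: recover the [n][k] 4-bit values."""
    return [
        [nib for byte in row for nib in (byte >> 4, byte & 0x0F)]
        for row in packed
    ]


weight = [[1, 2, 3, 4], [15, 0, 7, 8]]
packed = pack_int4_rows(weight)
print([row.hex() for row in packed])

# Size comparison: 4 bytes per int32 element vs. half a byte per element.
n, k = len(weight), len(weight[0])
int32_bytes = n * k * 4
uint8_bytes = sum(len(row) for row in packed)
print(int32_bytes // uint8_bytes)
```

The size ratio follows directly from the element widths: `n * k * 4` bytes for int32 versus `n * k / 2` bytes for the packed form, a factor of 8.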

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
This PR shows no obvious performance regression.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
Tom Ritchford
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00
Li-Huai (Allan) Lin
99967e1119 [MPS][TYPE_PROMOTION] Fix Clamp (#130226)
Summary:
1. Fixed #130201 by adding type promotion.
2. Added proper tests.
3. Found torch's type promotion is different from numpy as follows:

```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype  # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype  # torch.float32
```

~Not sure the proper way to handle it, it causes numpy ref tests to fail.~
Reason here, so think I'm gonna xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
2024-07-10 14:27:39 +00:00
PyTorch MergeBot
637cc8d27f Revert "update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)"
This reverts commit 6367f02a0e.

Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main 6367f02a0e ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))
2024-07-10 13:48:32 +00:00
Jiang, Yanbing
6367f02a0e update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being regenerated on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled to platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a weight generated on one ISA or platform on another, because the compute format differs when loading the weight onto devices.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) for different devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
This PR shows no obvious performance regression.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-10 07:38:42 +00:00
Tom Ritchford
68751799b8 Add decompositions for copy variants of view ops (#128416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 01:39:09 +00:00
Joel Schlosser
c8ab2e8b63 Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238)
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed).
    * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`, see the before/after timings below.
* Uncovered a bunch of test issues:
    * Test breakdown (>100 total)
        * A lot of tolerance issues (tweaked tolerance values to fix)
        * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
        * 3 actually broken semantics (for masked tensor; added xfails)
        * 4 Jacobian mismatches (added xfails)
        * 2 nan results (skip for now, need fixing)
        * 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.

**before (no seed setting):**
```
real	0m21.306s
user	0m19.053s
sys	0m5.192s
```

**after (with seed setting):**
```
real	0m21.905s
user	0m19.578s
sys	0m5.390s
```

* Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command.

```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
    self.assertFalse(True)
AssertionError: True is not false

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
    fn(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
    raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.037s

FAILED (errors=1)
```
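
The per-sample seeding idea can be sketched with the standard library alone. Everything below is a hypothetical stand-in: `SeededSampleIter` only models the behavior described for `TrackedInputIter`, the `restrict_to_index` argument plays the role of `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX`, and Python's `random` module replaces the torch RNG.

```python
import random
import zlib


class SeededSampleIter:
    """Re-seed the RNG before each sample so sample `i` is reproducible
    regardless of which samples were generated before it."""

    def __init__(self, make_sample, num_samples, test_name, restrict_to_index=None):
        self.make_sample = make_sample
        self.num_samples = num_samples
        # Stable per-test seed (the built-in hash() is randomized per process).
        self.base_seed = zlib.crc32(test_name.encode())
        self.restrict_to_index = restrict_to_index

    def __iter__(self):
        for index in range(self.num_samples):
            if self.restrict_to_index is not None and index != self.restrict_to_index:
                continue
            random.seed(self.base_seed)  # reset state before every sample
            yield index, self.make_sample(index)


def make_sample(index):
    """Toy sample generator that consumes a varying amount of randomness."""
    return [random.random() for _ in range(index + 1)]


full = list(SeededSampleIter(make_sample, 5, "test_bar_add"))
only_third = list(SeededSampleIter(make_sample, 5, "test_bar_add", restrict_to_index=3))
print(only_third == [full[3]])
```

Because the RNG state is reset before each sample, a single failing sample can be regenerated on its own without replaying the samples that preceded it.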
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
2024-07-08 16:06:38 +00:00
PyTorch MergeBot
07450e9713 Revert "[MPS] Add support for autocast in MPS (#99272)"
This reverts commit 6240cfd5c7.

Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/jeanschmidt due to introduced breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2203033719))
2024-07-02 12:29:51 +00:00
Kulin Seth
6240cfd5c7 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet
2024-07-02 01:49:52 +00:00
Huy Do
fdd0a7f9b4 Run test_mps_allocator_module serially (#129340)
Not sure why this test started to fail (maybe runner update) 8a2fed7e6a/1 or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-07-01 18:44:48 +00:00
Joona Havukainen
5b96a552df Add a check and error message for no support on MPS for conv with output_channels > 2^16 (#129484)
Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value.

Proper support for such convolutions will come later; it requires work on the kernel side, so it will take more time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484
Approved by: https://github.com/kulinseth
2024-06-28 20:57:40 +00:00
Manuel Candales
eabe6574c0 [metal] Parameterize group_size in int4_mm test, fix int4mm shader for group_size > 128 (#129628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129628
Approved by: https://github.com/kimishpatel
2024-06-28 15:01:30 +00:00
Nikita Shulga
bc68907caa [EZ][BE] Replace assertTrue with more appropriate checks (#129569)
Based on this https://github.com/pytorch/pytorch/pull/129340#issuecomment-2191228046 I.e.
- `assertTrue(x == y)` -> `assertEqual(x, y)`
- `assertTrue(not x)` -> `assertFalse(x)`
- `assertTrue(x > y)` -> `assertGreater(x, y)`
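
The practical difference shows up in the failure messages. The sketch below triggers each assertion on purpose and prints what `unittest` reports; `failure_message` is a hypothetical helper written for this illustration.

```python
import unittest


def failure_message(assertion, *args):
    """Run a unittest assertion expected to fail and return its message."""
    try:
        assertion(*args)
    except AssertionError as exc:
        return str(exc)
    return None


case = unittest.TestCase()
x, y = 3, 4

print(failure_message(case.assertTrue, x == y))   # generic message
print(failure_message(case.assertEqual, x, y))    # names both values
print(failure_message(case.assertGreater, x, y))  # names both values
```

With `assertTrue`, the operands are already collapsed into a boolean before the assertion runs, so the report cannot say which values were compared.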

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129569
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
2024-06-26 16:29:59 +00:00
PyTorch MergeBot
b045878f81 Revert "Remove test_mps_allocator_module XFAIL (#129340)"
This reverts commit c888ee3632.

Reverted https://github.com/pytorch/pytorch/pull/129340 on behalf of https://github.com/huydhn due to The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](https://github.com/pytorch/pytorch/pull/129340#issuecomment-2189701706))
2024-06-25 18:37:54 +00:00
Isuru Fernando
e6bfa2958b Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in Triton code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-25 02:45:02 +00:00
Isuru Fernando
5f912f480c Fix max_pool2d decomposition for empty list and integer limits (#129106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129106
Approved by: https://github.com/peterbell10, https://github.com/lezcano, https://github.com/malfet
ghstack dependencies: #129096, #129097
2024-06-24 22:19:42 +00:00
Huy Do
c888ee3632 Remove test_mps_allocator_module XFAIL (#129340)
Not sure why this test started to fail (maybe runner update) 8a2fed7e6a/1 or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340
Approved by: https://github.com/kit1980
2024-06-24 16:26:38 +00:00
Manuel Candales
749c03406c [metal] Add int4mm weight packing mps kernel, and improved int4mm shader (#128965)
Adds a `_convert_weight_to_int4pack` MPS kernel.
Replaces the previous int4mm Metal shader with a shader authored by @kimishpatel, which improves perf by ~40%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128965
Approved by: https://github.com/malfet
2024-06-23 02:10:46 +00:00
Li-Huai (Allan) Lin
799acd31b4 [MPS] Add lu_factor (#99269)
### <samp>🤖 Generated by Copilot at d75cde1</samp>

Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269
Approved by: https://github.com/kulinseth, https://github.com/lezcano
2024-06-20 07:35:29 +00:00
Li-Huai (Allan) Lin
9a7e2519d3 [MPS] Fused Adam & AdamW (#127242)
Summary:

This PR adds fused Adam and AdamW implementations.

Benchmark on a MacBook Pro with an M1 Max chip and 64GB unified memory:
**Fast math enabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        89
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        90
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        83
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       12      |        94
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       11      |        88
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       12      |        90
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |       100
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       27      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       23      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       27      |       100
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       23      |        98
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       82      |       480
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       72      |       450
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       82      |       450
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       73      |       420
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       91      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       83      |       400
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |       94      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       78      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      170      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      140      |       600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      170      |       600
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      140      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      250      |       890
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      220      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      250      |       830
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      220      |       770
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      270      |       870
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      230      |       840
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      270      |       810
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      240      |       800
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      400      |      1000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      360      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      430      |      2000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      360      |      1300

Times are in milliseconds (ms).
```

**Fast math disabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        79
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       11      |        93
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       10      |        90
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       11      |        91
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |        81
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       34      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       31      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       34      |        95
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       31      |       100
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       94      |       500
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       82      |       430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       92      |       430
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       81      |       390
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       98      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       88      |       430
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |      100      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       88      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      210      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      190      |       610
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      210      |       510
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      190      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      300      |       900
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      260      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      295      |       900
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      260      |       800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      320      |       910
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      280      |       900
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      320      |       900
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      300      |       900
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      500      |      2000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      480      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      540      |      1500
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      480      |      1200

Times are in milliseconds (ms).
```

```python
def profile_fused_adam():
    import itertools

    # `torch` itself is needed below for torch.arange, torch.tensor, and torch.mps
    import torch
    import torch.utils.benchmark as benchmark
    from torch.optim import adam, adamw

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
        max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else []
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                    label='Fused Adam',
                    sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                    globals=locals(),
                    description=f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242
Approved by: https://github.com/kulinseth, https://github.com/janeyx99
2024-06-18 19:59:50 +00:00
Joona Havukainen
d9eaa224f2 Fixes #128429: NaN in triu op on MPS (#128575)
Fixes the triu op when k > 0 and the lower triangle of the input tensor contains inf, which led to NaNs because the result was computed through the complement. Fixed by using the select API instead.
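
A minimal pure-Python sketch of why a complement-based formulation can produce NaNs. It assumes the old path computed `triu(x, k)` as `x - tril(x, k - 1)`; that formula and the helper names below are illustrative assumptions, not the actual MPS code. With an `inf` below the diagonal, the subtraction becomes `inf - inf`, while a select-based version never does arithmetic on the discarded entries.

```python
import math

INF = float("inf")

def tril(rows, k):
    """Keep entries on or below the k-th diagonal, zero elsewhere."""
    return [[x if j - i <= k else 0.0 for j, x in enumerate(row)]
            for i, row in enumerate(rows)]

def triu_via_complement(rows, k):
    """Assumed old approach: subtract everything below diagonal k from the input."""
    lower = tril(rows, k - 1)
    return [[x - low for x, low in zip(row, low_row)]
            for row, low_row in zip(rows, lower)]

def triu_via_select(rows, k):
    """Select-based approach: pick the input or zero, no arithmetic on dropped entries."""
    return [[x if j - i >= k else 0.0 for j, x in enumerate(row)]
            for i, row in enumerate(rows)]

matrix = [[1.0, 2.0, 3.0],
          [INF, 4.0, 5.0],
          [6.0, INF, 7.0]]

bad = triu_via_complement(matrix, 1)   # inf - inf below the diagonal gives nan
good = triu_via_select(matrix, 1)      # dropped entries are plain zeros
print(bad[1][0], good[1][0])
```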

Fixes #128429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128575
Approved by: https://github.com/kulinseth
2024-06-18 03:44:42 +00:00
Nikita Shulga
9035fff2de [BE] Do not test deprecated torch.nn.utils.weight_norm (#128727)
Test `torch.nn.utils.parametrizations.weight_norm` instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128727
Approved by: https://github.com/kit1980
ghstack dependencies: #128726
2024-06-14 19:14:44 +00:00
Nikita Shulga
27458cc097 [BE] Refactor repeated code in test_weight_norm (#128726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128726
Approved by: https://github.com/kit1980
2024-06-14 19:14:44 +00:00
Tom Ritchford
edb45dce85 Add OpInfo entry for as_strided_copy (#127231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231
Approved by: https://github.com/lezcano
2024-06-13 13:58:47 +00:00
Nikita Shulga
0678742924 [MPS] Add Metal implementation of exp op (#128421)
To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor)
Reuse `test_exp1` to check that accuracy of `exp` ops is sometimes closer to CPU

Fix bug in non-contiguous tensors handling

Fixes https://github.com/pytorch/pytorch/issues/84936
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128421
Approved by: https://github.com/kulinseth
ghstack dependencies: #128373, #128375
2024-06-13 06:53:17 +00:00
Kulin Seth
8df56afc20 Add support in Python API for the recommended max working set size. (#128289)
Adds ways for users to request recommended max size for Metal on Mac. It plumbs through
https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc

Can be used like
```
import torch

max_memory = torch.mps.recommended_max_memory()
print("Recommended Max Memory : ", max_memory / (1024 * 1024 * 1024), "GB")
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289
Approved by: https://github.com/malfet
2024-06-12 16:03:57 +00:00
Tom Ritchford
2386045e4f Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-12 09:39:58 +00:00
Joona Havukainen
a5ba9b2858 Fix for addcdiv contiguous problem (#124442)
Fixes issue number #118115
Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124442
Approved by: https://github.com/kulinseth
2024-06-06 16:09:18 +00:00
Huy Do
8992141dba Restore MPS testing on MacOS 13 and m2 metal (#127853)
The runners are ready now (https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13); we want to keep some MacOS 13 runners for MPS coverage until MacOS 15 is out.

This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582.

The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853
Approved by: https://github.com/malfet
2024-06-05 14:44:00 +00:00
PyTorch MergeBot
d1fad416a8 Revert "Add aten._unsafe_masked_index (#116491)"
This reverts commit f03f8bc901.

Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))
2024-06-03 15:51:50 +00:00
Isuru Fernando
f03f8bc901 Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-03 14:44:03 +00:00
Nikita Shulga
045309aa35 [MPS] Enable torch.mm and friends for complex dtypes (#127241)
- Add `supportedFloatingOrComplexType`
- Change dtype check to those
- Extend low-precision fp32 list to complex types
- Mark conv2d as supported now, as it was failing due to tighter accuracy constraints than those of the same op for the float32 dtype

Fixes https://github.com/pytorch/pytorch/issues/127178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127241
Approved by: https://github.com/janeyx99
2024-05-28 17:56:13 +00:00
Nikita Shulga
4ff9113e3d [MPS] Add _weight_int8pack_mm tests (#127041)
Also extend the test to cover MV cases (where the A matrix is 1xM). Limit int8 op testing to 32x32 matrix sizes for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041
Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales
2024-05-24 16:08:06 +00:00
jhavukainen
6a539e80dd Update descriptor fields to resolve fft precision issue (#125328)
Fixes #124096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125328
Approved by: https://github.com/kulinseth, https://github.com/malfet
2024-05-22 21:48:49 +00:00
jhavukainen
d28868c7e8 Change skipIfs to xfails in test_mps.py for test_isin (#125412)
Follow-up to #124896 to move the added test to use expectedFailure instead of skip.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125412
Approved by: https://github.com/kulinseth
2024-05-20 20:23:53 +00:00
Nikita Shulga
b8a706a321 [EZ][BE] Use untyped_storage in tests (#125838)
Gets rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if base.storage().data_ptr() != other.storage().data_ptr():
```

(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )

Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
2024-05-09 14:04:21 +00:00
Nikita Shulga
4e29e80bf0 Run MPS tests on MacOS Sonoma (#125801)
Those ones are running 14.4.1, so I wonder if they actually pass CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125801
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-05-09 13:43:12 +00:00
Denis Vieriu
58e045d03c [MPS] Fix strided ELU op (#125692)
Fixes https://github.com/pytorch/pytorch/issues/124834

Summary of changes:

In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps, one to allocate a contiguous buffer and the second one to scatter the result back to the original output.
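
The two-step pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: the flat-list "storage", the `offset`/`stride` addressing and the `elu_strided` helper are hypothetical stand-ins, not the MPS implementation.

```python
import math

def elu(x, alpha=1.0):
    """Scalar ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_strided(src, out, offset, stride, length):
    """Apply ELU to a strided 1-D view of `src`, writing to the same view of `out`."""
    # Step 1: compute the result into a temporary contiguous buffer.
    tmp = [elu(src[offset + i * stride]) for i in range(length)]
    # Step 2: scatter the contiguous result back into the strided output.
    for i, value in enumerate(tmp):
        out[offset + i * stride] = value

# Every second element forms the non-contiguous view: [-1.0, 2.0, -3.0].
src = [-1.0, 10.0, 2.0, 20.0, -3.0, 30.0]
out = [0.0] * len(src)
elu_strided(src, out, offset=0, stride=2, length=3)
print(out)
```

Elements of `out` that are not part of the view are left untouched, which is the property the scatter step has to preserve.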

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125692
Approved by: https://github.com/kulinseth
2024-05-08 01:34:40 +00:00
Denis Vieriu
ba27548679 [MPS] Remove in place views (causes too many crashes) (#124895)
Fixes https://github.com/pytorch/pytorch/issues/96153

Remove in place views as they are a general cause for many crashes.
Proper fix to handle views without copies will come in a different PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124895
Approved by: https://github.com/kulinseth
2024-05-08 01:00:37 +00:00
Denis Vieriu
3fb53bb6a7 [MPS] Fix strided mse_loss (#125696)
Fixes https://github.com/pytorch/pytorch/issues/124621

Summary of changes:
- In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps, one to allocate a contiguous buffer and the second one to scatter the result back to the original output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125696
Approved by: https://github.com/kulinseth
2024-05-08 00:52:26 +00:00
Nikita Shulga
0fd1fc17c3 [MPS] Fix abs for complex types (#125662)
By calling `realPartOfTensor:` if the input type is complex on Sonoma, and falling back to the `at::view_as_real` trick on Ventura.

Split `unary_op` template into `unary_op` and `unary_op_noresize`, which skips resize and empty checks

Marked `abs`, `isclose` and `nn.functional.softsign` OpInfo tests as supported by complex types

Fixes https://github.com/pytorch/pytorch/issues/125135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125662
Approved by: https://github.com/kulinseth
2024-05-07 22:15:20 +00:00
Nikita Shulga
30610251ec [MPS] Add naive quantized intmm and .gputrace capture hooks (#125163)
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfiler` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
  ```cpp
   if (getMPSProfiler().isCaptureEnabled()) {
     getMPSProfiler().startCapture(__func__, mpsStream);
   }
   ...
   if (getMPSProfiler().isCapturing()) {
     getMPSProfiler().stopCapture(mpsStream);
   }
  ```
  that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel

- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`

TODOs:
 - Add weight pack
 - Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
2024-05-03 15:20:39 +00:00
Denis Vieriu
a40d6df448 [MPS] Native nonzero implementation (#125355)
Fixes https://github.com/pytorch/pytorch/issues/124850

Replace previous MPSGraph nonzero construction with native nonzero op. For older OSes, fall back to CPU (previous implementation was not reliable and was comparable to CPU in speed).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125355
Approved by: https://github.com/kulinseth
2024-05-03 03:50:58 +00:00
Roy Hvaara
e15da7856c [MPS] Fix overflow in cumsum when dtype is bool (#125318)
`cumsum` and `cumprod` were (are?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)

A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.

It turns out this issue also affects boolean types:

```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```

In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
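
The `-128` above is what one would expect if the running sum were kept in an 8-bit signed integer. The following pure-Python sketch emulates that, assuming (this is a guess about the mechanism, not something the PR states) that the accumulator has the width of the input type; `cumsum_wrapping` is a hypothetical helper.

```python
import ctypes

def cumsum_wrapping(values, ctype):
    """Cumulative sum whose accumulator wraps like the given C integer type."""
    total, out = 0, []
    for v in values:
        # ctypes integer types silently wrap on overflow, like fixed-width hardware ints.
        total = ctype(total + int(v)).value
        out.append(total)
    return out

ones = [True] * 128
narrow = cumsum_wrapping(ones, ctypes.c_int8)[-1]   # wraps past 127
wide = cumsum_wrapping(ones, ctypes.c_int32)[-1]    # accumulate in int32 instead
print(narrow, wide)  # -128 128
```

Casting to a wider type before accumulating, as the workaround does with int32, keeps the sum of 128 ones representable.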

Fixes #96614 #106112 #109166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
2024-05-03 01:19:24 +00:00
Joona Havukainen
c451d108da Implemented isin_Tensor_Tensor_out for MPS backend (#124896)
Addresses issue #124518, adds isin_Tensor_Tensor_out.

Tests added to test_mps.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124896
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-05-01 23:14:05 +00:00
Nikita Shulga
5944a53555 [MPS] Fix nextafter for negative values (#125029)
By changing the logic on older MacOS to:
```cpp
bits += ((input > 0) ^ (input > other)) ? 1 : -1;
```
And use native `nextafter` on MacOS Sonoma (i.e. if Metal 3.1 is available)
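
The bit-increment rule can be checked in plain Python. The sketch below is an assumption-laden illustration: it works on 64-bit doubles rather than the 32-bit floats used on the GPU, so that it can be compared against `math.nextafter` (Python 3.9+), and the special-case handling for NaN, equal inputs and zero is added here and is not taken from the PR.

```python
import math
import struct

def nextafter_bits(x, other):
    """Next representable double after x in the direction of other."""
    if math.isnan(x) or math.isnan(other):
        return math.nan
    if x == other:
        return other
    if x == 0.0:
        # Smallest subnormal, carrying the sign of the direction we move in.
        tiny = struct.unpack("<d", struct.pack("<Q", 1))[0]
        return math.copysign(tiny, other)
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Same rule as the C++ snippet: the magnitude grows (bits + 1) when a
    # positive x moves up or a negative x moves down, and shrinks otherwise.
    bits += 1 if ((x > 0) ^ (x > other)) else -1
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

pairs = [(1.0, 2.0), (1.0, 0.0), (-1.0, -2.0), (-1.0, 0.0),
         (0.0, 1.0), (0.0, -1.0), (1.5, 1.5), (math.inf, 0.0)]
for x, other in pairs:
    assert nextafter_bits(x, other) == math.nextafter(x, other)
```

The negative-input rows are the interesting ones: for `x < 0`, moving towards a smaller value must increase the raw bit pattern, which is the case the commit title refers to.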

TODO:
  - Add tests for infs and denorms

Fixes https://github.com/pytorch/pytorch/issues/124985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125029
Approved by: https://github.com/Skylion007
2024-04-27 02:58:05 +00:00
Nikita Shulga
db3a2d751c [MPS][BE] Error-check linear (#124952)
Validate that all arguments are on MPS devices and dtypes are expected

Fixes cryptic messages like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))"
RuntimeError: Placeholder storage has not been allocated on MPS device!
```
And hard crashes like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))"
```

Fixes https://github.com/pytorch/pytorch/issues/123995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952
Approved by: https://github.com/Skylion007
2024-04-25 23:25:20 +00:00
Nikita Shulga
abf3f90781 [MPS] Fix large copy (#124635)
By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2Gb chunks
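
A rough sketch of the chunking arithmetic, in plain Python. The helper names and the `bytearray`-based copy are hypothetical; the real fix issues one Metal blit per chunk, which is not modelled here.

```python
GIB = 1024 ** 3
MAX_CHUNK = 2 * GIB  # per-copy limit assumed from the commit message

def copy_chunks(total_bytes, max_chunk=MAX_CHUNK):
    """Yield (offset, size) pairs that cover total_bytes in pieces of at most max_chunk."""
    offset = 0
    while offset < total_bytes:
        size = min(max_chunk, total_bytes - offset)
        yield offset, size
        offset += size

def chunked_copy(src, dst, max_chunk):
    """Copy src into dst one chunk at a time, as a stand-in for per-chunk blits."""
    for offset, size in copy_chunks(len(src), max_chunk):
        dst[offset:offset + size] = src[offset:offset + size]

# A 5 GiB copy is split into 2 GiB + 2 GiB + 1 GiB.
plan = list(copy_chunks(5 * GIB))
print([(offset // GIB, size // GIB) for offset, size in plan])  # [(0, 2), (2, 2), (4, 1)]
```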

Add regression test, but limit it to machines with 12Gb of RAM or more, and MacOS 14+, as on MacOS 13 an attempt to allocate a 4Gb tensor fails with:
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```

Fixes https://github.com/pytorch/pytorch/issues/124335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635
Approved by: https://github.com/kulinseth
2024-04-22 23:43:11 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1 .
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Joël Tang
a6a3f2e06b [MPS] Fixes GELU, LeakyRELU and MISH on non-contiguous tensors (#123049)
Fixes GELU, LeakyRELU and MISH activation functions on non-contiguous tensors (for instance, when a transpose operation was applied on the tensors prior to the MPS operator), forward and backward passes.

I also extended tests on the 3 activation functions to check: full-precision and half-precision, contiguous and non-contiguous, and several dims of tensors: scalars, 1D, empty, 2D, > 3D.

I had issues with Mish and GELU activations when asserting the gradients vs. CPU with sum() in some cases, so I reverted to the previous setup by setting a gradient parameter on .backward().
This PR also fixes an issue with LeakyRELU on empty tensors.

Fixes #98212 huggingface/transformers#22468 huggingface/transformers#19353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049
Approved by: https://github.com/kulinseth
2024-04-21 00:12:32 +00:00
Nikita Shulga
5677128cb8 [MPS] Fix crash when binary_cross_entropy is invoked for half dtypes (#124258)
By creating constants using input tensors dtype

One line reproducer:
```
python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))"
```

Before the change
```
loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
After
```
tensor(-33.7812, device='mps:0', dtype=torch.float16)
```

Fixes https://github.com/pytorch/pytorch/issues/124252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258
Approved by: https://github.com/kulinseth
2024-04-18 15:21:01 +00:00
xinan.lin
6fcbeb3489 [ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256)
Add CPU FP16 support for nll_loss and cross_entropy_loss.
Resolve issue #123328.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-04-18 11:44:38 +00:00
Pearu Peterson
d2b0c0a34e Fix index_reduce sampler filter when op_info.variant_test_name is specified (#123375)
As in the title: `index_reduce` sample must correspond to reduction type specified by `variant_test_name`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123375
Approved by: https://github.com/zou3519, https://github.com/peterbell10
2024-04-17 15:31:28 +00:00
FFFrog
acc466751b Add bfloat16 support to binary_cross_entropy for CPU (#123823)
Fixes #123715

As the title stated.

But, maybe we should pay attention to this https://github.com/pytorch/pytorch/pull/33206, which removed the half support for cpu about 4 years ago.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123823
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-04-17 09:44:07 +00:00
Joona Havukainen
05289a278c Fix for MPS regression in #122016 and #123178 (#123234)
Fixes #122016 and #123178. This regression is related to an OS side change that requires a slight adjustment from us on PyTorch side to restore the previous behavior. Additionally we cleared out pre-MacOS13 related workarounds.

Before the fix on MacOS 14.4:

```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 3., 3.], device='mps:0')
```

After the fix:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 1., 3.], device='mps:0')
```

This also fixes complex number initialization and as such makes `nn.functional.rms_norm` pass on MacOS-14+

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-04-03 23:00:57 +00:00
PyTorch MergeBot
feabb645a7 Revert "Handle transposes in second batch of matrices in bmm (#122194)"
This reverts commit 251ad1232b.

Reverted https://github.com/pytorch/pytorch/pull/122194 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/122194#issuecomment-2032806360))
2024-04-02 18:49:28 +00:00
Kulin Seth
251ad1232b Handle transposes in second batch of matrices in bmm (#122194)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122194
Approved by: https://github.com/DenisVieriu97
2024-04-02 17:48:35 +00:00
Nikita Shulga
4c70ab26ef [MPS] Enable index_select for complex types (#122590)
Surprisingly, as of MacOS-14.14 MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them by using `at::view_as_real` trick
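
The `view_as_real` trick can be illustrated without any GPU: treat each complex number as a pair of reals, gather the pairs with a real-valued index select, then reassemble. The functions below are pure-Python stand-ins with hypothetical names, not the ATen or MPSGraph calls.

```python
def view_as_real(values):
    """Represent each complex number as a [real, imag] pair."""
    return [[z.real, z.imag] for z in values]

def view_as_complex(pairs):
    """Inverse of view_as_real."""
    return [complex(re, im) for re, im in pairs]

def index_select_real(rows, indices):
    """Gather along dim 0; stands in for a gather that only supports real dtypes."""
    return [rows[i] for i in indices]

def index_select_complex(values, indices):
    """Emulate a complex index_select using only the real-valued gather."""
    return view_as_complex(index_select_real(view_as_real(values), indices))

data = [1 + 2j, 3 - 4j, 5j]
print(index_select_complex(data, [2, 0, 0]))
```

The gather runs on data with one extra trailing dimension of size 2, so selecting along dim 0 moves real and imaginary parts together.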

Fixes https://github.com/pytorch/pytorch/issues/122427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590
Approved by: https://github.com/Skylion007
2024-03-25 16:57:35 +00:00
andrewor14
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
Roger Lam
40acc84aaf Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri
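
As an analogy only (Python's built-in `min`/`max`, not the MPSGraph ops), the sketch below shows how a clamp built from ordinary min/max can silently replace NaN with a bound, and what a NaN-propagating variant looks like. Both function names are made up for this illustration.

```python
import math

def clamp_dropping_nan(x, lo, hi):
    """Clamp built from plain max/min; a NaN input is replaced by a bound."""
    # Python's max/min keep their first argument unless a later one compares
    # strictly greater/smaller, and every comparison with NaN is False.
    return min(max(lo, x), hi)

def clamp_propagating_nan(x, lo, hi):
    """Clamp that returns NaN whenever the input is NaN."""
    if math.isnan(x):
        return x
    return min(max(lo, x), hi)

print(clamp_dropping_nan(math.nan, 0.0, 1.0))     # 0.0
print(clamp_propagating_nan(math.nan, 0.0, 1.0))  # nan
```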

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-18 19:38:15 +00:00
PyTorch MergeBot
0cc60a05da Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381)"
This reverts commit ca80d07ac7.

Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR.  Please take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))
2024-03-15 23:53:05 +00:00
Roger Lam
ca80d07ac7 Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-15 21:54:50 +00:00
Nikita Shulga
5498804ec2 [MPS] Fix naive matmul for BFloat16 (#121731)
Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate

Fixes https://github.com/pytorch/pytorch/issues/121583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731
Approved by: https://github.com/albanD
2024-03-13 14:34:03 +00:00
Nikita Shulga
07330ff7b6 [MPS][BE] Define _compute_tolerances (#121754)
Right now logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`
So move tolerance definition logic into a shared `_compute_tolerances` function and
only keep differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.

Also, increase tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older and remove GRAD xfaillist for those

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
2024-03-13 04:08:06 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
Boyuan Feng
35d3adb4b0 Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
# Motivation

In the backward pass of per-parameter sharding FSDP, each rank performs reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along dimension 0 and concatenates all slices along dimension 1. Gradient tensors are padded before concatenation when tensor.size(0) % world_size != 0.

### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):

Input tensors:
```
AAAA   BBB   CC
AAAA   BBB
       BBB
```

Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```

### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):

Input tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB        DD
       BBB        DD
                  DD
```

Reduce-scatter-copy-in first pads the tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

It then chunks and concatenates along dim to produce the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```
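
The two examples above can be reproduced with a small pure-Python reference. This is a sketch only: it handles 2-D inputs chunked along dimension 0, and `chunk_cat_rows` and `block` are hypothetical helpers rather than the ATen implementation.

```python
def chunk_cat_rows(tensors, num_chunks, pad="0"):
    """Pad each 2-D input along dim 0, split it into num_chunks, flatten and concatenate.

    Each tensor is a list of rows and each row is a list of items.
    Output row c holds the flattened c-th chunk of every input, in order.
    """
    out = [[] for _ in range(num_chunks)]
    for rows in tensors:
        width = len(rows[0])
        # Rows per chunk once dim 0 is padded up to a multiple of num_chunks.
        rows_per_chunk = -(-len(rows) // num_chunks)
        missing = rows_per_chunk * num_chunks - len(rows)
        padded = rows + [[pad] * width for _ in range(missing)]
        for c in range(num_chunks):
            chunk = padded[c * rows_per_chunk:(c + 1) * rows_per_chunk]
            out[c].extend(item for row in chunk for item in row)
    return out

def block(letter, n_rows, n_cols):
    """Build an n_rows x n_cols tensor filled with one letter."""
    return [[letter] * n_cols for _ in range(n_rows)]

example1 = [block("A", 2, 4), block("B", 3, 3), block("C", 1, 2)]
example2 = example1 + [block("D", 4, 2)]

for row in chunk_cat_rows(example1, num_chunks=3):
    print("".join(row))
# AAAABBBCC
# AAAABBB00
# 0000BBB00

for row in chunk_cat_rows(example2, num_chunks=2):
    print("".join(row))
# AAAABBBBBBCCDDDD
# AAAABBB00000DDDD
```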

The performance of reduce-scatter-copy-in is critical to per-parameter sharding FSDP. However, reduce-scatter-copy-in via composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.

# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:

```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```

This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with the baseline of composing existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.

## Requirements on input

1. If input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, we support both negative and non-negative dim.
2. For wrapped_dim, all tensors should have the same size for 0,...,wrapped_dim-1 dimensions. No requirements for (wrapped_dim, ...)-th dimension.
3. Expect positive num_chunks
4. Expect non-empty input tensor list and each input tensor should have at least 1 element

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
2024-03-08 21:48:12 +00:00
Nikita Shulga
9b03a06288 [BE] [MPS] Fix out resize logic in torch.where (#121476)
By deleting `where_mps` and registering MPS dispatch for `where_kernel`.
As a result of this change, resizing and type-checking logic is shared between the MPS, CPU and CUDA backends.

Add a test case to `TestMPS.test_where` (that should eventually be removed when `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
2024-03-08 18:59:37 +00:00
andrewor14
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
Tugsbayasgalan Manlaibaatar
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
Kai
c59b14163b Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2024-02-26 23:04:52 +00:00
Nikita Shulga
53bfae2c06 [MPS] Add torch.fft. support (#119670)
Increase tolerance for `fft` ops. This warrants further investigation, as the discrepancy grows with larger matrix dimensions (see https://github.com/pytorch/pytorch/issues/120237 )

When compiling on MacOS13, implement `+[FakeMPSGraphFFTDescriptor descriptor]` as a redispatch to the real thing.

Fixes https://github.com/pytorch/pytorch/issues/78044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119670
Approved by: https://github.com/kulinseth, https://github.com/albanD
2024-02-20 18:23:06 +00:00
Nikita Shulga
eb9a3383c2 [MPS] Add naive std_mean implementation (#119777)
By just calling `std_mps` and `mean` in sequence

Move `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer

def bench_var_mean(
    m, n, k,
    dtype = torch.float32,
    device:str = "cpu",
) -> Measurement:
    setup = f"""
     x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
    """

    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec and after 668 and 684 (which probably means that the GPU is not saturated, but the overhead from switching between native and interpreted runtimes is smaller).

Fixes https://github.com/pytorch/pytorch/issues/119663

TODOs:
 - Refactor the codebase and implement proper composite function (that must be faster)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
2024-02-13 21:51:29 +00:00
Nikita Shulga
15ef52a015 [MPS] Enable conj and conj_physical (#119669)
The former is only available on MacOS 14+, but on older MacOS versions it at least raises an exception rather than returning a non-conjugated tensor

Preliminary step for enabling FFT ops (without it `ifft` would never work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
2024-02-13 02:27:51 +00:00
Nikita Shulga
8d8fb9783c [MPS][EZ] Fix cfloat->chalf conversion on MacOS13 (#119681)
By using `view_as_real` when type casting between two complex types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119681
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-12 19:09:10 +00:00
Pearu Peterson
2c91e13afc Add lowerings to special functions (#119187)
As in the title.

In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187
Approved by: https://github.com/peterbell10
2024-02-11 16:35:40 +00:00
Nikita Shulga
4ee8aac432 [MPS] Enable bfloat16 support on MacOS 14 (#119641)
Per [MPSDataType](https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype/mpsdatatypebfloat16?changes=_11&language=objc) documentation, bfloat16 is supported in MacOS Sonoma or later

Added missing `MPSDataTypeBFloat16` and `MTLLanguageVersion3_1` enums to `MPSGraphSonomaOps.h`

TODO: Enable more testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119641
Approved by: https://github.com/Skylion007
2024-02-11 16:25:29 +00:00
Nikita Shulga
1d61011c11 [MPS] Add support for complex scalars (#119318)
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS does not yet support double precision types, so downcast them to floats
- Also add `cf` (for complex float) and `ch` (for complex half) to the MPSScalar value union
- Fix complex scalars to view promotion by introducing the `legacy_complex_as_view` helper function, which converts non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`

Fixes https://github.com/pytorch/pytorch/issues/119088

Test plan: CI (have quite a lot of tests, see new unexpected successes) +  `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
2024-02-08 18:10:59 +00:00
watarungurunnn
d444a3b443 [MPS] fix float32 error on mps, in linalg.matrix_rank and linalg.pinv (#114771)
Fixes #114285

(However, this still raises a NotImplementedError:
```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
2024-02-05 15:36:55 +00:00
lancerts
26a2743162 Fix placeholder tensor is empty for relu in mps (#118965)
Fixes #118845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118965
Approved by: https://github.com/malfet
2024-02-03 23:50:35 +00:00
Nikita Shulga
24dd9f42ce [MPS] Fix use_metal_mm condition (#118830)
One should look not only at stride size but at dimensions as well, as the strides of `torch.rand(65536, 1)` are `(1, 1)`

Extend test to account for this situation
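
To illustrate why a stride-only check misses this case, here is a minimal pure-Python sketch. The helpers `contiguous_strides`, `too_large_by_strides` and `too_large` are hypothetical names invented for this example, and the 32768 threshold is an assumption borrowed from the related large-matrix fix (#117549); the actual `use_metal_mm` condition may differ.

```python
# Hypothetical helpers for illustration; these are not PyTorch APIs.
LIMIT = 32768  # assumed threshold, taken from the related large-matrix fix


def contiguous_strides(shape):
    """Row-major strides, in elements, of a contiguous tensor of this shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * max(shape[i + 1], 1)
    return tuple(strides)


def too_large_by_strides(shape):
    """Stride-only check: misses tall, thin matrices."""
    return any(s > LIMIT for s in contiguous_strides(shape))


def too_large(shape):
    """Check strides and dimensions together."""
    return too_large_by_strides(shape) or any(d > LIMIT for d in shape)


shape = (65536, 1)
print(contiguous_strides(shape))    # (1, 1)
print(too_large_by_strides(shape))  # False
print(too_large(shape))             # True
```

A 65536 x 1 matrix has a dimension above the limit even though both of its strides are 1, so only the combined check flags it.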

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118830
Approved by: https://github.com/huydhn
2024-02-01 17:53:42 +00:00
Yifu Wang
a1280f0cc6 Add an OpInfo test for split_with_sizes_copy (#118512)
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, adding the `OpInfo` test in a separate PR to establish a healthy baseline.

Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
2024-02-01 07:09:27 +00:00
Sun, Jiayi
2dd4a254a0 add Half support for interpolate operators on CPU (#105648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105648
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 09:07:16 +00:00
Nikita Shulga
1872834247 [MPS] Fix torch.mm correctness for large matrices (#117549)
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K.
Solve it by providing a very naive matrix multiplication Metal shader and calling it if the stride size is greater than 32768 elements. Slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well if tiling is done as follows:
```objc
  NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
  for (int64_t i = 0; i < M; i += tile_size) {
    const auto i_end = std::min(i + tile_size, M);
    NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
    for (int64_t j = 0; j < K; j += tile_size) {
      const auto j_end = std::min(j + tile_size, K);
      MPSGraphTensor* tile = nil;
      for (int64_t k = 0; k < N; k += tile_size) {
        const auto k_end = std::min(k + tile_size, N);
        auto selfChunk = [graph sliceTensor:selfTensor
                                     starts:@[ @(i), @(k) ]
                                       ends:@[ @(i_end), @(k_end) ]
                                    strides:@[ @(1), @(1) ]
                                       name:nil];
        auto otherChunk = [graph sliceTensor:otherTensor
                                      starts:@[ @(k), @(j) ]
                                        ends:@[ @(k_end), @(j_end) ]
                                     strides:@[ @(1), @(1) ]
                                        name:nil];
        auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];

        tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
      }
      [row_chunks addObject:tile];
    }
    auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
    [rows addObject:row];
  }
  return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```

One can always use the Metal MM by defining the `PYTORCH_MPS_PREFER_METAL` environment variable
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
2024-01-17 01:33:08 +00:00
Nikita Shulga
6784030df4 [MPS] Add support for 64-bit index operations (#116942)
But enable it only if `iter.can_use_32bit_indexing()` is False. Add a test for index_select, but enable it only on Sonoma, as all attempts to create a 4GB+ tensor on Ventura and older fail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116942
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940
2024-01-09 16:56:49 +00:00
Nikita Shulga
ff0f79d3c7 [MPS] Mark torch.[all|any] as working with complex on MacOS14 (#116907)
It was enabled by https://github.com/pytorch/pytorch/pull/116457, but Sonoma testing was not yet enabled at the time that PR landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116907
Approved by: https://github.com/osalpekar, https://github.com/kit1980
2024-01-06 01:10:11 +00:00
Nikita Shulga
b0393ebe9b [MPS] Make test_mps.py passable on Sonoma (#116764)
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as not yet supporting Complex, as attempting to call them crashes with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix regression re-surfaced in `index_fill` (used to be broken in Monterey, then fixed in Ventura and broken in Sonoma again)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
2024-01-05 00:25:47 +00:00
Gao Tianlin
6793b99107 [BugFix] Fix SegFault when torch.all/any dispatched to mps or other backends (#116457)
The old implementation will result in an infinite recursive loop, leading to a stack overflow and segfault.

If TORCH_SHOW_DISPATCH_TRACE is on, with a debug build of PyTorch, we can see the following endless output in the terminal:
```
[call] op=[aten::quantize_per_tensor], key=[AutogradCPU]
  [redispatch] op=[aten::quantize_per_tensor], key=[CPU]
 [call] op=[aten::any.dims], key=[AutogradCPU]
  [redispatch] op=[aten::any.dims], key=[QuantizedCPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::any.dims_out], key=[QuantizedCPU]
    [call] op=[aten::any.dims], key=[QuantizedCPU]
     [call] op=[aten::empty.memory_format], key=[BackendSelect]
      [redispatch] op=[aten::empty.memory_format], key=[CPU]
     [call] op=[aten::any.dims_out], key=[QuantizedCPU]
      [call] op=[aten::any.dims], key=[QuantizedCPU]
       [call] op=[aten::empty.memory_format], key=[BackendSelect]
        [redispatch] op=[aten::empty.memory_format], key=[CPU]
       [call] op=[aten::any.dims_out], key=[QuantizedCPU]
        [call] op=[aten::any.dims], key=[QuantizedCPU]
         [call] op=[aten::empty.memory_format], key=[BackendSelect]
          [redispatch] op=[aten::empty.memory_format], key=[CPU]
         [call] op=[aten::any.dims_out], key=[QuantizedCPU]
          [call] op=[aten::any.dims], key=[QuantizedCPU]
           [call] op=[aten::empty.memory_format], key=[BackendSelect]
            [redispatch] op=[aten::empty.memory_format], key=[CPU]
           [call] op=[aten::any.dims_out], key=[QuantizedCPU]
            [call] op=[aten::any.dims], key=[QuantizedCPU]
             [call] op=[aten::empty.memory_format], key=[BackendSelect]
              [redispatch] op=[aten::empty.memory_format], key=[CPU]
             [call] op=[aten::any.dims_out], key=[QuantizedCPU]
              [call] op=[aten::any.dims], key=[QuantizedCPU]
               [call] op=[aten::empty.memory_format], key=[BackendSelect]
                [redispatch] op=[aten::empty.memory_format], key=[CPU]
               [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                [call] op=[aten::any.dims], key=[QuantizedCPU]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                  [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                  [call] op=[aten::any.dims], key=[QuantizedCPU]
.....
.....
.....
```
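
The call pattern in the trace above can be modeled with a small pure-Python sketch. `any_dims` and `any_dims_out` here are hypothetical stand-ins for the two overloads, not the real dispatcher code; the sketch only assumes, as the trace suggests, that each one forwards to the other without ever reaching a kernel.

```python
# Toy model of the mutual recursion; names mirror the trace but are not real APIs.
def any_dims(x):
    # Functional variant: allocate an output, then forward to the out variant.
    out = [False]
    return any_dims_out(x, out)


def any_dims_out(x, out):
    # Buggy out variant: calls back into the functional variant
    # instead of running an actual reduction.
    return any_dims(x)


try:
    any_dims([1, 0, 1])
    terminated = True
except RecursionError:
    # Python raises RecursionError; the native code has no such guard,
    # so the stack overflows and the process segfaults.
    terminated = False

print("terminated normally:", terminated)
```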

Fixes #116452
Fixes #116451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116457
Approved by: https://github.com/malfet
2024-01-04 17:37:17 +00:00
Aaron Gokaslan
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parentheses, i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
Nikita Shulga
09ee96b69d [MPS] Fix CrossEntropyLoss for float16 (#116597)
Looks like neither [`divisionNoNaNWithPrimaryTensor:`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3675593-divisionnonanwithprimarytensor) nor `oneHotWithIndicesTensor:` works for `MPSDataTypeFloat16`, so provide an explicit cast for one-hot tensor and alternative implementation using the formula from the official doc, i.e.
> `resultTensor = select(secondaryTensor, primaryTensor / secondaryTensor, 0)`

Alas, at the moment it cannot be tested via `test_modules.py`, as it runs only `torch.float32` and `torch.float64` tests (and the `torch.half` implementation is not available for CPU)
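
The `select` formula quoted above can be sketched elementwise in plain Python. This is only an illustration of the formula's semantics on lists of floats; `division_no_nan` is a hypothetical helper, not the MPSGraph call or the code added by this PR.

```python
def division_no_nan(primary, secondary):
    """Elementwise primary / secondary, yielding 0 wherever secondary is 0."""
    return [p / s if s != 0 else 0.0 for p, s in zip(primary, secondary)]


print(division_no_nan([1.0, 2.0, 3.0], [2.0, 0.0, 4.0]))  # [0.5, 0.0, 0.75]
```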

Fixes https://github.com/pytorch/pytorch/issues/116095

TODO: Enable testing via TestModules, but will do in separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116597
Approved by: https://github.com/kulinseth
2024-01-03 05:58:26 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could. For a few, I could not figure out what the proper fix was, so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
PyTorch MergeBot
0978482afa Revert "Implement aten::upsample_linear1d on mps (#115031)"
This reverts commit c6969cb8a9.

Reverted https://github.com/pytorch/pytorch/pull/115031 on behalf of https://github.com/malfet due to Broke lint, will fwd fix and re-land ([comment](https://github.com/pytorch/pytorch/pull/115031#issuecomment-1869693081))
2023-12-26 18:01:49 +00:00
Kai
c6969cb8a9 Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2023-12-26 15:44:21 +00:00
Aaron Gokaslan
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00
Sun, Jiayi
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8 ,8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8 ,8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
Nikita Shulga
9dda4b20a0 [MPS] Enable select/[broad]cast ops for complex dtypes (#115727)
By representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` Metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument: a fully specialized `cast` function, which is a no-op for regular types but is special-cased for float->complex and complex->float

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
2023-12-19 02:25:28 +00:00
Peter Pham
74dfdc567b [MPS] aten::erfinv bug fix: add storage offset buffers to handle slicing (#105801)
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706

The following test would fail without this bug fix:

```
import torch
def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()

        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()

        print(y)
        print(y2)

        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
2023-12-15 23:14:03 +00:00
Lucas Steuernagel
2e517b20d9 [MPS] Add Conv3D support for MPS (#114183)
Fixes #77818

I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul work. Quoting the description here:

> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)

Please, let me know if there are any other changes needed and I'll be happy to implement them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
2023-12-15 23:05:01 +00:00
mingfeima
a8acd6c410 Add Half support for AvgPool2d on CPU (#109578)
Add Half support for AvgPool2d (both channels last and channels first) on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-12-12 12:59:47 +00:00
igm503
f017a1af3f [MPS] add complex_out to MPS backend (#110851)
Adds support for at::complex_out to the MPS backend

Implemented in a binary kernel using the view_as_real pattern for handling complex dtypes in the mps backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110851
Approved by: https://github.com/kulinseth
2023-12-11 13:37:55 +00:00
Li-Huai (Allan) Lin
38e1440bae [MPS] Remove redundant topk test and move all pad tests inside a class (#113313)
Summary:
1. The removed `topk` test is essentially very similar to the following test, so I remove it:
```python
def test_topk(self):
        def helper(shape):
            cpu_x = torch.randn(shape, device='cpu', dtype=torch.float, requires_grad=False)
            x = cpu_x.detach().clone().to('mps')
            for largest_val in [True, False]:
                if (type(shape) == tuple):
                    for curr_dim in range(0, len(shape)):
                        dim_size = shape[curr_dim]
                        for k in range(1, dim_size + 1):
                            topk_values, topk_indices = torch.topk(x, k, dim=curr_dim, largest=largest_val)
                            topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=curr_dim, largest=largest_val)
                            self.assertEqual(topk_values, topk_values_cpu)
                            self.assertEqual(topk_indices, topk_indices_cpu)
                else:
                    for k in range(1, shape):
                        topk_values, topk_indices = torch.topk(x, k, dim=0, largest=largest_val)
                        topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=0, largest=largest_val)
                        self.assertEqual(topk_values, topk_values_cpu)
                        self.assertEqual(topk_indices, topk_indices_cpu)

        helper(2)
        helper((5, 1))
        helper((1, 5))
        helper((5, 9, 7, 4))
        helper((50, 20, 7, 4))
```
297c26bb8e/test/test_mps.py (L8054-L8091)

2. Move all pad tests to one standalone class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113313
Approved by: https://github.com/kulinseth
ghstack dependencies: #113312
2023-12-01 06:52:07 +00:00
Li-Huai (Allan) Lin
88a659e752 [MPS] Move non-nll loss tests outside TestNLLLoss (#113312)
The diff looks messy but this PR essentially does one thing: move non-nll loss tests in the `TestNLLLoss` class to the `TestMPS` class. After doing so, there are two stack tests with the same name, `test_stack`; therefore, I rename one of them to `test_stack_storage_offset`, which is what the test actually does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113312
Approved by: https://github.com/kulinseth
2023-12-01 06:52:07 +00:00
Nikita Shulga
1b27eae65e [MPS] Fix out-of-bounds fill to sliced tensor (#114838)
This fixes a regression introduced by https://github.com/pytorch/pytorch/pull/81951 that caused out-of-bounds access when a sliced tensor is filled with zeros

Remove bogus `TORCH_INTERNAL_ASSERT(length >= offset)` as [NSMakeRange](https://developer.apple.com/documentation/foundation/1417188-nsmakerange?language=objc) arguments are location and length rather than start and end offset.

In `fill_mps_tensor_`:
- Pass `value` argument to `MPSStream::fill`
- Pass `self.nbytes()` rather than `self.storage().nbytes()` as the length of the buffer to fill, as the latter always results in an out-of-bounds write if the offset within the storage is non-zero

Add regression test
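
The two points above can be sketched in pure Python. `make_range` is a hypothetical model of the `(location, length)` convention, and the byte counts are made-up numbers for a 5-element float32 storage with a 2-element view at offset 3; none of this is the actual MPS code.

```python
def make_range(location, length):
    """Model of NSMakeRange semantics: (location, length), not (start, end)."""
    return range(location, location + length)


storage_nbytes = 20  # assumed: 5 float32 elements in the underlying storage
offset = 12          # assumed: the view starts at element 3
view_nbytes = 8      # assumed: the view covers 2 float32 elements

good = make_range(offset, view_nbytes)     # fill only the view
bad = make_range(offset, storage_nbytes)   # fill the whole storage size from the offset

print(good.stop <= storage_nbytes)  # True: stays inside the buffer
print(bad.stop <= storage_nbytes)   # False: writes past the end
print(view_nbytes >= offset)        # False: the removed assert would reject a valid range
```

With a `(location, length)` pair, a length smaller than the offset is perfectly valid, which is why the `length >= offset` assertion was bogus.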

Fixes https://github.com/pytorch/pytorch/issues/114692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114838
Approved by: https://github.com/atalman, https://github.com/kulinseth
2023-12-01 06:24:42 +00:00
Khushi Agrawal
cff84871ce [reland][opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#114589)
Previous PR: #113885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114589
Approved by: https://github.com/lezcano
2023-11-27 14:45:44 +00:00
PyTorch MergeBot
150aaf46ca Revert "[opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#113885)"
This reverts commit 4fa1ff8404.

Reverted https://github.com/pytorch/pytorch/pull/113885 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its TestCommonCUDA::test_compare_cpu_nn_functional_conv3d test is failing in trunk 4fa1ff8404 ([comment](https://github.com/pytorch/pytorch/pull/113885#issuecomment-1827268473))
2023-11-27 07:33:00 +00:00
Khushi Agrawal
4fa1ff8404 [opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#113885)
Previous PR: https://github.com/pytorch/pytorch/pull/85202

Also, cc'ing @lezcano @kshitij12345 @zou3519, who reviewed my previous PR. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113885
Approved by: https://github.com/lezcano
2023-11-26 13:44:30 +00:00
Nikita Shulga
324cde59b2 [MPS] Fix test_copy_cast_no_leak (#114313)
When running on MacOS-13.2, the test always fails on the first run but succeeds on the second, presumably because the first run reserves some memory to cache the f32->f16 graph. Make the test resilient against such failures by adding a warmup step in which one conversion is performed before recording driver memory utilization.

Fixes https://github.com/pytorch/pytorch/issues/114305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114313
Approved by: https://github.com/huydhn
2023-11-22 14:48:24 +00:00
Nikita Shulga
b5dd37f23e [MPS] Fix memory leak in copy_from_mps_ (#114197)
By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
Leak was introduced by https://github.com/pytorch/pytorch/pull/84928
Add regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL

======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
    method(*args, **kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
    self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory

To execute this test, run the following from the base repo dir:
     python test/test_mps.py -k test_copy_cast_no_leak

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 1.102s

FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK
...
```

Fixes https://github.com/pytorch/pytorch/issues/114096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114197
Approved by: https://github.com/kit1980
2023-11-21 14:52:55 +00:00
Li-Huai (Allan) Lin
538114db65 [MPS] Fix and refactor unary/binary ops with non-zero offset or non-contiguous output (#97085)
Fixes #100764

This PR fixes the unary ops implementation and refactors the binary ops implementation a bit.

For unary ops:
Previously we didn't take into account unary ops that have a non-contiguous or storage-offset output, which caused incorrect results (because the MPS graph kernel always writes the buffer contiguously). Therefore, this PR creates a temporary output tensor for the graph first and then copies the result back to the original output tensor. I don't think we currently have a better fix than this.
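
The temporary-output approach described above can be modeled on a flat Python list. `contiguous_kernel` is a hypothetical stand-in for a graph kernel that only writes densely (negation is an arbitrary example op), and `strided_write` models the copy back into a strided view; this is a sketch of the idea, not the MPS implementation.

```python
def contiguous_kernel(src, dst):
    """Stand-in for a graph kernel: always writes dst[0], dst[1], ... densely."""
    for i, value in enumerate(src):
        dst[i] = -value  # example unary op: negation


def strided_write(values, storage, offset, stride):
    """Copy values into a strided view of storage."""
    for i, value in enumerate(values):
        storage[offset + i * stride] = value


src = [1.0, 2.0, 3.0]
# The output is a view over a 6-element storage with offset 1 and stride 2.

# Wrong: letting the kernel write straight into the storage ignores the view.
wrong = [0.0] * 6
contiguous_kernel(src, wrong)

# Right: run the kernel into a temporary buffer, then copy through the view.
tmp = [0.0] * len(src)
contiguous_kernel(src, tmp)
right = [0.0] * 6
strided_write(tmp, right, offset=1, stride=2)

print(wrong)  # [-1.0, -2.0, -3.0, 0.0, 0.0, 0.0]
print(right)  # [0.0, -1.0, 0.0, -2.0, 0.0, -3.0]
```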

For binary ops, see https://github.com/pytorch/pytorch/pull/97085#discussion_r1140999125

See the added test for repro.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97085
Approved by: https://github.com/malfet
2023-11-14 22:03:21 +00:00
Nikita Shulga
265d6aac0b [MPS] Fix crashes during Conv backward pass (#113398)
By adding weights tensor to the MPSGraph cache key.
Add regression test to validate that collision no longer happens
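
A rough sketch of why a cache key that omits the weights can collide. The key format and the `cache_key` helper are hypothetical; how the real MPSGraph key is built is not shown in this commit message.

```python
def cache_key(op, input_shape, grad_shape, weight_shape=None):
    """Build a string key from tensor shapes (hypothetical format)."""
    parts = [op, str(tuple(input_shape)), str(tuple(grad_shape))]
    if weight_shape is not None:
        parts.append(str(tuple(weight_shape)))
    return ":".join(parts)

# Two backward convolutions that differ only in their weights.
shapes = dict(input_shape=(1, 3, 8, 8), grad_shape=(1, 4, 8, 8))
old_a = cache_key("conv_bwd", **shapes)
old_b = cache_key("conv_bwd", **shapes)
new_a = cache_key("conv_bwd", weight_shape=(4, 3, 1, 1), **shapes)
new_b = cache_key("conv_bwd", weight_shape=(4, 3, 3, 3), **shapes)

print(old_a == old_b)  # keys without weights are identical
print(new_a == new_b)  # keys with weights differ
```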

Fixes https://github.com/pytorch/pytorch/issues/112998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113398
Approved by: https://github.com/kulinseth
2023-11-10 04:29:33 +00:00
Li-Huai (Allan) Lin
740137df6f [MPS] Add bucketize op (#112830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112830
Approved by: https://github.com/kulinseth, https://github.com/malfet
ghstack dependencies: #112829
2023-11-07 17:22:08 +00:00
Li-Huai (Allan) Lin
c4bb77323d [MPS] Add searchsorted op (#112829)
The Metal kernels closely follow `Bucketization.cu`.

Benchmark:
```
[----------------------------- searchsorted ----------------------------]
                                                         |  cpu   |  mps
1 threads: --------------------------------------------------------------
      Batch size: 8; In features: 64; Sorter: True       |    44  |   530
      Batch size: 8; In features: 64; Sorter: False      |    31  |    12
      Batch size: 8; In features: 256; Sorter: True      |   131  |   520
      Batch size: 8; In features: 256; Sorter: False     |   107  |    12
      Batch size: 8; In features: 1024; Sorter: True     |   499  |   590
      Batch size: 8; In features: 1024; Sorter: False    |   398  |    12
      Batch size: 16; In features: 64; Sorter: True      |    71  |   540
      Batch size: 16; In features: 64; Sorter: False     |    57  |    12
      Batch size: 16; In features: 256; Sorter: True     |   242  |   610
      Batch size: 16; In features: 256; Sorter: False    |   200  |    12
      Batch size: 16; In features: 1024; Sorter: True    |   999  |   720
      Batch size: 16; In features: 1024; Sorter: False   |   842  |    12
      Batch size: 32; In features: 64; Sorter: True      |   124  |   509
      Batch size: 32; In features: 64; Sorter: False     |   103  |    12
      Batch size: 32; In features: 256; Sorter: True     |   477  |   650
      Batch size: 32; In features: 256; Sorter: False    |   407  |    12
      Batch size: 32; In features: 1024; Sorter: True    |  1940  |   833
      Batch size: 32; In features: 1024; Sorter: False   |  1710  |    12
      Batch size: 64; In features: 64; Sorter: True      |   231  |   590
      Batch size: 64; In features: 64; Sorter: False     |   194  |    12
      Batch size: 64; In features: 256; Sorter: True     |   937  |   710
      Batch size: 64; In features: 256; Sorter: False    |   800  |    13
      Batch size: 64; In features: 1024; Sorter: True    |  3980  |  1290
      Batch size: 64; In features: 1024; Sorter: False   |  3330  |    12
      Batch size: 128; In features: 64; Sorter: True     |   448  |   650
      Batch size: 128; In features: 64; Sorter: False    |   390  |    13
      Batch size: 128; In features: 256; Sorter: True    |  1830  |   850
      Batch size: 128; In features: 256; Sorter: False   |  1590  |    12
      Batch size: 128; In features: 1024; Sorter: True   |  7790  |  2850
      Batch size: 128; In features: 1024; Sorter: False  |  6670  |    13
```
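
For reference, the semantics of `searchsorted` (and of `bucketize`, which is the same lookup with the arguments swapped) can be sketched with the standard-library `bisect` module. This is a pure-Python illustration of the behavior, not the Metal kernel; the `sorter` handling assumes it holds the argsort of an unsorted input.

```python
from bisect import bisect_left, bisect_right

def searchsorted(sorted_seq, values, right=False, sorter=None):
    """Return insertion indices that keep sorted_seq ordered."""
    # With a sorter, sorted_seq is unsorted and sorter holds its argsort indices.
    if sorter is not None:
        sorted_seq = [sorted_seq[i] for i in sorter]
    find = bisect_right if right else bisect_left
    return [find(sorted_seq, v) for v in values]

def bucketize(values, boundaries, right=False):
    """Same lookup, with the boundaries given as the second argument."""
    return searchsorted(boundaries, values, right=right)

left = searchsorted([1, 3, 5, 7], [3, 6])
right = searchsorted([1, 3, 5, 7], [3, 6], right=True)
via_sorter = searchsorted([5, 1, 7, 3], [3, 6], sorter=[1, 3, 0, 2])
```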

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112829
Approved by: https://github.com/malfet
2023-11-07 17:22:08 +00:00
CaoE
455241bbd3 Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU (#112138)
Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112138
Approved by: https://github.com/cpuhrsch
2023-11-06 06:01:29 +00:00
CaoE
26b5e27ace Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU (#112132)
Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112132
Approved by: https://github.com/cpuhrsch
2023-11-05 12:31:38 +00:00
Li-Huai (Allan) Lin
30237aaeec [MPS] Fix bug when value is of complex (#111937)
When the value passed to `fill` is complex, the check `value.toDouble() == 0.0` errors out, reporting that converting complex to double would overflow. So we should handle the complex value first and only then enter this condition.
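
The ordering problem has a close Python analogue: `float()` refuses a complex argument, so the complex case must be checked before any conversion to a real number. The `is_zero_fill` helper below is hypothetical and only mirrors the order of the checks, not the C++ code.

```python
def is_zero_fill(value):
    """Return True when filling with value is equivalent to zeroing."""
    # Handle complex first: float(complex) raises TypeError in Python,
    # much like the failing conversion described above.
    if isinstance(value, complex):
        return value == 0
    return float(value) == 0.0

checks = [is_zero_fill(0j), is_zero_fill(1 + 2j), is_zero_fill(0), is_zero_fill(2.5)]
```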

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
2023-10-31 17:50:56 +00:00
CaoE
a310cc8968 Add Half support for kthvalue, cross, hist, and logit on CPU (#112135)
Add Half support for kthvalue, cross, hist, and logit on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112135
Approved by: https://github.com/cpuhrsch
2023-10-31 09:12:47 +00:00
Peter Bell
bbd5b935e4 Use pytree.tree_leaves everywhere (#112324)
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.
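
To illustrate the difference, here is a minimal stand-alone sketch of the two helpers for nested lists, tuples, and dicts. It is not the `torch.utils._pytree` implementation; the spec format is invented for the example.

```python
def tree_flatten(tree):
    """Return (leaves, spec); the spec is only needed to rebuild the tree."""
    if isinstance(tree, dict):
        kind, children = ("dict", list(tree.keys())), list(tree.values())
    elif isinstance(tree, (list, tuple)):
        kind, children = (type(tree).__name__, len(tree)), list(tree)
    else:
        return [tree], "leaf"
    leaves, specs = [], []
    for child in children:
        child_leaves, child_spec = tree_flatten(child)
        leaves.extend(child_leaves)
        specs.append(child_spec)
    return leaves, (kind, specs)

def tree_leaves(tree):
    """Return only the leaves, without building a spec."""
    if isinstance(tree, dict):
        children = tree.values()
    elif isinstance(tree, (list, tuple)):
        children = tree
    else:
        return [tree]
    return [leaf for child in children for leaf in tree_leaves(child)]

nested = {"a": [1, 2], "b": (3, {"c": 4})}
leaves = tree_leaves(nested)
```

Callers that discard the spec, as in `tree_flatten(...)[0]`, get the same list from `tree_leaves` with the intent stated directly.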

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327, #112323
2023-10-30 03:39:04 +00:00
Cao E
1c89ea7f72 Add Half support for softmax and log_softmax on CPU (#103315)
Add Half support for softmax and log_softmax on CPU.
Note: This introduces a correctness issue with MPS https://github.com/pytorch/pytorch/issues/111416 and https://github.com/pytorch/pytorch/issues/111479.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103315
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/malfet
2023-10-26 08:38:54 +00:00
Peter Bell
46e80ce58a [ATen] Support multi dim any and all reductions (#110310)
This adds a new overload to `all` and `any` with support for multiple reduction dims.
```
all.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
any.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
```
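
A pure-Python sketch of what reducing over several dims at once means, using a row-major flat list plus a shape. The `reduce_dims` helper is hypothetical and exists only to show the semantics (with `keepdim=False`).

```python
from itertools import product

def reduce_dims(data, shape, dims, op):
    """Reduce a row-major flat list over the given dims with all() or any()."""
    rank = len(shape)
    reduced = set(range(rank)) if dims is None else {d % rank for d in dims}
    kept = [d for d in range(rank) if d not in reduced]
    groups = {}
    # product() walks indices in row-major order, matching the flat layout.
    for flat_index, index in enumerate(product(*(range(n) for n in shape))):
        key = tuple(index[d] for d in kept)
        groups.setdefault(key, []).append(data[flat_index])
    out_shape = [shape[d] for d in kept]
    out = [op(groups.get(key, [])) for key in product(*(range(n) for n in out_shape))]
    return out, out_shape

data = [True, True, True, False, True, True, True, True]  # shape (2, 2, 2)
all_12 = reduce_dims(data, (2, 2, 2), (1, 2), all)
all_02 = reduce_dims(data, (2, 2, 2), (0, 2), all)
any_all = reduce_dims(data, (2, 2, 2), None, any)
```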
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110310
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/justinchuby
2023-10-24 21:33:53 +00:00
Li-Huai (Allan) Lin
4b804dac33 [MPS] Add complex support for fill (#111885)
Fixes #110537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111885
Approved by: https://github.com/malfet
2023-10-24 06:41:10 +00:00
CaoE
4b324a8717 Add Half support for aminmax on CPU (#106853)
Add Half support for aminmax on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106853
Approved by: https://github.com/cpuhrsch
2023-10-23 17:43:47 +00:00
CaoE
d1afb7d43d add Half support for multinomial on CPU (#104178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5, https://github.com/kulinseth, https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
CaoE
2a40b7efcb Add Half support for addcmul, addcdiv, cumsum, and topk on CPU (#103319)
Add Half support for addcmul, addcdiv, cumsum, and topk on CPU.
Note: This PR will introduce the issue  https://github.com/pytorch/pytorch/issues/111454.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103319
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-10-19 17:47:45 +00:00
CaoE
8713a1a363 add Half support for bernoulli on CPU (#104176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00
Kurt Mohler
5292a92e03 Add torch.unravel_index (#110580)
Fixes #35674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110580
Approved by: https://github.com/lezcano, https://github.com/kulinseth
2023-10-12 00:55:51 +00:00
igm503
95ff51d8ed [MPS] Add support for Softshrink to MPS Backend (#110814)
Adds the softshrink activation function to the mps backend.
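
For reference, softshrink shrinks values toward zero by `lambd` and zeroes everything inside `[-lambd, lambd]`. A scalar sketch in plain Python (not the MPS kernel):

```python
def softshrink(x, lambd=0.5):
    """Scalar softshrink: shift toward zero by lambd, zero inside the band."""
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0

values = [softshrink(v) for v in (1.0, -2.0, 0.25)]
```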
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110814
Approved by: https://github.com/kulinseth
2023-10-11 07:55:39 +00:00
igm503
4b881b0da3 [MPS] add support for sgn to MPS backend (#110829)
Fixes #86805

Adds support for sgn to MPS backend.

Notes:

1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.

2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet's recent contributions, however, @Berzeg's implementation works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for mps, so that the backward function works.

3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the PR (he seems to be AWOL since last working on this).
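
For context, `sgn` extends `sign` to complex numbers by returning `z / |z|`, with zero mapped to zero. A scalar sketch using Python's built-in complex type (an illustration, not the code in UnaryOps.mm):

```python
def sgn(z):
    """Complex sign: z / |z|, or 0 when z is 0."""
    z = complex(z)
    magnitude = abs(z)
    return 0j if magnitude == 0 else z / magnitude

unit = sgn(3 + 4j)
```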

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
2023-10-09 16:53:25 +00:00
vfdev-5
d2a2a67fa4 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-09 10:55:56 +00:00
igm503
a389181f2e [MPS] add support for aten::nextafter (#109685)
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1722515591

Adds support for aten::nextafter to the MPS backend. Supports float and half types.

Notes:
- I've added nextafter to the output_grad_check XFAILLIST since neither this nor the cpu implementations have grad functions
- Metal Shading Language 3.1 seems to have a native nextafter() function, so once that's available, this kernel can just call that.
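
One common way to implement `nextafter` without a native function is to step the IEEE 754 bit pattern by one. The float32 sketch below uses `struct` for the bit casts; it is an assumption about the general technique, not a copy of the Metal kernel, and it expects inputs that are representable as float32.

```python
import math
import struct

def _f32_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def _bits_f32(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def nextafter_f32(x, y):
    """Next representable float32 after x in the direction of y."""
    if math.isnan(x) or math.isnan(y):
        return math.nan
    if x == y:
        return y
    if x == 0.0:
        # Smallest subnormal, signed toward y.
        return _bits_f32(0x00000001 if y > 0 else 0x80000001)
    bits = _f32_bits(x)
    # Moving away from zero increments the bit pattern; moving toward zero decrements it.
    if (x < y) == (x > 0):
        bits += 1
    else:
        bits -= 1
    return _bits_f32(bits)

up = nextafter_f32(1.0, 2.0)
down = nextafter_f32(1.0, 0.0)
```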
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109685
Approved by: https://github.com/kulinseth
2023-10-03 19:20:22 +00:00
PyTorch MergeBot
df3ab70dde Revert "Added new test sample to interpolate op in OpInfo (#104181)"
This reverts commit 87f8bc65f8.

Reverted https://github.com/pytorch/pytorch/pull/104181 on behalf of https://github.com/peterbell10 due to Causing OOM in slow-gradcheck ([comment](https://github.com/pytorch/pytorch/pull/104181#issuecomment-1745472323))
2023-10-03 18:07:02 +00:00
vfdev-5
87f8bc65f8 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-02 15:35:48 +00:00
CaoE
9399e0b1ff add fp16 support for gemm (#99498)
### Testing

Native matmul vs. mkldnn matmul  on SPR (with avx512_fp16 support)

single core:

Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401

56 cores:
Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 |  18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849  | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 |  7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128  | 2032.523 | 0.705 | 2882.23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99498
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-09-28 01:03:50 +00:00
Li-Huai (Allan) Lin
ac1e85161e [MPS] Fix nll_loss with default ignore_index (#109574)
`-100` should be a valid `ignore_index` as indicated in the linked issue. This PR also cleans up some unnecessary MPSTensor copies.
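
A small reference sketch of how `ignore_index` is meant to behave with mean reduction: ignored targets contribute neither to the sum nor to the divisor, and `-100` is just another value to skip. The function below is a plain-Python illustration with unweighted classes, not the MPS code.

```python
def nll_loss(log_probs, targets, ignore_index=-100):
    """Mean negative log-likelihood over the targets that are not ignored."""
    total, count = 0.0, 0
    for row, target in zip(log_probs, targets):
        if target == ignore_index:
            continue  # skipped rows affect neither the sum nor the count
        total -= row[target]
        count += 1
    # In this sketch an all-ignored batch yields nan.
    return total / count if count else float("nan")

log_probs = [[-0.5, -1.0], [-2.0, -0.25], [-3.0, -4.0]]
loss = nll_loss(log_probs, [0, -100, 1])
```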

Fixes #108148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109574
Approved by: https://github.com/kulinseth
ghstack dependencies: #109557
2023-09-26 04:13:09 +00:00
Li-Huai (Allan) Lin
0087118997 [MPS] Fix mps to cpu copy with storage offset (#109557)
Fix #108978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109557
Approved by: https://github.com/DenisVieriu97
2023-09-26 04:13:08 +00:00
CaoE
7c9052165a add fp16 support for native conv and deconv on CPU (#99497)
### Testing

Native conv vs. mkldnn conv on SPR (with avx512_fp16 support)

Single core:

Input | Naïve impl   / us | oneDNN /   us | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 34676789 | 524199.8 | 66.15185
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 33454125 | 349844.4 | 95.62573
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 317650.1 | 2317.677 | 137.0554
IC: 128, OC: 256, kernel: 3, stride: 1,   N: 1, L: 64 | 15334.68 | 167.264 | 91.67952

56 cores:
Input | Naïve impl   / us | oneDNN /   us | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 1032064 | 11073.58 | 93.20061
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 1000097 | 16371.19 | 61.08883
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 981813.4 | 9008.908 | 108.9825
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 1082606 | 10150.47 | 106.6558
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 319980.6 | 181.598 | 1762.027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99497
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-09-25 01:31:26 +00:00
igm503
255d1a776a [MPS] Add support for Mish to MPS backend (#109786)
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1712894444

Adds the mish activation function to the mps backend.
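
For reference, mish is defined as `x * tanh(softplus(x))`. A scalar sketch in plain Python (not the MPS graph code), using a numerically stable softplus:

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

samples = [mish(v) for v in (-1.0, 0.0, 1.0)]
```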

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109786
Approved by: https://github.com/kulinseth
2023-09-21 21:01:20 +00:00
igm503
0317626df5 [MPS] adding weight_norm_interface support for mps (#108008)
Fixes #104513

Adds support for aten::_weight_norm_interface to the mps backend.

Also adds a consistency test for the output and the grad.
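
As a reminder of what the op computes, weight normalization reparameterizes a weight as `w = g * v / ||v||` and also returns the norm. The sketch below handles a single vector only; the real op normalizes over all dimensions except a chosen `dim`, which this illustration leaves out.

```python
import math

def weight_norm(v, g):
    """Return (w, norm) with w = g * v / ||v|| for one vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v], norm

w, norm = weight_norm([3.0, 4.0], 10.0)
```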
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108008
Approved by: https://github.com/kulinseth
2023-09-20 02:18:28 +00:00
CaoE
54c28c564f add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/mingfeima
2023-09-19 10:43:33 +00:00
PyTorch MergeBot
be9f73f031 Revert "Add meta and OpInfo for _embedding_bag_dense_backward (#109211)"
This reverts commit fe14e43d14.

Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing 492a93d185 https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918))
2023-09-14 22:29:12 +00:00
Edward Z. Yang
fe14e43d14 Add meta and OpInfo for _embedding_bag_dense_backward (#109211)
The sample inputs are a bit involved because there are a lot of
shenanigans in the derivative formula.  Check comments.

This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-09-14 18:49:32 +00:00
PyTorch MergeBot
b226373d16 Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit b6a1d3fb97.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to I'm very sorry but it looks like #106543 was not fixed, I still see it failing on main b6a1d3fb97 https://github.com/pytorch/pytorch/actions/runs/6185704949/job/16793975677 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1719747065))
2023-09-14 16:13:34 +00:00
CaoE
b6a1d3fb97 add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-14 12:23:59 +00:00
PyTorch MergeBot
04a765f95d Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit 6065e7a97c.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to sorry it looks like this is causing an unexpected success for `test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_nn_functional_batch_norm_cpu_float16` 6065e7a97c https://github.com/pytorch/pytorch/actions/runs/6178069462/job/16770849782 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1718402208))
2023-09-13 22:38:42 +00:00
Nikita Shulga
916183a012 [MPS] Fix crash if nonzero is called concurrently (#108996)
Surrounds the `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a no-op for a single-threaded program but serializes calls to `synchronize` across threads that use the same stream.

Prevents the `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` non-recoverable exception, which is triggered every time one uses PyCharm to inspect tensors on an MPS device.

Fixes https://github.com/pytorch/pytorch/issues/100285
### <samp>🤖 Generated by Copilot at 1662ce2</samp>

> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
2023-09-13 19:28:47 +00:00
CaoE
6065e7a97c add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-13 17:30:16 +00:00
igm503
1b9b3a2d15 [MPS] Adding lgamma, digamma, and polygamma implementations (#106292)
Fixes issue mentioned in #77764

e.g. https://github.com/pytorch/pytorch/issues/77764#issuecomment-1654111744

Adds MPS support for the following ops:

- lgamma
- mvlgamma
- digamma
- polygamma

The lgamma function does not yet have an MPS backend implementation. I've added one using a custom Metal kernel (following John D. Cook's C++ implementation of the log gamma function: https://www.johndcook.com/blog/cpp_gamma/). For the backward pass op, I've added a digamma kernel that follows the cpu+cuda digamma implementation, and for the backward pass of the digamma op, I've added a polygamma + trigamma kernel following, again, the cpu+cuda implementations.
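
The relationship between these ops (digamma is the derivative of lgamma) can be checked numerically. The sketch below uses a textbook recurrence plus asymptotic series for positive `x`; it is not the kernel from this PR, and the cutoff of 10 is an arbitrary choice for the example.

```python
import math

def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    # psi(x) = psi(x + 1) - 1/x, applied until the series is accurate.
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def lgamma_slope(x, h=1e-5):
    """Central finite difference of math.lgamma, for comparison."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

difference = abs(digamma(2.5) - lgamma_slope(2.5))
```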

NOTE:

The cpu implementation of the polygamma function incorrectly (as far as I can tell) outputs a finite number for order = 1 and x in the negative integers. The mps implementation correctly outputs infinity. (see https://github.com/pytorch/pytorch/issues/106692)

The polygamma tests currently don't pass because of the error in the cpu+cuda kernels, but also because there are smallish discrepancies near the negative integers between the cpu+cuda and the mps polygamma and trigamma kernels. I'm not sure exactly why this is, but let me know if the discrepancies are too big.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106292
Approved by: https://github.com/kulinseth
2023-09-12 16:43:37 +00:00
Li-Huai (Allan) Lin
293d3b89d8 Add Opinfos for the Tensor overload of linspace/logspace (#107958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107958
Approved by: https://github.com/zou3519
2023-09-11 22:30:19 +00:00
Nikita Shulga
9b12a28d89 [MPS] Implement mul operation for complex types (#108395)
Using existing BinaryKernel template

Add `mul` as well as `kron` and `outer` to list of MPS ops that support complex types

This should add all the missing ops mentioned in https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108395
Approved by: https://github.com/albanD
ghstack dependencies: #108393, #108394
2023-09-10 05:39:12 +00:00
Nikita Shulga
c7bb842d35 [MPS] Add complex add/sub (#108394)
Using `view_as_real` and running elementwise ops on the resulting tensors
Add `add` and `sub` to list of complex ops that should work on MPS
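
The idea can be sketched in plain Python: view each complex number as a `[real, imag]` pair, add the pairs elementwise as ordinary floats, and view the result as complex again. The helper names mirror the PyTorch functions but are local stand-ins written for this example.

```python
def view_as_real(zs):
    """Represent each complex number as a [real, imag] pair."""
    return [[z.real, z.imag] for z in zs]

def view_as_complex(pairs):
    """Inverse of view_as_real."""
    return [complex(re, im) for re, im in pairs]

def complex_add(a, b):
    """Add complex lists using only real elementwise additions."""
    summed = [[x + y for x, y in zip(pa, pb)]
              for pa, pb in zip(view_as_real(a), view_as_real(b))]
    return view_as_complex(summed)

total = complex_add([1 + 2j, 3 - 1j], [0.5j, 1 + 1j])
```

Subtraction works the same way with `x - y`; multiplication does not, because it mixes the real and imaginary parts.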
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108394
Approved by: https://github.com/albanD
ghstack dependencies: #108393
2023-09-10 05:39:12 +00:00
Nikita Shulga
53a4ca4b58 [MPS][BE] Add dispatch_sync_with_rethrow (#108393)
And enable testing for match_output for complex types.
Most of them should throw an "unsupported XYZ" error, rather than crash.
This fixed several crashes when linalg ops were invoked with complex inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108393
Approved by: https://github.com/kit1980, https://github.com/kulinseth
2023-09-10 02:07:12 +00:00
alexdremov
b60273b88a [MPS] Pixel shuffle unshuffle support (#99306)
Fixes #83196

Now, MPS implementation is blazingly fast.

Though, I have several questions on improving this PR:

1. I copied code from `test_nn.py`. Is there a better way to test this?
2. I decided to use `usepixelshuffleorder:YES`. Am I right performance-wise? According to docs:
```
`usePixelShuffleOrder` can be
used to control how the data within spatial blocks is ordered in the
`depthAxis` dimension: with `usePixelShuffleOrder=YES` the values within the
spatial blocks are stored contiguously within the `depthAxis` dimension whereas
otherwise they are stored interleaved with existing values in the `depthAxis` dimension.
```
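
For reference, the index mapping that pixel shuffle performs can be written out in plain Python over nested lists. This sketch shows the PyTorch-level semantics for a single image shaped `[C*r*r][H][W]`; it does not use or model the MPSGraph call discussed above.

```python
def pixel_shuffle(x, r):
    """Rearrange [C*r*r][H][W] nested lists into [C][H*r][W*r]."""
    channels = len(x) // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [[[0] * (width * r) for _ in range(height * r)] for _ in range(channels)]
    for c in range(channels):
        for i in range(r):
            for j in range(r):
                # Input channel c*r*r + i*r + j fills offset (i, j) of each r x r block.
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out

shuffled = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], r=2)
```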

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99306
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-09-06 09:11:39 +00:00
CaoE
42f94d7e9f add Half support for maxpool on CPU (#98819)
### Testing
Single socket (28 cores):

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 4.12895 | 6.9669 | 5.30297 | 0.55775 | 1.98917 | 0.72233
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 0.85093 | 1.88813 | 1.38063 | 5.5742 | 36.5086 | 10.58552
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 22.37212 | 37.90383 | 30.94482 | 6.85868 | 10.6116 | 3.9993
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 5.41658 | 4.71098 | 4.66578 | 6.69875 | 14.7171 | 5.1167
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 10.69831 | 18.0468 | 13.71657 | 2.61192 | 4.96172 | 1.68635
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.52637 | 2.0096 | 2.0055 | 2.60314 | 7.2093 | 2.49843
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.47605 | 0.88398 | 0.65326 | 0.06525 | 0.115489 | 0.0674
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.10902 | 0.25293 | 0.157475 | 0.11386 | 0.53319 | 0.17836

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 90.9809 | 163.473 | 126.1276 | 6.57721 | 41.40833 | 11.82505
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 9.88405 | 38.39137 | 29.62069 | 7.10636 | 36.97535 | 11.0525
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 476.782 | 855.4769 | 648.2248 | 46.6488 | 219.2586 | 67.10599
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 80.29271 | 91.33854 | 87.80345 | 48.81692 | 203.9974 | 63.39004
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 235.2113 | 419.0799 | 315.4284 | 20.6049 | 107.1524 | 32.39169
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 29.47653 | 33.54905 | 32.82823 | 22.59674 | 98.5586 | 30.05763
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 7.90684 | 13.9208 | 10.03272 | 0.23725 | 1.35269 | 0.41728
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.33638 | 3.36894 | 2.64635 | 0.26535 | 1.244 | 0.38895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98819
Approved by: https://github.com/mingfeima, https://github.com/mikaylagawarecki
2023-09-05 18:23:41 +00:00
Nikita Shulga
bae409388c [MPS] Fix .item() for multi-dim scalar (#107913)
By refactoring `_local_scalar_dense_mps` to use `_empty_like` to allocate CPU tensor.
Also, print a more reasonable error message when dst dim is less than src in mps_copy_

This fixes regression introduced by https://github.com/pytorch/pytorch/pull/105617 and adds regression test.

### <samp>🤖 Generated by Copilot at abd06e6</samp>

> _Sing, O Muse, of the valiant deeds of the PyTorch developers_
> _Who strive to improve the performance and usability of tensors_
> _And who, with skill and wisdom, fixed a bug in the MPS backend_
> _That caused confusion and dismay to many a user of `item()`_

Fixes https://github.com/pytorch/pytorch/issues/107867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107913
Approved by: https://github.com/albanD
2023-08-31 21:08:29 +00:00
vfdev
b7624fc91e Cleaned up test_mps.py::test_output*_match (#108092)
Description:
- cleaned up test_mps.py::test_output_match and test_mps.py::test_output_grad_match tests
  - removed unused variables and useless brackets
  - simplified atol/rtol setup if/else code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108092
Approved by: https://github.com/kulinseth
2023-08-29 10:46:02 +00:00
Nikita Shulga
6e85a68829 [MPS] Implement polar via metal shader (#107324)
Use `view_as_real` to cast complex into a pair of floats and then it becomes just another binary operator.
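
Seen as a binary operator on real inputs, `polar(abs, angle)` writes `abs * cos(angle)` and `abs * sin(angle)` into the two float slots of each complex output. A plain-Python sketch of that view (not the Metal shader), checked against `cmath.rect`:

```python
import cmath
import math

def polar(abs_vals, angles):
    """Return [real, imag] pairs for abs * (cos(angle) + i * sin(angle))."""
    return [[r * math.cos(t), r * math.sin(t)] for r, t in zip(abs_vals, angles)]

pairs = polar([2.0, 1.0], [0.0, math.pi / 2])
reference = [cmath.rect(2.0, 0.0), cmath.rect(1.0, math.pi / 2)]
```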

Enable `polar` and `view_as_complex` consistency tests, but skip `test_output_grad_match_polar_cpu` as the `mul` operator is not yet supported

Remove redundant `#ifdef __OBJC__` and capture and re-throw exceptions captured during `createCacheBlock` block.
Fixes https://github.com/pytorch/pytorch/issues/78503

TODOs(in followup PRs):
  - Implement backwards (requires complex mul and sgn)
  - Measure the perf impact of computing the strides on the fly rather than ahead of time (unrelated to this PR)

Partially addresses https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107324
Approved by: https://github.com/albanD
2023-08-25 03:16:23 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
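
For readers unfamiliar with the pattern RUF017 targets: `sum(lists, [])` builds a new list at every step, so flattening costs quadratic time in the total length, while `itertools.chain.from_iterable` is linear. A small illustration (the function names are made up for the example):

```python
from itertools import chain

def flatten_quadratic(lists):
    """The pattern the rule flags: each + copies the accumulated list."""
    return sum(lists, [])

def flatten_linear(lists):
    """A linear-time alternative with the same result."""
    return list(chain.from_iterable(lists))

nested = [[1, 2], [3], [], [4, 5]]
flat = flatten_linear(nested)
```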

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
arunppsg
4bfc55ba8b [MPS] Enable forward test for renorm (#106666)
Enabled forward test for renorm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106666
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-08-17 16:46:06 +00:00
Jason Lu
bc88028e8e Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743)
Summary:
Original commit changeset: 81319beb97f3

Original Phabricator Diff: D47961182

Test Plan: revert to maintain backward compat with legacy ads_dper3 production package. Read details in: S357822

Reviewed By: atuljangra

Differential Revision: D48131623

@diff-train-skip-merge
(D48131623 landed internally)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106743
Approved by: https://github.com/malfet
2023-08-08 15:27:34 +00:00
Ramin Azarmehr
cdfd0ea162 [MPS] Introduce torch.mps.Event() APIs (#102121)
- Implement `MPSEventPool` to recycle events.
- Implement python bindings with `torch.mps.Event` class using the MPSEventPool backend. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()`.
- Add API to measure elapsed time between two event recordings.
- Added documentation for Event class to `mps.rst`.
- Added test case to `test_mps.py`.
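
A rough sketch of how the Event API listed above might be used to time a piece of MPS work. It assumes `torch.mps.Event` accepts an `enable_timing` flag and that `elapsed_time()` reports milliseconds, as I understand the `torch.mps` documentation; `time_on_mps` is a hypothetical helper, and the sketch returns `None` where PyTorch or an MPS device is not available.

```python
try:
    import torch
except ImportError:  # PyTorch may not be installed where this sketch is run
    torch = None


def time_on_mps(fn):
    """Return the time fn() takes on MPS (assumed milliseconds), or None if MPS is unavailable."""
    if torch is None or not torch.backends.mps.is_available():
        return None
    if getattr(getattr(torch, "mps", None), "Event", None) is None:
        return None  # PyTorch build predates torch.mps.Event
    start = torch.mps.Event(enable_timing=True)
    end = torch.mps.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    end.synchronize()  # wait until the work recorded before `end` has finished
    return start.elapsed_time(end)


elapsed = time_on_mps(
    lambda: torch.ones(1024, 1024, device="mps") @ torch.ones(1024, 1024, device="mps")
)
print("elapsed:", elapsed)
```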

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-08-08 03:45:45 +00:00
Li-Huai (Allan) Lin
d4d086ce7b [MPS] Fix Clamp with strided outputs/inputs (#97858)
Fixes #94396
Fixes #87348

1. If output is strided, we don't gather input tensors.
2. If output is not strided but min_t or max_t is strided, we make min_t or max_t contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97858
Approved by: https://github.com/kulinseth
2023-08-04 09:32:12 +00:00
Peter Stefek
c9c2b14c53 Fix copy_ broadcast behavior on mps (#105617)
Fixes #105277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105617
Approved by: https://github.com/malfet
2023-08-03 04:03:32 +00:00
PyTorch MergeBot
d83b887f2a Revert "Add error checking for padding modules (#106147)"
This reverts commit 0547b6279d.

Reverted https://github.com/pytorch/pytorch/pull/106147 on behalf of https://github.com/jeanschmidt due to sadly it is breaking internal builds, and I can't coordinate a FF due to timezone differences ([comment](https://github.com/pytorch/pytorch/pull/106147#issuecomment-1661870970))
2023-08-02 09:37:40 +00:00
Denis Vieriu
d1a2aa1909 [MPS] Fix MPS clamp issue with different dtypes between input and min/max tensors (#105747)
- Fix the FP16 clamp issue (FP32 and FP16 are not broadcast compatible)
- Fix clamp (cached graph nodes were previously replaced with the cast version)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105747
Approved by: https://github.com/kulinseth
2023-08-02 02:51:34 +00:00
Peter Stefek
97e5055a69 Add cumprod support for device mps (#104688)
Related to #77764

Add support for the cumprod operation (which in turn allows its gradient). This also allows us to compute the gradient of prod since it was blocked behind cumprod in the case where exactly one element of the tensor was 0.
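
To see why a single zero makes the gradient of `prod` awkward: the derivative with respect to element `i` is the product of all other elements, and the shortcut of dividing the total product by `x[i]` breaks down when `x[i]` is 0. Cumulative products from both ends avoid the division. The sketch below is plain Python for illustration, not the PyTorch autograd formula; `prod_grad` is a hypothetical helper.

```python
from itertools import accumulate
from operator import mul


def cumprod(xs):
    # Inclusive cumulative product: [x0, x0*x1, x0*x1*x2, ...]
    return list(accumulate(xs, mul))


def prod_grad(xs):
    """Gradient of the product of xs with respect to each element, without division."""
    if not xs:
        return []
    # prefix[i] = x_0 * ... * x_{i-1}
    prefix = [1.0] + cumprod(xs)[:-1]
    # suffix[i] = x_{i+1} * ... * x_{n-1}
    suffix = ([1.0] + cumprod(xs[::-1])[:-1])[::-1]
    return [p * s for p, s in zip(prefix, suffix)]


xs = [2.0, 0.0, 3.0]
print(cumprod(xs))
print(prod_grad(xs))
```

With exactly one zero, only the zero element receives a non-zero gradient (the product of the remaining elements).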

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104688
Approved by: https://github.com/kulinseth
2023-08-01 21:51:20 +00:00
Mikayla Gawarecki
0547b6279d Add error checking for padding modules (#106147)
Fixes https://github.com/pytorch/pytorch/issues/105627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106147
Approved by: https://github.com/albanD
ghstack dependencies: #106325
2023-08-01 12:49:58 +00:00
Mikayla Gawarecki
d8e5f2aa6d Reland "Make adding buffers more like adding parameters (#104069)" (#106224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106224
Approved by: https://github.com/atalman, https://github.com/albanD
2023-07-31 17:18:56 +00:00
cyy
b8eb827d93 use UBSAN on some tests (#103655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103655
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-07-24 14:24:49 +00:00
Peter Pham
bba06ad751 [MPS] aten::erfinv metal kernel ops (#101507)
I've added the implementation of erfinv using the algorithm from 4154c8ea15/aten/src/ATen/native/Math.h (L152) in order for the MPS based algorithm to match the CPU automatic test. This PR is using the new metal api calls from https://github.com/pytorch/pytorch/pull/100661

Testing shows MPS has a decent speedup (270x) compared to CPU on a tensor of 100 mil elements.
```
import torch
x = torch.arange(-1, 1, 1e-8) # default cpu tensor
# measure CPU compute time by calling torch.erfinv
# (%timeit is an IPython magic, so run this snippet in IPython or Jupyter)
time = %timeit -o -q -r 5 torch.erfinv(x)
cpu_time = time.average
print("CPU torch.erfinv time: ", cpu_time)
x = x.to("mps")
# measure MPS compute time
time = %timeit -o -q -r 5 torch.erfinv(x)
mps_time = time.average
print("MPS torch.erfinv time: ", mps_time)
print(f"MPS torch.erfinv is {cpu_time/mps_time*100} percent faster than CPU torch.erfinv")

# compute MSE between MPS and CPU torch.erfinv
x = x.to("cpu")
y_cpu = torch.erfinv(x)
x = x.to("mps")
y_mps = torch.erfinv(x)
y_mps = y_mps.to("cpu")
mask = torch.isfinite(y_cpu) & torch.isfinite(y_mps.to("cpu"))
y_mps = y_mps[mask]
y_cpu = y_cpu[mask]
x = x[mask]
print(f"length of y_mps: {len(y_mps)}, length of y_cpu: {len(y_cpu)}, length of x: {len(x)}")
mse = torch.square(y_cpu - y_mps).mean()
print("MSE between MPS and CPU torch.erfinv: ", mse)
diff = torch.abs(y_cpu - y_mps)
print("Largest difference")
print(f"x:  {x[torch.argmax(diff)]}, y_cpu: {y_cpu[torch.argmax(diff)]}, y_mps: {y_mps[torch.argmax(diff)]} , diff = {y_cpu[torch.argmax(diff)] - y_mps[torch.argmax(diff)]}")
```
CPU torch.erfinv time:  2.654937833400254
MPS torch.erfinv time:  0.009831255332002912
MPS torch.erfinv is 27005.07456822776 percent faster than CPU torch.erfinv
length of y_mps: 199999992, length of y_cpu: 199999992, length of x: 199999992
MSE between MPS and CPU torch.erfinv:  tensor(4.2339e-14)
Largest difference
x:  -0.9999980330467224, y_cpu: -3.363569736480713, y_mps: -3.3635685443878174 , diff = -1.1920928955078125e-06

Fixes https://github.com/pytorch/pytorch/issues/86808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101507
Approved by: https://github.com/kulinseth
2023-07-23 01:36:43 +00:00
Jane Xu
803d42e457 add lerp cpu support for half (#105607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105607
Approved by: https://github.com/albanD
2023-07-21 20:29:05 +00:00
Andrey Talman
c6653b65d8 Back out "Make adding buffers more like adding parameters (#104069)" (#105581)
Summary:
D47537831 is breaking pyper tests: https://fb.workplace.com/groups/802176577445480/posts/1018902842439518/

with `TypeError: register_buffer() takes 3 positional arguments but 4 were given`

Original commit changeset: d4b4069fbd38

Original Phabricator Diff: D47537831

Test Plan:
```
buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_inline_cvr_infer_pyper_pyper__canary_offline_training-launcher -- --run-harness-in-tupperware --build-fbpkg ads_dper3 --build-fbpkg training_platform
```

Reviewed By: atalman

Differential Revision: D47600140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105581
Approved by: https://github.com/mikaylagawarecki
2023-07-20 03:39:53 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Peter Stefek
d2c24eca8a Fix mps unary op issue on non densely stored tensors (#105512)
This PR fixes a bug where non-densely stored tensors were not converted to dense tensors of the correct scalar type in the MPS `unary_op` helper function.

Fixes https://github.com/pytorch/pytorch/issues/105284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105512
Approved by: https://github.com/malfet
2023-07-19 03:56:38 +00:00
Nikita Shulga
8cd94e1eab [MPS] Add lerp implementation (#105470)
lerp.Scalar fits very well into binary op template
Add a very naive implementation for `lerp.Tensor` as `add_out(self, weights.mul(end.sub(self)))`
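
The formula behind that expression is `start + weight * (end - start)`, applied elementwise. A minimal plain-Python sketch, using lists of floats in place of tensors (the `lerp` helper here is illustrative, not the MPS kernel):

```python
def lerp(start, end, weight):
    # Elementwise linear interpolation: start + weight * (end - start)
    return [s + w * (e - s) for s, e, w in zip(start, end, weight)]


print(lerp([0.0, 10.0], [10.0, 20.0], [0.5, 0.25]))
```

A weight of 0 returns `start`, a weight of 1 returns `end`, and values in between blend the two.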

Enable `lerp` testing in `test_mps`

Fixes https://github.com/pytorch/pytorch/issues/105382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105470
Approved by: https://github.com/albanD
2023-07-18 20:01:04 +00:00
ekamiti
32d422f335 Make adding buffers more like adding parameters (#104069)
Add semantics for creating a buffer object that mirror those for creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter of the `Buffer` type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes that make sure the `Buffer` type can be used as a drop-in replacement for `register_buffer`, since it just leads to `register_buffer` being called. The new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.

Fixes #35735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:59:05 +00:00
David Radley
17250976f3 correct empty tensor mps all operation (#105218)
Fixes #104694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105218
Approved by: https://github.com/ezyang, https://github.com/kulinseth
2023-07-14 17:42:54 +00:00
albanD
08cbfb2a58 Avoid tensor creation and use scalar overload (#104264)
I would expect this preserves the behavior but there might be weird edge cases?
@mruberry might know?

The aim is to fix https://github.com/pytorch/pytorch/pull/104254 (and make `1 ** t` capturable via cudagraph)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104264
Approved by: https://github.com/zou3519
2023-07-12 18:11:27 +00:00
Nikita Shulga
5e4ee15e85 [MPS] Fix unique flatten logic (#104938)
Tensor must be flattened if dim is None before checking whether or not dim dimension is already None

Fixes https://github.com/pytorch/pytorch/issues/104879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104938
Approved by: https://github.com/albanD
2023-07-11 19:55:56 +00:00
soulitzer
91dcc3b272 Fix activation checkpoint for mps (#104787)
Fixes https://github.com/pytorch/pytorch/issues/104478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104787
Approved by: https://github.com/albanD
2023-07-08 14:57:05 +00:00
Jerry Zhang
611febf6cf [quant] Support integer implementations for max_pool2d (#104225)
Summary:
This is needed for representing quantized model in pt2 export quantization flow

Test Plan:
tested by opinfo, python test/test_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104225
Approved by: https://github.com/kimishpatel
2023-07-05 23:54:07 +00:00
Nikita Shulga
01e6d64dd2 [MPS] Fix unary ops over sparse-mapped tensors (#100765)
If input tensor is backed by a sparse view, create a dense copy before running unary op, otherwise op will be applied against the wrong elements.
Introduce `is_dense_in_storage` that returns true if tensor/view are mapped to a dense area in the tensor storage.
Add unit test to validate the fix.
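
One plausible way to express such a density check, sketched in plain Python on sizes and strides. This is an assumption for illustration, not necessarily how the PR implements `is_dense_in_storage`; `storage_span` is a hypothetical helper.

```python
from math import prod


def storage_span(sizes, strides):
    """Number of storage slots from the first to the last addressed element, inclusive."""
    if any(n == 0 for n in sizes):
        return 0
    return 1 + sum((n - 1) * st for n, st in zip(sizes, strides))


def is_dense_in_storage(sizes, strides):
    # The view is dense when it addresses exactly as many slots as it has elements.
    return storage_span(sizes, strides) == prod(sizes)


print(is_dense_in_storage((2, 3), (3, 1)))  # contiguous 2x3 tensor
print(is_dense_in_storage((3, 2), (1, 3)))  # its transpose
print(is_dense_in_storage((3,), (2,)))      # every other element of a length-6 vector
```

Under this definition a transposed tensor still counts as dense, because it covers a gap-free region of storage, while a strided slice such as `x[::2]` does not.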

Fixes https://github.com/pytorch/pytorch/issues/98074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100765
Approved by: https://github.com/albanD
2023-07-05 23:17:43 +00:00
Denis Vieriu
28720ad585 Fix argmax and argmin clamp value on MPS (#104374)
Replace the `LLONG_MAX` clamp value with the largest integer value that can be stored in a double. `constantWithScalar` takes a `double` value as input, and `LLONG_MAX` does not fit in a double, resulting in failures on x86.
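
The rounding problem can be reproduced in plain Python, whose `float` is an IEEE 754 double. This sketch only illustrates why `LLONG_MAX` is not representable; it does not show which constant the PR chose.

```python
LLONG_MAX = 2**63 - 1         # largest signed 64-bit integer
as_double = float(LLONG_MAX)  # a double has a 53-bit significand, so this rounds

# The conversion rounds up to 2**63, which no longer fits in a signed 64-bit integer.
rounds_up = as_double == 2.0**63
overflows_int64 = int(as_double) > LLONG_MAX

# Beyond 2**53, consecutive integers stop being distinguishable as doubles.
collides = float(2**53) == float(2**53 + 1)

print(rounds_up, overflows_int64, collides)
```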

Fixes https://github.com/pytorch/pytorch/issues/98191, https://github.com/pytorch/pytorch/issues/92311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104374
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-06-30 18:11:49 +00:00
cyy
54cb61f7d9 enable ASAN on some tests (#103647)
Enable more tests on ASAN. Meanwhile, we disable float-divide-by-zero and float-cast-overflow, because both are also disabled by default in the latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
 but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
2023-06-28 02:17:14 +00:00