Otherwise, invoking with torch.half inputs, but float weights will result in
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
2025-01-03 14:13:18.747151-0800 python[87772:4027380] /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm, line 975: error 'original module failed verification'
/AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification'
```
Test plan: `python -mpytest test/inductor/test_torchinductor.py -k test_nll_loss_backward_mps` should not crash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144170
Approved by: https://github.com/kit1980, https://github.com/Skylion007
ghstack dependencies: #144167, #144162, #144083, #144084
Fixes#141457
As for the tests. I looked in `test/test_mps.py` but I saw that `test_multinomial` function is disabled. Glad to add test where needed if there is some place where multinomial function is tested on metal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141515
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250).
Fixes#141909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095
Approved by: https://github.com/kulinseth
# Motivation
Support `torch.accelerator.synchronize()` on mps. The root cause is that MPS doesn't support lazy initialization. So we must check if the current accelerator supports device lazy initialization rather than early return.
# Additional Context
Add a mps UT to test code change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143171
Approved by: https://github.com/albanD
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:
> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.
Currently the sample inputs for `index_put` generates 2 indices. Because they are generated randomly, they could be the same leading to undefined behaviour if `accumulate=False`.
This PR changes the input generation to only generate a single index if `accumulate=False` preventing duplicate indices and undefined behaviour.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
This allows one to do something like that
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
x[idx] += idx;
}
")
m.foo(x)
```
And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
This allows one to do something like that
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
x[idx] += idx;
}
")
m.foo(x)
```
And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
When the input tensor to Conv3d is in the channels_last_3d memory format the Conv3d op will generate incorrect output (see example image in #141471). This PR checks if the op is 3d, and then attempts to convert the input tensor to contiguous.
Added a regression test that verifies the output by running the same op on the CPU.
I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?
Fixes#141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
Looks like a regression caused by use of strided API, but adding the test revealed (at least in CI), that on Ventura it worked but returned garbage results, so fixed by removing all the logic about channels last (as it's irrelevant for strided API case and placeholder already turns tensor into a correct one)
This also allows one to remove `mem_format_key` and `ns_shape_key` (it was redundant even back then, as `mem_format_key` + `getTensorsStringKey(grad_output_t)` already uniquely identified the operation)
Fixes https://github.com/pytorch/pytorch/issues/140902
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141009
Approved by: https://github.com/manuelcandales
Functionally two decorators are very similar, but one should rely on expectedFailure as much as possible to get signal when something is fixed.
- Move `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happens only on MacOS13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
This reintroduces support for high channel sizes for convs. The guard for macOS versions < 15.1 is still present to prevent reintroducing #129207.
I'm unsure about the specific macOS version support, but I'm assuming this was fixed in 15.1, and I'm relying on signals from ci for verification. I'm expecting the new test will fail for macOS versions < 15.1, and the old test will start failing for > 15.0. I've added xfails for this and extended the version helpers to support 15.1+.
Fixes#140722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140726
Approved by: https://github.com/malfet
This PR adds native implementation of unfold_backward as metal shader, mostly copy-n-paste of algorithms used in CUDA and CPU implementations, i.e. considering `out = in.unfold(dim, size, step)`, then following holds true:
* `out.shape[dim] == (in.shape[dim] - size) / step + 1`
* `out.shape[-1] == size`
* `out.ndim == in.ndim + 1`
`unfold_backward` Metal kernel receives `grad_in` and returns `grad_out` such that:
* `grad_in.shape == out.shape`
* `grad_out.shape == in.shape`
For each index in `grad_out` find the elements contributing to it and sum them up. Such algorithm requires no synchronization between threads.
That is `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` is range [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to (0, `in_dim_size`) and `in_last_idx` are equal `out_dim_idx - in_dim_idx * step` . Accumulation step is skipped if `in_last_idx` is outside of [0, size] range.
This operator has been requested 16 times on https://github.com/pytorch/pytorch/issues/77764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135411
Approved by: https://github.com/manuelcandales
Co-authored-by: Manuel Candales <42380156+manuelcandales@users.noreply.github.com>
Skipped `test_exponential` and `test_multinomial` because simply printing the result of an operator does not constitute a test. The testing framework does not attempt to interpret the output.
Modify `test_print_non_contiguous` to get tensors string representation, which is an equivalent operation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139009
Approved by: https://github.com/Skylion007
This fixes internal crash due to the invalid bufer size computation if sliced API is used
Not sure what was the purpose of
```c++
IntArrayRef baseShape;
if (src.is_view()) {
baseShape = src._base().sizes();
} else {
baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
flattenedShaped *= baseShape[i];
}
```
As flattenShaped could be much easier computed as `[srcBuf
lengh]/src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do.
When someone allocated buffer to hold say uint8 and that view-casted it
to float16, attempt to compute `baseShape` returned sizes of original
tensor in its data type, rather than size in new dtypes
Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
Before this changes, tests for operators like `eye` or `triu_indices` were essentially a test that respective CPU operators are stable, as cpu_sample and mps_sample were the same
Moved the logic to `transform_opinfo_sample_to_mps` whicih in addition to copying tensors is also tweaks `kwargs`
Discovered that:
- `torch.randn` and `torch.randint` fall into the same undefined category
- `torch.logspace` is not implemented for MPS
- Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
- `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
- `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
By even further reducing precisions of imprecise FP16 ops, introducing new BF16_LOW_PRECISION_OPS category and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the nature of low-precision results, is that MPSGraph, unlike the rest of the PyTorch does not do accumulation over fp32 for reduction operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
Before that attempt to run something like
```
% python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))"
```
Resulted in hard error
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
After the change, it raises a nice type error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821, #136822
This was a stupid cast error that caused MPSGraph to crash with the following exception
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821
This PR adds a regression test for the issue reported in #122045. I was not able to reproduce on macOS > 13.
~Expect the first iteration of the tests to fail for macOS 13, but pass for 14 and 15.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136084
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Not sure, why `isinf` is a composite op, but those needs to be implemented by hand.
Implementation is a trivial call to
```objc
[mpsGraph equalWithPrimaryTensor:input
secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity()
dataType:input.dataType]]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689
Approved by: https://github.com/Skylion007
More or less literal copy-n-paste of c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)
and
c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)
Missing `uint8` implementation mimics CUDA behavior
Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk
Later refinements:
- Switch from 2D dispatch to 1D one (to match CUDA behavior)
- Added batch + channel loops
- Fixed scale computation to match align corners behavior
- Added backward implementation
Backward implementation again, mimics CUDA, so it has issues precision issue for `torch.half` as well as a somewhat slow simulation of atomic adds using atomic compare and exchange of the pair of adjacent values, i.e.
```metal
emplate <typename T>
static inline void atomic_add_helper(
device atomic<int>* data,
long offset,
float value) {
auto ptr = data + (offset >> 1);
auto old = atomic_load_explicit(ptr, memory_order_relaxed);
union {
int i;
T t[2];
} val;
do {
val.i = old;
val.t[offset & 1] += static_cast<T>(value);
} while (!atomic_compare_exchange_weak_explicit(
ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed));
}
```
Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123
Approved by: https://github.com/albanD
Fixes#131865. Addresses the issue seen when running llama v3.1 8B parameter model on MPS backend where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert.
Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:
```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```
Notably the current change only works as long as the individual output matrix in the bmm does not exceed the number of elements 2**32. This lets us split up the computation along the batch axis to avoid going over the limit.
Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large to handle for this op until a more general workaround tiling the matmuls is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Which fixes BatchNorm behavior for if called with empty tensors on MPS backed. Removed `expectedFailureMPS` in test_nn.py, deleted expected failure in `test_mps.py` and adjusted `skipIfMPS` to `expectedFailureMPS` in BatchNorm2d OpInfo decorator, but restrict it only to the memory format tests
Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`
Fixes https://github.com/pytorch/pytorch/issues/134423
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
Fixes#35735
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
Partial fix to issue #130295
Moves min and max ops to use the NaN propagating API in MPS to align with the pytorch convention. Adds a regression test to validate the fix achieves parity with cpu backend.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130445
Approved by: https://github.com/malfet
The bug causing the correctness problem will be fixed in future OS release. Root cause of the problem is in a bug in an optimization to MPSGraph reshape operation in MacOS 14_4 that results in a correctness issue with the shapes the LSTM gradient operation has when num_layers > 2.
Solves silentness of issue #125803.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038
Approved by: https://github.com/malfet
0.12.0 Major Updates:
- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support
0.12.1 Updates:
- Fix warning regression during import when launch with strict warning filters
Closes#130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
ghstack dependencies: #130895
0.12.0 Major Updates:
- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support
0.12.1 Updates:
- Fix warning regression during import when launch with strict warning filters
Closes#130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
Workaround bug in `reductionAndWithTensor:` that kills app with the
following assert if 5+D tensor as an input
```
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76.
```
by reshaping the tensor to 2D/3D one before running the reduction.
Refactored common code into `all_any_common_impl_mps` as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue
Enabled `test_reduction_ops_5D` and added regression test to it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #130541
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.
Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
Summary:
1. Fixed#130201 by adding type promotion.
2. Added proper tests.
3. Found torch's type promotion is different from numpy as follows:
```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype # torch.float32
```
~Not sure the proper way to handle it, it causes numpy ref tests to fail.~
Reason here, so think I'm gonna xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.
Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed).
* Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`:
* Uncovered a bunch of test issues:
* Test breakdown (>100 total)
* A lot of tolerance issues (tweaked tolerance values to fix)
* 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
* 3 actually broken semantics (for masked tensor; added xfails)
* 4 Jacobian mismatches (added xfails)
* 2 nan results (skip for now, need fixing)
* 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.
**before (no seed setting):**
```
real 0m21.306s
user 0m19.053s
sys 0m5.192s
```
**after (with seed setting):**
```
real 0m21.905s
user 0m19.578s
sys 0m5.390s
```
* Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command.
```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
return test(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
self.assertFalse(True)
AssertionError: True is not false
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
method(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
method(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
result = test(self, **param_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
fn(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')
To execute this test, run the following from the base repo dir:
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.037s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value.
The fix for the missing support is coming in later as that requires work on the kernel side so it'll take some more time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484
Approved by: https://github.com/kulinseth
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at d75cde1</samp>
Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269
Approved by: https://github.com/kulinseth, https://github.com/lezcano
Get's rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if base.storage().data_ptr() != other.storage().data_ptr():
```
(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )
Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfile` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following
```cpp
if (getMPSProfiler().profiler.isCaptureEnabled()) {
getMPSProfiler().startCapture(__func__, mpsStream);
}
...
if (getMPSProfiler().isCapturing()) {
getMPSProfiler().stopCapture(mpsStream);
}
```
that, if invoked with `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files, in the current working directory, which can later be loaded and used to debug or profiler the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">
- Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg`
TODOs:
- Add weight pack
- Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
`cumsum` and `cumprod` was (is?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)
A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.
It turns out this issue also affects boolean types:
```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```
In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
Fixes#96614#106112#109166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
Validate that all arguments are on MPS devices and dtypes are expected
Fixes cryptic messages like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))"
RuntimeError: Placeholder storage has not been allocated on MPS device!
```
And hard crashes like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))"
```
Fixes https://github.com/pytorch/pytorch/issues/123995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952
Approved by: https://github.com/Skylion007