When the value of `fill` is of complex, this line `value.toDouble() == 0.0` will error out saying that converting complex to double will cause overflow. So we should firstly handle the complex value and then enter this condition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
Fixes#86805
Adds support for sgn to MPS backend.
Notes:
1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.
2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet recent contributions, however, @Berzeg 's implementation works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for mps, so that the backward function works.
3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the pr (he seems to be AWOL since last working on this)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
The sample inputs is a bit involved because there are a lot of
shenanigans in the derivative formula. Check comments.
This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
Surrounds `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a noop for signle threaded program, but serializes calls to the synchronize across the threads using the same stream.
Prevent `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` non-recoverable exception, which is triggered every time one is using PyCharm to inspect tensors on MPS device
Fixes https://github.com/pytorch/pytorch/issues/100285
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 1662ce2</samp>
> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
Fixes issue mentioned in #77764
e.g. https://github.com/pytorch/pytorch/issues/77764#issuecomment-1654111744
Adds MPS support for the following ops:
- lgamma
- mvlgamma
- digamma
- polygamma
The lgamma fucntion does not yet have an MPS backend implementation. I've added one using a custom metal kernel (following John D. Cook's c++ implementation of the log gamma function: https://www.johndcook.com/blog/cpp_gamma/). For the backward pass op, I've added a digamma kernel that follows the cpu+cuda digamma implementation, and for the backward pass of the digamma op, I've added a polygamma + trigamma kernel following, again, the cpu+cuda implementations.
NOTE:
The cpu implementation of the polygamma function incorrectly (as far as I can tell) outputs a finite number for order = 1 and x in the negative integers. The mps implementation correctly outputs infinite. (see https://github.com/pytorch/pytorch/issues/106692)
The polygamma tests currently don't pass because of the error in the cpu+cuda kernels, but also because there are smallish discrepancies near the negative integers between the cpu+cuda and the mps polygamma and trigamma kernels. I'm not sure exactly why this is, but let me know if the discrepancies are too big.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106292
Approved by: https://github.com/kulinseth
Fixes#83196
Now, MPS implementation is blazingly fast.
Though, I have several questions on improving this PR:
1. I copied code from `test_nn.py`. Is there better way to test this?
2. I decided to use `usepixelshuffleorder:YES`. Am I right performance-wise? According to docs:
```
`usePixelShuffleOrder` can be
used to control how the data within spatial blocks is ordered in the
`depthAxis` dimension: with `usePixelShuffleOrder=YES` the values within the
spatial blocks are stored contiguosly within the `depthAxis` dimension whereas
otherwise they are stored interleaved with existing values in the `depthAxis` dimension.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99306
Approved by: https://github.com/kulinseth, https://github.com/malfet
By refactoring `_local_scalar_dense_mps` to use `_empty_like` to allocate CPU tensor.
Also, print a more reasonable error message when dst dim is less than src in mps_copy_
This fixes regression introduced by https://github.com/pytorch/pytorch/pull/105617 and adds regression test.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at abd06e6</samp>
> _Sing, O Muse, of the valiant deeds of the PyTorch developers_
> _Who strive to improve the performance and usability of tensors_
> _And who, with skill and wisdom, fixed a bug in the MPS backend_
> _That caused confusion and dismay to many a user of `item()`_
Fixes https://github.com/pytorch/pytorch/issues/107867
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107913
Approved by: https://github.com/albanD
Use `view_as_real` to cast complex into a pair of floats and then it becomes just another binary operator.
Enable `polar` and `view_as_complex` consistency tests, but skip `test_output_grad_match_polar_cpu` as `mul` operator is yet not supported
Remove redundant `#ifdef __OBJC__` and capture and re-throw exceptions captured during `createCacheBlock` block.
Fixes https://github.com/pytorch/pytorch/issues/78503
TODOs(in followup PRs):
- Implement backwards (requires complex mul and sgn)
- Measure the perf impact of computing the strides on the fly rather than ahead of time (unrelated to this PR)
Partially addresses https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107324
Approved by: https://github.com/albanD
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
- Implement `MPSEventPool` to recycle events.
- Implement python bindings with `torch.mps.Event` class using the MPSEventPool backend. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()`.
- Add API to measure elapsed time between two event recordings.
- Added documentation for Event class to `mps.rst`.
- Added test case to `test_mps.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
Related to #77764
Add support for the cumprod operation (which in turn allows its gradient). This also allows us to compute the gradient of prod since it was blocked behind cumprod in the case where exactly one element of the tensor was 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104688
Approved by: https://github.com/kulinseth
I've added the implementation of erfinv using the algorithm from 4154c8ea15/aten/src/ATen/native/Math.h (L152) in order for the MPS based algorithm to match the CPU automatic test. This PR is using the new metal api calls from https://github.com/pytorch/pytorch/pull/100661
Testing shows MPS has a decent speed up (270x) compared to CPU on tensor size of 100 mil elements.
```
import torch
x = torch.arange(-1, 1, 1e-8) # default cpu tensor
#measure CPU compute time by calling torch.erfinv
time = %timeit -o -q -r 5 torch.erfinv(x)
cpu_time = time.average
print("CPU torch.erfinv time: ", cpu_time)
x = x.to("mps")
# measure MPS compute time
time = %timeit -o -q -r 5 torch.erfinv(x)
mps_time = time.average
print("MPS torch.erfinv time: ", mps_time)
print(f"MPS torch.erfinv is {cpu_time/mps_time*100} percent faster than CPU torch.erfinv")
# compute MSE between MPS and CPU torch.erfinv
x = x.to("cpu")
y_cpu = torch.erfinv(x)
x = x.to("mps")
y_mps = torch.erfinv(x)
y_mps = y_mps.to("cpu")
mask = torch.isfinite(y_cpu) & torch.isfinite(y_mps.to("cpu"))
y_mps = y_mps[mask]
y_cpu = y_cpu[mask]
x = x[mask]
print(f"length of y_mps: {len(y_mps)}, length of y_cpu: {len(y_cpu)}, length of x: {len(x)}")
mse = torch.square(y_cpu - y_mps).mean()
print("MSE between MPS and CPU torch.erfinv: ", mse)
diff = torch.abs(y_cpu - y_mps)
print("Largest difference")
print(f"x: {x[torch.argmax(diff)]}, y_cpu: {y_cpu[torch.argmax(diff)]}, y_mps: {y_mps[torch.argmax(diff)]} , diff = {y_cpu[torch.argmax(diff)] - y_mps[torch.argmax(diff)]}")
```
CPU torch.erfinv time: 2.654937833400254
MPS torch.erfinv time: 0.009831255332002912
MPS torch.erfinv is 27005.07456822776 percent faster than CPU torch.erfinv
length of y_mps: 199999992, length of y_cpu: 199999992, length of x: 199999992
MSE between MPS and CPU torch.erfinv: tensor(4.2339e-14)
Largest difference
x: -0.9999980330467224, y_cpu: -3.363569736480713, y_mps: -3.3635685443878174 , diff = -1.1920928955078125e-06
Fixes #https://github.com/pytorch/pytorch/issues/86808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101507
Approved by: https://github.com/kulinseth
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the `Buffer` type can be used as a drop in replacement for `register_buffer` as it just leads to `register_buffer` being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
Fixes#35735
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
Enabling more tests on ASAN, meanwhile we disable float-divide-by-zero and float-cast-overflow, both are disabled because they are also disabled by default in latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 8f8d620</samp>
This pull request improves the testing of the `nn.functional.multi_head_attention_forward` function by adding it to the `OpInfo` framework, adjusting the tolerance and skipping criteria for some test cases, and restricting the dtype for the `MetaProgrammingSystem` tests. These changes aim to address the randomness and numerical precision issues of the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100153
Approved by: https://github.com/drisspg
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 069fd23</samp>
This pull request enhances the MPS implementation of random operations in `Distributions.mm` and adds more dtype tests for the bernoulli distribution in `test_mps.py`. This improves the performance, correctness, and usability of the MPS backend for PyTorch.
Fixes https://github.com/pytorch/pytorch/issues/100717
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100946
Approved by: https://github.com/kulinseth
In CI older MacOS SDK can be used to compile the binary, so add guard for availability of `MPSGraphResizeNearestRoundingModeRoundToEven` enum value.
MPS feature availability checks are deliberately done at runtime (by using `is_macos_13_or_newer` and forward-declaring methods in `MPSGraphVenturaOps.h`) rather than at compile time (by using `#ifdef`s).
Modify error message and XFAIL condition in `test_mps.py` to fail test due to missing conditional on macOS-13.2 or newer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101108
Approved by: https://github.com/kulinseth
Prevent using parallel computing when deterministic algorithm is set.
Fixes#97574
Benchmark:
```
[--------------- index_put_ Deterministic Algorithm Enabled ---------------]
| cpu | mps
1 threads: -----------------------------------------------------------------
Dtype: torch.float32 Features: 1024; Num Indices: 512 | 37 | 49
Dtype: torch.float32 Features: 1024; Num Indices: 1024 | 54 | 50
Dtype: torch.float32 Features: 1024; Num Indices: 2048 | 86 | 50
Dtype: torch.float32 Features: 1024; Num Indices: 4096 | 150 | 49
Times are in microseconds (us).
[-------------- index_put_ Deterministic Algorithm Disabled ---------------]
| cpu | mps
1 threads: -----------------------------------------------------------------
DType: torch.float32 Features: 1024; Num Indices: 512 | 37 | 49
DType: torch.float32 Features: 1024; Num Indices: 1024 | 53 | 49
DType: torch.float32 Features: 1024; Num Indices: 2048 | 86 | 49
DType: torch.float32 Features: 1024; Num Indices: 4096 | 147 | 50
Times are in microseconds (us).
```
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at ebf2ff3</samp>
Added a deterministic version of `index_put` for MPS tensors that runs on a single thread and can be enabled by a global context flag. Refactored the existing `index_put` function and the kernel selection logic to support both parallel and serial modes. Added a test function to verify the deterministic behavior of `index_put` under different conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97660
Approved by: https://github.com/kulinseth
Previously when using `self.assertRaisesRegex` to test raised exception and its regex, the regex wasn't actually compared because mps was not in the `NATIVE_DEVICES`. This PR fixes that by enabling exception regex comparisons for mps device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100367
Approved by: https://github.com/albanD
Summary: Add new experimental python op (`torch.nonzero_static`) for export. There is NO cuda impl included in this PR
Example:
Say input tensor is `x = torch.tensor([[1, 0], [3, 2]])`
call regular `nonzero()` on x will give you a tensor `tensor([[0, 0], [1, 0], [1, 1])`
call `nonzero_static(x, size=4)` on x will give you a tensor `tensor([[0, 0], [1, 0], [1, 1], [fill_value, fill_value])` (padded)
call `nonzero_static(x, size=2)` on x will give you a tensor `tensor([[0, 0], [1, 0])` (truncated)
Test Plan:
**Unit Tests**
```
buck test @mode/dev-nosan //caffe2/test:test_dynamo -- 'caffe2/test:test_dynamo - test_export.py::ExportTests::test_export_with_nonzero_static' -- 'caffe2/test:test_dynamo - test_misc.py::MiscTests::test_nonzero_static'
```
**PT2 Export with `nonzero_static()`**
Example of `GraphModule` in the exported graph
```
def forward(self, x):
arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
nonzero_static_default = torch.ops.aten.nonzero_static.default(arg0, size = 4); arg0 = None
return pytree.tree_unflatten([nonzero_static_default], self._out_spec)
```
Differential Revision: D44324808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97417
Approved by: https://github.com/ezyang
- Only ReflectPad needs the torch checks for input arguments and not the ReplicatePad
- Added a test case
- The failure was originally found in test_modules with test `test_forward_nn_ReplicationPad3d_mps_float32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96988
Approved by: https://github.com/DenisVieriu97
The native implementation of LSTM has been fixed on macOS 13.
On macOS 12, the multi-layer LSTM still has a numerical correctness issue that cannot be resolved on OS's side.
Thus, we fall back the multi-layer LSTM on macOS 12 to LSTMCell iteration. It might have performance impact but will make LSTM on macOS 12 fully usable.
Fixes: #90421
Issues related: #80306, #83144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90909
Approved by: https://github.com/albanD, https://github.com/kulinseth
Fixes crash while running something like `python -c "import torch;x=torch.rand(3, 3, dtype=torch.float16, device='mps');y=x.addcmul(torch.ones(3, device='mps'), torch.ones(3, device='mps'));print(y)"`
Modify `castMPSTensor` to become a no-op if cast is not needed
Define `common_dtype` as `c10::promoType` between self, tensor1 and
tensor2. Cast to any output type.
Add mixed-types test to `TestMPS.test_addcmul`, though it does not cover
all the permutations
Discovered while looking at https://github.com/pytorch/pytorch/issues/96113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96164
Approved by: https://github.com/kulinseth
I.e. attempt to create tensor of all possible types and make sure that
it raises a structured error for non-MPS types
Also, rename `test_resize_as_all_dtypes_and_devices` to `test_resize_as_mps_dtypes` and `test_resize_all_dtypes_and_devices` to `test_resize_mps_dtypes` and run both test for all MPS dtypes (rather than just bool, float16 and bfloat16 as they were running before)
Fixes https://github.com/pytorch/pytorch/issues/95976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95982
Approved by: https://github.com/kulinseth
MPS in macOS13.3 has added support for int64 in reduction ops / cumsum / sort / argsort. This change removes the hard-coded casts and error messages prior macOS 13.3, allowing the op to run natively with int64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95817
Approved by: https://github.com/kulinseth
Reuse the cpu implementation here as currently there is no native roll implementation from the MPS api (if any, please let me know).
Compared to falling back to cpu using `PYTORCH_ENABLE_MPS_FALLBACK=1`, this way we keep tensors on MPS.
Did a small benchmark:
```python
for num in [10, 100, 1000, 10000]:
for shft in [1, 5]:
sz = num * num
x = torch.arange(sz, device="cpu").view(num, num)
s = time.time()
r = torch.roll(x, shft)
cpu_e = time.time() - s
x = torch.arange(sz, device="mps").view(num, num)
s = time.time()
r = torch.roll(x, shft)
mps_e = time.time() - s
print(f"size: ({num}, {num}) shft: {shft} cpu: {cpu_e} mps: {mps_e}")
```
```
size: (10, 10) shft: 1 cpu: 0.00015163421630859375 mps: 0.003078937530517578
size: (10, 10) shft: 5 cpu: 6.794929504394531e-05 mps: 0.0014979839324951172
size: (100, 100) shft: 1 cpu: 0.0001621246337890625 mps: 0.0016200542449951172
size: (100, 100) shft: 5 cpu: 0.00016379356384277344 mps: 0.00154876708984375
size: (1000, 1000) shft: 1 cpu: 0.0022068023681640625 mps: 0.0017690658569335938
size: (1000, 1000) shft: 5 cpu: 0.009071111679077148 mps: 0.0020020008087158203
size: (10000, 10000) shft: 1 cpu: 0.16785407066345215 mps: 0.011695146560668945
size: (10000, 10000) shft: 5 cpu: 0.1160881519317627 mps: 0.011452913284301758
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95168
Approved by: https://github.com/albanD
Add `mps_ops_modifier` function that adds `unittest.expectedFailure` decorators to the operators that supposed to fail on MPS.
This allows one to know whether or not operation will fail, rather than skip it.
For example:
```
% python test_mps.py -v -k test_output_match_dot
test_output_match_dot_cpu_float32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int16 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int64 (__main__.TestConsistencyCPU) ... expected failure
test_output_match_dot_cpu_uint8 (__main__.TestConsistencyCPU) ... ok
----------------------------------------------------------------------
Ran 5 tests in 0.175s
OK (expected failures=1)
```
Moved a few functions from blocklist to xfail, and find out that some of the functions in the list actually work, for example `torch.long`.
Also, allow `None` to be used in `ALLOWLIST` instead of specifying all types explicitly (which aligns with `DecorateInfo` semantic)
Eventually, we should get rid of `ALLOWLIST` (i.e. all ops are allowed), keep small `BLOCKLIST` and move the rest to `XFAILLIST`
Add step to print HW/SW info before running MPS tests.
Fix type promotion in `trace_mps_out`
Introduce `MACOS_12_X_XFAILLIST` and skip almost every function for `torch.uint8`, although some of those doesn't make much sense and feels like a regression from PyTorch-1.13
Re-enabled MPS testing on MacOS 12, as runners seems to be available again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95045
Approved by: https://github.com/albanD
Fixes#91694Fixes#92615
Several transpositions were missing for backward graph in case of `batch_first=True`. The #91694 is not reproduced with `batch_first=False`.
After fixing transpose issue, I finally thought that now I can use LSTM freely in my project. And then I got horrific results on train. Seems related to #92615.
After that I decided to fix LSTM's backward step completely. I collected all my findings in this thread — seems like I succeeded
Funny enough, backward tests were completely disabled before and were not passing:
```python
@unittest.skipIf(True, "Backward of lstm returns wrong result")
def test_lstm_2(self, device="mps", dtype=torch.float32):
```
UPD: forward pass of multi-layer version also was wrong due to the incorrect `initState, initCell` slices. Tests were passing because states were inited with zeros. *Accidentally* fixed this too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95137
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/soulitzer
Fixes#94390
Apart from fixing the issue above, this PR also fixes a bug that when an input tensor can be sliced, a sliced array view is created. This array view seems to be not writable or have a different storage from the original tensor, causing incorrect results with the in-place `fill`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95113
Approved by: https://github.com/kulinseth
Previously, the "can slice" flag in Placeholder constructor in `OperationUtils.mm` is conditioned on whether the numbers of dimensions of base shape and view shape are the same. This doesn't consider the situation that a view tensor could be the base tensor's sliced and then unsqueezed version, resulting in different num of dims.
For example, if we want to stack `y_mps` and `x_mps` on the last dim:
```
t_mps = torch.tensor([1, 2, 3, 4], device="mps")
x_mps = t_mps[2:] # [3, 4]
y_mps = t_mps[:2] # [1, 2]
res_mps = torch.stack((y_mps, x_mps), dim=-1)
```
the kernel will unsqueeze both of them on the last dim and then concatenate them, which is equivalent to:
```
res_mps = torch.cat((y_mps.unsqueeze(-1), x_mps.unsqueeze(-1)), dim=-1)
```
`x_mps.unsqueeze(-1)` is an unsqueezed and contiguous tensor with a storage offset, this kind of tensors should be sliceable without cloning its storage.
Fixes#87856Fixes#91065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91071
Approved by: https://github.com/kulinseth
Fixes backward pass for bilinear.
Summary of changes:
- bilinear op is able to produce **contiguous, non-view** tensors with a storage offset, such as: shape=`[1, 1, 1, 1]`, `storage_offset=12`. This seems a weird case, but it is valid, and for these type of tensors we wouldn't be able to gather/scatter since we look at the view flag (which is not set here). This change looks into `storage_offset` only rather than the is_view flag which is not being set
- **reduction sum** must return a zeroed out output if passing an input with 0 elements (e.g a shape of (0, 5)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94892
Approved by: https://github.com/kulinseth
- Backward pass has to give explicit bias tensor of zeros if none is passed to the op or the bias gradient will not be calculated.
- Fixed bias tensor mistakenly getting overwritten to zeros
- Fixes crash when lstm op called with has_biases set to false. Change takes into account the changed shape of the input params TensorList depending on the bias flag.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94889
Approved by: https://github.com/DenisVieriu97
- To check for Memory Leaks in `test_mps.py`, set the env-variable `PYTORCH_TEST_MPS_MEM_LEAK_CHECK=1` when running test_mps.py (used CUDA code as reference).
- Added support for the following new python interfaces in MPS module:
`torch.mps.[empty_cache(), set_per_process_memory_fraction(), current_allocated_memory(), driver_allocated_memory()]`
- Renamed `_is_mps_on_macos_13_or_newer()` to `_mps_is_on_macos_13_or_newer()`, and `_is_mps_available()` to `_mps_is_available()` to be consistent in naming with prefix `_mps`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94646
Approved by: https://github.com/malfet
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
`torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`
Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
Fixes#87219
Implements new ``repeat_interleave`` function into ``aten/src/ATen/native/mps/operations/Repeat.mm``
Adds it to ``aten/src/ATen/native/native_functions.yaml``
Adds new test ``test_repeat_interleave`` to ``test/test_mps/py``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88649
Approved by: https://github.com/kulinseth
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
`torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`
Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
Summary:
- Remove redundant bool casts from scatter/gather
- Make the workarounds for scatter/gather (for bool/uint8 data types) OS specific - use them only in macOS Monterey, ignore them starting with macOS Ventura
- Make all tensors ranked in scatter
Fixes following tests:
```
test_output_match_slice_scatter_cpu_bool
test_output_match_select_scatter_cpu_bool
test_output_match_diagonal_scatter_cpu_bool
test_output_match_repeat_cpu_bool
test_output_match_rot90_cpu_bool
etc..
```
Still failing on macOS Monterey (needs additional investigation):
```
test_output_match_scatter_cpu_bool
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94464
Approved by: https://github.com/kulinseth
- Also fix FP16 correctness issues in several other ops by lowering their FP16 precision in the new list `FP16_LOW_PRECISION_LIST`.
- Add atol/rtol to the `AssertEqual()` of Gradient tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94567
Approved by: https://github.com/kulinseth
Fixes batchnorm forward/backward pass and layer_norm:
Batchnorm Forward pass:
```
- fix batch_norm_mps_out key
- return 1/sqrt(var+epsilon) instead of var
- return empty tensor for mean and var if train is not enabled
- remove native_batch_norm from block list
```
Batchnorm Backward pass:
```
- add revert caculation for save_var used in backward path
- add backward test for native_batch_norm and _native_batch_norm_legit
```
Layer norm:
```
- remove the duplicate calculation from layer_norm_mps
- enable native_layer_norm backward test
- raise atol rtol for native_layer_norm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94351
Approved by: https://github.com/razarmehr
Calculate nonzero count directly in the nonzero op.
Additionally, synchronize before entering nonzero op to make sure all previous operations finished (output shape is allocated based on the count_nonzero count)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94442
Approved by: https://github.com/kulinseth
- fix num_output_dims calculation
- fix median_out_mps key
- cast tensor sent to sortWithTensor and argSortWithTensor
- note down same issue for unique
- unblock median from blocklist
- adding test_median_int16 test
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94489
Approved by: https://github.com/razarmehr
- Fix wrong results in AvgPool2D when `count_include_pad=True`
- Fix issues with adaptive average and max pool2d
- Remove the redundant blocking copies from `AdaptiveMaxPool2d`
- Add `divisor` to cached string key to avoid conflicts
- Add test case when both `ceil_mode` and `count_include_pad` are True (previously failed).
- Clean up redundant code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94348
Approved by: https://github.com/kulinseth