Sun, Jiayi
c173a9d9b3
add Half support for layer_norm on CPU ( #99590 )
...
### Testing
Single socket (icx, 32 cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8, 8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |
Single core (icx):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8, 8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |
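For reference, a minimal usage sketch of what this enables (not from the PR; my reading of the benchmark labels is that the "mixed" columns mean fp16 input with fp32 weight/bias):
```python
# Hedged sketch: fp16 and mixed fp16/fp32 layer_norm on CPU.
import torch
import torch.nn.functional as F

x = torch.randn(32, 8, 16, dtype=torch.half)            # fp16 input
w = torch.randn(16, dtype=torch.half)
b = torch.randn(16, dtype=torch.half)

y_fp16 = F.layer_norm(x, (16,), w, b)                   # pure fp16 path
y_mixed = F.layer_norm(x, (16,), w.float(), b.float())  # mixed fp16 input / fp32 params
```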
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima , https://github.com/jgong5 , https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
Nikita Shulga
9dda4b20a0
[MPS] Enable select/[broad]cast ops for complex dtypes ( #115727 )
...
By representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` Metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument: a fully specialized `cast` function that is a no-op for regular types but special-cased for float->complex and complex->float conversions.
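As a rough illustration (my own sketch, not from the PR), the gather/scatter paths back slicing and copying of complex tensors:
```python
# Hedged sketch: select (gather) and copy_ (scatter/cast) with complex dtypes on MPS.
import torch

if torch.backends.mps.is_available():
    z = torch.tensor([1 + 1j, 2 + 2j, 3 + 3j], dtype=torch.cfloat, device="mps")
    sliced = z[1:]                                 # gather on a complex dtype
    out = torch.empty(2, dtype=torch.cfloat, device="mps")
    out.copy_(sliced)                              # scatter on a complex dtype
    out.copy_(torch.ones(2, device="mps"))         # exercises the float -> complex cast
```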
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
2023-12-19 02:25:28 +00:00
Peter Pham
74dfdc567b
[MPS] aten::erfinv bug fix: add storage offset buffers to handle slicing ( #105801 )
...
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706
The following test would fail without this bug fix:
```
import torch

def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()
        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()
        print(y)
        print(y2)
        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
2023-12-15 23:14:03 +00:00
Lucas Steuernagel
2e517b20d9
[MPS] Add Conv3D support for MPS ( #114183 )
...
Fixes #77818
I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul's work. Quoting the description here:
> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and does not treat depth-separable convolutions as a normal case (there are no MPS kernels available for either of those so far)
> * requires macOS >= 13.2 (Ventura)
Please, let me know if there are any other changes needed and I'll be happy to implement them.
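A minimal usage sketch (my own example; assumes an Apple Silicon machine running macOS >= 13.2):
```python
# Hedged sketch: Conv3d forward/backward on the MPS backend.
import torch

if torch.backends.mps.is_available():
    conv = torch.nn.Conv3d(3, 8, kernel_size=3, padding=1).to("mps")
    x = torch.randn(1, 3, 8, 16, 16, device="mps", requires_grad=True)  # N, C, D, H, W
    y = conv(x)
    y.sum().backward()           # backward pass is supported as well
    print(y.shape)               # torch.Size([1, 8, 8, 16, 16])
```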
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
2023-12-15 23:05:01 +00:00
mingfeima
a8acd6c410
Add Half support for AvgPool2d on CPU ( #109578 )
...
Add Half support for AvgPool2d (both channels last and channels first) on CPU
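A small sketch of what this covers (my own example, not from the PR):
```python
# Hedged sketch: fp16 AvgPool2d on CPU, channels first and channels last.
import torch

x = torch.randn(2, 3, 8, 8, dtype=torch.half)
pool = torch.nn.AvgPool2d(kernel_size=2)
y_cf = pool(x)                                        # channels first
y_cl = pool(x.to(memory_format=torch.channels_last))  # channels last
torch.testing.assert_close(y_cf, y_cl)
```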
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima , https://github.com/albanD
2023-12-12 12:59:47 +00:00
igm503
f017a1af3f
[MPS] add complex_out to MPS backend ( #110851 )
...
Adds support for at::complex_out to the MPS backend
Implemented as a binary kernel using the view_as_real pattern for handling complex dtypes in the MPS backend.
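For context, a hedged example of the op this adds (my own sketch):
```python
# Hedged sketch: torch.complex on the MPS backend.
import torch

if torch.backends.mps.is_available():
    real = torch.tensor([1.0, 2.0], device="mps")
    imag = torch.tensor([3.0, 4.0], device="mps")
    z = torch.complex(real, imag)    # tensor([1.+3.j, 2.+4.j], device='mps:0')
```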
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110851
Approved by: https://github.com/kulinseth
2023-12-11 13:37:55 +00:00
Li-Huai (Allan) Lin
38e1440bae
[MPS] Remove redundant topk test and move all pad tests inside a class ( #113313 )
...
Summary:
1. The removed `topk` test is essentially the same as the following test, so I removed it:
```python
def test_topk(self):
    def helper(shape):
        cpu_x = torch.randn(shape, device='cpu', dtype=torch.float, requires_grad=False)
        x = cpu_x.detach().clone().to('mps')
        for largest_val in [True, False]:
            if (type(shape) == tuple):
                for curr_dim in range(0, len(shape)):
                    dim_size = shape[curr_dim]
                    for k in range(1, dim_size + 1):
                        topk_values, topk_indices = torch.topk(x, k, dim=curr_dim, largest=largest_val)
                        topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=curr_dim, largest=largest_val)
                        self.assertEqual(topk_values, topk_values_cpu)
                        self.assertEqual(topk_indices, topk_indices_cpu)
            else:
                for k in range(1, shape):
                    topk_values, topk_indices = torch.topk(x, k, dim=0, largest=largest_val)
                    topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=0, largest=largest_val)
                    self.assertEqual(topk_values, topk_values_cpu)
                    self.assertEqual(topk_indices, topk_indices_cpu)

    helper(2)
    helper((5, 1))
    helper((1, 5))
    helper((5, 9, 7, 4))
    helper((50, 20, 7, 4))
```
297c26bb8e/test/test_mps.py (L8054-L8091)
2. Move all pad tests to one standalone class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113313
Approved by: https://github.com/kulinseth
ghstack dependencies: #113312
2023-12-01 06:52:07 +00:00
Li-Huai (Allan) Lin
88a659e752
[MPS] Move non-nll loss tests outside TestNLLLoss ( #113312 )
...
The diff looks messy, but this PR essentially does one thing: move the non-NLL loss tests in the `TestNLLLoss` class to the `TestMPS` class. After doing so, there end up being two stack tests with the same name, `test_stack`; therefore, I renamed one of them to `test_stack_storage_offset`, which is what the test actually does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113312
Approved by: https://github.com/kulinseth
2023-12-01 06:52:07 +00:00
Nikita Shulga
1b27eae65e
[MPS] Fix out-of-bounds fill to sliced tensor ( #114838 )
...
This fixes a regression introduced by https://github.com/pytorch/pytorch/pull/81951 that caused an out-of-bounds access when a sliced tensor is filled with zeros.
Remove the bogus `TORCH_INTERNAL_ASSERT(length >= offset)`, as the [NSMakeRange](https://developer.apple.com/documentation/foundation/1417188-nsmakerange?language=objc ) arguments are a location and a length rather than start and end offsets.
In `fill_mps_tensor_`:
- Pass `value` argument to `MPSStream::fill`
- Pass `self.nbytes()` rather than `self.storage().nbytes()` as the length of the buffer to fill, as the latter always results in an out-of-bounds write when the offset within the storage is non-zero
Add regression test
Fixes https://github.com/pytorch/pytorch/issues/114692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114838
Approved by: https://github.com/atalman , https://github.com/kulinseth
2023-12-01 06:24:42 +00:00
Khushi Agrawal
cff84871ce
[reland][opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops ( #114589 )
...
Previous PR: #113885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114589
Approved by: https://github.com/lezcano
2023-11-27 14:45:44 +00:00
PyTorch MergeBot
150aaf46ca
Revert "[opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops ( #113885 )"
...
This reverts commit 4fa1ff8404 .
Reverted https://github.com/pytorch/pytorch/pull/113885 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but its TestCommonCUDA::test_compare_cpu_nn_functional_conv3d test is failing in trunk 4fa1ff8404 ([comment](https://github.com/pytorch/pytorch/pull/113885#issuecomment-1827268473 ))
2023-11-27 07:33:00 +00:00
Khushi Agrawal
4fa1ff8404
[opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops ( #113885 )
...
Previous PR: https://github.com/pytorch/pytorch/pull/85202
Also, cc'ing @lezcano @kshitij12345 @zou3519, who reviewed my previous PR. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113885
Approved by: https://github.com/lezcano
2023-11-26 13:44:30 +00:00
Nikita Shulga
324cde59b2
[MPS] Fix test_copy_cast_no_leak ( #114313 )
...
When running on macOS 13.2, the test always fails on the first run but succeeds on the second, presumably because some memory is reserved to cache the f32->f16 graph. Make the test resilient against such failures by adding a warmup step in which one conversion is performed before recording driver memory utilization.
Fixes https://github.com/pytorch/pytorch/issues/114305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114313
Approved by: https://github.com/huydhn
2023-11-22 14:48:24 +00:00
Nikita Shulga
b5dd37f23e
[MPS] Fix memory leak in copy_from_mps_ ( #114197 )
...
By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
Leak was introduced by https://github.com/pytorch/pytorch/pull/84928
Add regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL
======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
method(*args, **kwargs)
File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory
To execute this test, run the following from the base repo dir:
python test/test_mps.py -k test_copy_cast_no_leak
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 1.102s
FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s
OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s
OK
...
```
Fixes https://github.com/pytorch/pytorch/issues/114096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114197
Approved by: https://github.com/kit1980
2023-11-21 14:52:55 +00:00
Li-Huai (Allan) Lin
538114db65
[MPS] Fix and refactor unary/binary ops with non-zero offset or non-contiguous output ( #97085 )
...
Fixes #100764
This PR fixes the unary ops implementation and refactors the binary ops implementation a bit.
For unary ops:
Previously we didn't take into account unary ops that have a non-contiguous/storage-offset output, causing incorrect results (because the MPS graph kernel always writes the buffer contiguously). Therefore, this PR first creates a temporary output tensor for the graph and then copies the result back to the original output tensor. We currently do not have a better fix than this, I think.
For binary ops, see https://github.com/pytorch/pytorch/pull/97085#discussion_r1140999125
See the added test for repro.
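A hedged sketch of the failure mode being fixed (illustrative, not the PR's actual test):
```python
# Hedged repro sketch: a unary op writing to a storage-offset, non-contiguous output.
import torch

if torch.backends.mps.is_available():
    x = torch.randn(4, device="mps")
    buf = torch.zeros(9, device="mps")
    out = buf[1:][::2]                 # storage offset 1, stride 2
    torch.exp(x, out=out)              # previously the kernel wrote the buffer contiguously
    torch.testing.assert_close(out.cpu(), torch.exp(x.cpu()))
```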
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97085
Approved by: https://github.com/malfet
2023-11-14 22:03:21 +00:00
Nikita Shulga
265d6aac0b
[MPS] Fix crashes during Conv backward pass ( #113398 )
...
By adding the weights tensor to the MPSGraph cache key.
Add a regression test to validate that the collision no longer happens.
Fixes https://github.com/pytorch/pytorch/issues/112998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113398
Approved by: https://github.com/kulinseth
2023-11-10 04:29:33 +00:00
Li-Huai (Allan) Lin
740137df6f
[MPS] Add bucketize op ( #112830 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112830
Approved by: https://github.com/kulinseth , https://github.com/malfet
ghstack dependencies: #112829
2023-11-07 17:22:08 +00:00
Li-Huai (Allan) Lin
c4bb77323d
[MPS] Add searchsorted op ( #112829 )
...
The Metal kernels implemented here closely follow `Bucketization.cu`.
Benchmark:
```
[----------------------------- searchsorted ----------------------------]
| cpu | mps
1 threads: --------------------------------------------------------------
Batch size: 8; In features: 64; Sorter: True | 44 | 530
Batch size: 8; In features: 64; Sorter: False | 31 | 12
Batch size: 8; In features: 256; Sorter: True | 131 | 520
Batch size: 8; In features: 256; Sorter: False | 107 | 12
Batch size: 8; In features: 1024; Sorter: True | 499 | 590
Batch size: 8; In features: 1024; Sorter: False | 398 | 12
Batch size: 16; In features: 64; Sorter: True | 71 | 540
Batch size: 16; In features: 64; Sorter: False | 57 | 12
Batch size: 16; In features: 256; Sorter: True | 242 | 610
Batch size: 16; In features: 256; Sorter: False | 200 | 12
Batch size: 16; In features: 1024; Sorter: True | 999 | 720
Batch size: 16; In features: 1024; Sorter: False | 842 | 12
Batch size: 32; In features: 64; Sorter: True | 124 | 509
Batch size: 32; In features: 64; Sorter: False | 103 | 12
Batch size: 32; In features: 256; Sorter: True | 477 | 650
Batch size: 32; In features: 256; Sorter: False | 407 | 12
Batch size: 32; In features: 1024; Sorter: True | 1940 | 833
Batch size: 32; In features: 1024; Sorter: False | 1710 | 12
Batch size: 64; In features: 64; Sorter: True | 231 | 590
Batch size: 64; In features: 64; Sorter: False | 194 | 12
Batch size: 64; In features: 256; Sorter: True | 937 | 710
Batch size: 64; In features: 256; Sorter: False | 800 | 13
Batch size: 64; In features: 1024; Sorter: True | 3980 | 1290
Batch size: 64; In features: 1024; Sorter: False | 3330 | 12
Batch size: 128; In features: 64; Sorter: True | 448 | 650
Batch size: 128; In features: 64; Sorter: False | 390 | 13
Batch size: 128; In features: 256; Sorter: True | 1830 | 850
Batch size: 128; In features: 256; Sorter: False | 1590 | 12
Batch size: 128; In features: 1024; Sorter: True | 7790 | 2850
Batch size: 128; In features: 1024; Sorter: False | 6670 | 13
```
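A brief usage sketch matching the benchmarked configurations (my own example):
```python
# Hedged sketch: torch.searchsorted on MPS, with and without a sorter.
import torch

if torch.backends.mps.is_available():
    seq = torch.tensor([[1., 3., 5., 7.], [2., 4., 6., 8.]], device="mps")
    vals = torch.tensor([[3., 6.], [3., 6.]], device="mps")
    print(torch.searchsorted(seq, vals))       # tensor([[1, 3], [1, 2]])

    unsorted = torch.tensor([7., 1., 5., 3.], device="mps")
    sorter = torch.argsort(unsorted)
    print(torch.searchsorted(unsorted, torch.tensor([4.], device="mps"), sorter=sorter))  # tensor([2])
```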
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112829
Approved by: https://github.com/malfet
2023-11-07 17:22:08 +00:00
CaoE
455241bbd3
Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU ( #112138 )
...
Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU.
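A quick sketch exercising the newly supported dtype (my own example):
```python
# Hedged sketch: Half inputs to the listed binary ops on CPU.
import torch

a = torch.randn(4, dtype=torch.half)
b = torch.randn(4, dtype=torch.half)
for op in (torch.atan2, torch.logaddexp, torch.logaddexp2, torch.hypot, torch.nextafter):
    print(op.__name__, op(a, b))
```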
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112138
Approved by: https://github.com/cpuhrsch
2023-11-06 06:01:29 +00:00
CaoE
26b5e27ace
Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU ( #112132 )
...
Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112132
Approved by: https://github.com/cpuhrsch
2023-11-05 12:31:38 +00:00
Li-Huai (Allan) Lin
30237aaeec
[MPS] Fix bug when value is of complex ( #111937 )
...
When the value passed to `fill` is complex, the line `value.toDouble() == 0.0` errors out, saying that converting complex to double will cause overflow. So we should handle the complex value first and only then enter this condition.
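A hedged repro sketch of the failing call (my own example):
```python
# Hedged sketch: fill_ with a complex value on MPS (previously raised an overflow error).
import torch

if torch.backends.mps.is_available():
    t = torch.empty(3, dtype=torch.cfloat, device="mps")
    t.fill_(1 + 2j)
    print(t)    # tensor([1.+2.j, 1.+2.j, 1.+2.j], device='mps:0')
```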
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
2023-10-31 17:50:56 +00:00
CaoE
a310cc8968
Add Half support for kthvalue, cross, hist, and logit on CPU ( #112135 )
...
Add Half support for kthvalue, cross, hist, and logit on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112135
Approved by: https://github.com/cpuhrsch
2023-10-31 09:12:47 +00:00
Peter Bell
bbd5b935e4
Use pytree.tree_leaves everywhere ( #112324 )
...
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.
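The replaced pattern, roughly (my own sketch, using the private `torch.utils._pytree` module):
```python
# Hedged sketch of the refactor: tree_leaves replaces tree_flatten(...)[0].
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": 3}
leaves, _ = pytree.tree_flatten(tree)   # old pattern
leaves2 = pytree.tree_leaves(tree)      # new equivalent
assert leaves == leaves2 == [1, 2, 3]
```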
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327 , #112323
2023-10-30 03:39:04 +00:00
Cao E
1c89ea7f72
Add Half support for softmax and log_softmax on CPU ( #103315 )
...
Add Half support for softmax and log_softmax on CPU.
Note: This introduces a correctness issue with MPS https://github.com/pytorch/pytorch/issues/111416 and https://github.com/pytorch/pytorch/issues/111479 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103315
Approved by: https://github.com/jgong5 , https://github.com/mikaylagawarecki , https://github.com/malfet
2023-10-26 08:38:54 +00:00
Peter Bell
46e80ce58a
[ATen] Support multi dim any and all reductions ( #110310 )
...
This adds a new overload to `all` and `any` with support for multiple reduction dims.
```
all.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
any.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
```
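In Python this surfaces as tuple-of-dims support (my own example):
```python
# Hedged sketch: multi-dim any/all reductions.
import torch

x = torch.randn(2, 3, 4) > 0
print(x.any(dim=(0, 2)))                 # reduces over dims 0 and 2 -> shape (3,)
print(x.all(dim=(0, 1), keepdim=True))   # shape (1, 1, 4)
```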
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110310
Approved by: https://github.com/lezcano , https://github.com/albanD , https://github.com/justinchuby
2023-10-24 21:33:53 +00:00
Li-Huai (Allan) Lin
4b804dac33
[MPS] Add complex support for fill ( #111885 )
...
Fixes #110537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111885
Approved by: https://github.com/malfet
2023-10-24 06:41:10 +00:00
CaoE
4b324a8717
Add Half support for aminmax on CPU ( #106853 )
...
Add Half support for aminmax on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106853
Approved by: https://github.com/cpuhrsch
2023-10-23 17:43:47 +00:00
CaoE
d1afb7d43d
add Half support for multinomial on CPU ( #104178 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5 , https://github.com/kulinseth , https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
CaoE
2a40b7efcb
Add Half support for addcmul, addcdiv, cumsum, and topk on CPU ( #103319 )
...
Add Half support for addcmul, addcdiv, cumsum, and topk on CPU.
Note: This PR will introduce the issue https://github.com/pytorch/pytorch/issues/111454 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103319
Approved by: https://github.com/jgong5 , https://github.com/cpuhrsch
2023-10-19 17:47:45 +00:00
CaoE
8713a1a363
add Half support for bernoulli on CPU ( #104176 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima , https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00
Kurt Mohler
5292a92e03
Add torch.unravel_index ( #110580 )
...
Fixes #35674
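A minimal example of the new API (my own sketch):
```python
# Hedged sketch: torch.unravel_index converts flat indices to per-dim coordinates.
import torch

flat = torch.tensor([3, 7])
coords = torch.unravel_index(flat, (2, 4))
print(coords)    # (tensor([0, 1]), tensor([3, 3]))
```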
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110580
Approved by: https://github.com/lezcano , https://github.com/kulinseth
2023-10-12 00:55:51 +00:00
igm503
95ff51d8ed
[MPS] Add support for Softshrink to MPS Backend ( #110814 )
...
Adds the Softshrink activation function to the MPS backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110814
Approved by: https://github.com/kulinseth
2023-10-11 07:55:39 +00:00
igm503
4b881b0da3
[MPS] add support for sgn to MPS backend ( #110829 )
...
Fixes #86805
Adds support for sgn to the MPS backend.
Notes:
1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.
2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet's recent contributions, however, @Berzeg's implementation works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for MPS so that the backward function works. A usage sketch follows these notes.
3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the PR (he seems to be AWOL since last working on this).
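A small usage sketch (my own example):
```python
# Hedged sketch: sgn on a complex MPS tensor (z / |z|, and 0 for z == 0).
import torch

if torch.backends.mps.is_available():
    z = torch.tensor([3 + 4j, 0j], device="mps")
    print(torch.sgn(z))    # tensor([0.6000+0.8000j, 0.0000+0.0000j], device='mps:0')
```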
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
2023-10-09 16:53:25 +00:00
vfdev-5
d2a2a67fa4
Added new test sample to interpolate op in OpInfo ( #104181 )
...
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier , https://github.com/lezcano
2023-10-09 10:55:56 +00:00
igm503
a389181f2e
[MPS] add support for aten::nextafter ( #109685 )
...
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1722515591
Adds support for aten::nextafter to the MPS backend. Supports float and half types.
Notes:
- I've added nextafter to the output_grad_check XFAILLIST, since neither this nor the CPU implementation has a grad function
- Metal Shading Language 3.1 seems to have a native nextafter() function, so once that's available, this kernel can just call that.
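A usage sketch (my own example):
```python
# Hedged sketch: aten::nextafter on MPS.
import torch

if torch.backends.mps.is_available():
    a = torch.tensor([1.0], device="mps")
    b = torch.tensor([2.0], device="mps")
    print(torch.nextafter(a, b))    # smallest representable float greater than 1.0
```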
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109685
Approved by: https://github.com/kulinseth
2023-10-03 19:20:22 +00:00
PyTorch MergeBot
df3ab70dde
Revert "Added new test sample to interpolate op in OpInfo ( #104181 )"
...
This reverts commit 87f8bc65f8 .
Reverted https://github.com/pytorch/pytorch/pull/104181 on behalf of https://github.com/peterbell10 due to Causing OOM in slow-gradcheck ([comment](https://github.com/pytorch/pytorch/pull/104181#issuecomment-1745472323 ))
2023-10-03 18:07:02 +00:00
vfdev-5
87f8bc65f8
Added new test sample to interpolate op in OpInfo ( #104181 )
...
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier , https://github.com/lezcano
2023-10-02 15:35:48 +00:00
CaoE
9399e0b1ff
add fp16 support for gemm ( #99498 )
...
### Testing
Native matmul vs. mkldnn matmul on SPR (with avx512_fp16 support)
Single core:
Input | Naïve impl / ms | oneDNN / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401
56 cores:
Input | Naïve impl / ms | oneDNN / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 | 18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849 | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 | 7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128 | 2032.523 | 0.705 | 2882.23
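For reference, a hedged sketch of the path being benchmarked (whether the oneDNN-backed kernel is used depends on the build and CPU; this simply exercises fp16 matmul on CPU):
```python
# Hedged sketch: fp16 gemm on CPU.
import torch

a = torch.randn(128, 128, dtype=torch.half)
b = torch.randn(128, 256, dtype=torch.half)
c = a @ b
print(c.dtype, c.shape)    # torch.float16 torch.Size([128, 256])
```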
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99498
Approved by: https://github.com/jgong5 , https://github.com/malfet
2023-09-28 01:03:50 +00:00
Li-Huai (Allan) Lin
ac1e85161e
[MPS] Fix nll_loss with default ignore_index ( #109574 )
...
`-100` should be a valid `ignore_index` as indicated in the linked issue. This PR also cleans up some unnecessary MPSTensor copies.
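A hedged repro sketch of the fixed behavior (my own example):
```python
# Hedged sketch: nll_loss with the default ignore_index (-100) on MPS.
import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    logits = torch.randn(3, 5, device="mps")
    target = torch.tensor([1, -100, 4], device="mps")   # -100 entries are ignored
    loss = F.nll_loss(F.log_softmax(logits, dim=1), target)
```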
Fixes #108148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109574
Approved by: https://github.com/kulinseth
ghstack dependencies: #109557
2023-09-26 04:13:09 +00:00
Li-Huai (Allan) Lin
0087118997
[MPS] Fix mps to cpu copy with storage offset ( #109557 )
...
Fix #108978
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109557
Approved by: https://github.com/DenisVieriu97
2023-09-26 04:13:08 +00:00
CaoE
7c9052165a
add fp16 support for native conv and deconv on CPU ( #99497 )
...
### Testing
Native conv vs. mkldnn conv on SPR (with avx512_fp16 support)
Single core:
Input | Naïve impl / us | oneDNN / us | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 34676789 | 524199.8 | 66.15185
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 33454125 | 349844.4 | 95.62573
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 317650.1 | 2317.677 | 137.0554
IC: 128, OC: 256, kernel: 3, stride: 1, N: 1, L: 64 | 15334.68 | 167.264 | 91.67952
56 cores:
Input | Naïve impl / us | oneDNN / us | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 1032064 | 11073.58 | 93.20061
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 1000097 | 16371.19 | 61.08883
IC: 256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 981813.4 | 9008.908 | 108.9825
IC: 1024, OC: 256, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 1082606 | 10150.47 | 106.6558
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 319980.6 | 181.598 | 1762.027
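A brief sketch of the exercised path (my own example; the oneDNN vs. native dispatch depends on the build and CPU):
```python
# Hedged sketch: fp16 convolution on CPU.
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(torch.half)
x = torch.randn(1, 3, 16, 16, dtype=torch.half)
y = conv(x)
print(y.dtype, y.shape)    # torch.float16 torch.Size([1, 8, 14, 14])
```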
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99497
Approved by: https://github.com/jgong5 , https://github.com/cpuhrsch
2023-09-25 01:31:26 +00:00
igm503
255d1a776a
[MPS] Add support for Mish to MPS backend ( #109786 )
...
Fixes [#77764 (comment)](https://github.com/pytorch/pytorch/issues/77764#issuecomment-1712894444 )
Adds the Mish activation function to the MPS backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109786
Approved by: https://github.com/kulinseth
2023-09-21 21:01:20 +00:00
igm503
0317626df5
[MPS] adding weight_norm_interface support for mps ( #108008 )
...
Fixes #104513
Adds support for aten::_weight_norm_interface to the MPS backend.
Also adds a consistency test for the output and the grad.
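A usage sketch via the public wrapper, which I assume dispatches to `_weight_norm_interface` here (my own example):
```python
# Hedged sketch: weight_norm on MPS, forward plus backward.
import torch

if torch.backends.mps.is_available():
    lin = torch.nn.utils.weight_norm(torch.nn.Linear(4, 4)).to("mps")
    y = lin(torch.randn(2, 4, device="mps"))
    y.sum().backward()
```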
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108008
Approved by: https://github.com/kulinseth
2023-09-20 02:18:28 +00:00
CaoE
54c28c564f
add Half support for BatchNorm on CPU ( #102070 )
...
Fixes #106543
### Testing
Single core:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882
28 cores:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 | 0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343
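A short sketch of the exercised path (my own example):
```python
# Hedged sketch: fp16 BatchNorm on CPU, forward and backward.
import torch

bn = torch.nn.BatchNorm2d(4).to(torch.half)
x = torch.randn(1, 4, 32, 32, dtype=torch.half, requires_grad=True)
y = bn(x)
y.sum().backward()
```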
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5 , https://github.com/mikaylagawarecki , https://github.com/mingfeima
2023-09-19 10:43:33 +00:00
PyTorch MergeBot
be9f73f031
Revert "Add meta and OpInfo for _embedding_bag_dense_backward ( #109211 )"
...
This reverts commit fe14e43d14 .
Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing 492a93d185 https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918 ))
2023-09-14 22:29:12 +00:00
Edward Z. Yang
fe14e43d14
Add meta and OpInfo for _embedding_bag_dense_backward ( #109211 )
...
The sample inputs are a bit involved because there are a lot of shenanigans in the derivative formula. Check the comments.
This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD , https://github.com/zou3519
2023-09-14 18:49:32 +00:00
PyTorch MergeBot
b226373d16
Revert "add Half support for BatchNorm on CPU ( #102070 )"
...
This reverts commit b6a1d3fb97 .
Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to I'm very sorry but it looks like #106543 was not fixed, I still see it failing on main b6a1d3fb97 https://github.com/pytorch/pytorch/actions/runs/6185704949/job/16793975677 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1719747065 ))
2023-09-14 16:13:34 +00:00
CaoE
b6a1d3fb97
add Half support for BatchNorm on CPU ( #102070 )
...
Fixes #106543
### Testing
Single core:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882
28 cores:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 | 0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5 , https://github.com/mikaylagawarecki
2023-09-14 12:23:59 +00:00
PyTorch MergeBot
04a765f95d
Revert "add Half support for BatchNorm on CPU ( #102070 )"
...
This reverts commit 6065e7a97c .
Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to sorry it looks like this is causing an unexpected success for `test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_nn_functional_batch_norm_cpu_float16` 6065e7a97c https://github.com/pytorch/pytorch/actions/runs/6178069462/job/16770849782 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1718402208 ))
2023-09-13 22:38:42 +00:00
Nikita Shulga
916183a012
[MPS] Fix crash if nonzero is called concurrently ( #108996 )
...
Surrounds the `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a no-op for a single-threaded program but serializes calls to synchronize across threads using the same stream.
Prevents the `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` non-recoverable exception, which is triggered every time one uses PyCharm to inspect tensors on an MPS device.
Fixes https://github.com/pytorch/pytorch/issues/100285
### <samp>🤖 Generated by Copilot at 1662ce2</samp>
> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
2023-09-13 19:28:47 +00:00