Sun, Jiayi
c173a9d9b3
add Half support for layer_norm on CPU ( #99590 )
...
### Testing
Single socket (icx, 32 cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8, 8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |
Single core (icx):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8, 8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |
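For reference, a minimal usage sketch of what this enables (not from the PR; my reading of the benchmark labels is that the "mixed" columns mean fp16 input with fp32 weight/bias):
```python
# Hedged sketch: fp16 and mixed fp16/fp32 layer_norm on CPU.
import torch
import torch.nn.functional as F

x = torch.randn(32, 8, 16, dtype=torch.half)            # fp16 input
w = torch.randn(16, dtype=torch.half)
b = torch.randn(16, dtype=torch.half)

y_fp16 = F.layer_norm(x, (16,), w, b)                   # pure fp16 path
y_mixed = F.layer_norm(x, (16,), w.float(), b.float())  # mixed fp16 input / fp32 params
```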
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima , https://github.com/jgong5 , https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
Nikita Shulga
9dda4b20a0
[MPS] Enable select/[broad]cast ops for complex dtypes ( #115727 )
...
By representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` Metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument: a fully specialized `cast` function that is a no-op for regular types but special-cased for float->complex and complex->float conversions.
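As a rough illustration (my own sketch, not from the PR), the gather/scatter paths back slicing and copying of complex tensors:
```python
# Hedged sketch: select (gather) and copy_ (scatter/cast) with complex dtypes on MPS.
import torch

if torch.backends.mps.is_available():
    z = torch.tensor([1 + 1j, 2 + 2j, 3 + 3j], dtype=torch.cfloat, device="mps")
    sliced = z[1:]                                 # gather on a complex dtype
    out = torch.empty(2, dtype=torch.cfloat, device="mps")
    out.copy_(sliced)                              # scatter on a complex dtype
    out.copy_(torch.ones(2, device="mps"))         # exercises the float -> complex cast
```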
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
2023-12-19 02:25:28 +00:00
Peter Pham
74dfdc567b
[MPS] aten::erfinv bug fix: add storage offset buffers to handle slicing ( #105801 )
...
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706
The following test would fail without this bug fix:
```
import torch

def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()
        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()
        print(y)
        print(y2)
        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
2023-12-15 23:14:03 +00:00
Lucas Steuernagel
2e517b20d9
[MPS] Add Conv3D support for MPS ( #114183 )
...
Fixes #77818
I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul's work. Quoting the description here:
> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and does not treat depth-separable convolutions as a normal case (there are no MPS kernels available for either of those so far)
> * requires macOS >= 13.2 (Ventura)
Please, let me know if there are any other changes needed and I'll be happy to implement them.
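A minimal usage sketch (my own example; assumes an Apple Silicon machine running macOS >= 13.2):
```python
# Hedged sketch: Conv3d forward/backward on the MPS backend.
import torch

if torch.backends.mps.is_available():
    conv = torch.nn.Conv3d(3, 8, kernel_size=3, padding=1).to("mps")
    x = torch.randn(1, 3, 8, 16, 16, device="mps", requires_grad=True)  # N, C, D, H, W
    y = conv(x)
    y.sum().backward()           # backward pass is supported as well
    print(y.shape)               # torch.Size([1, 8, 8, 16, 16])
```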
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
2023-12-15 23:05:01 +00:00
mingfeima
a8acd6c410
Add Half support for AvgPool2d on CPU ( #109578 )
...
Add Half support for AvgPool2d (both channels last and channels first) on CPU
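A small sketch of what this covers (my own example, not from the PR):
```python
# Hedged sketch: fp16 AvgPool2d on CPU, channels first and channels last.
import torch

x = torch.randn(2, 3, 8, 8, dtype=torch.half)
pool = torch.nn.AvgPool2d(kernel_size=2)
y_cf = pool(x)                                        # channels first
y_cl = pool(x.to(memory_format=torch.channels_last))  # channels last
torch.testing.assert_close(y_cf, y_cl)
```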
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima , https://github.com/albanD
2023-12-12 12:59:47 +00:00
igm503
f017a1af3f
[MPS] add complex_out to MPS backend ( #110851 )
...
Adds support for at::complex_out to the MPS backend
Implemented as a binary kernel using the view_as_real pattern for handling complex dtypes in the MPS backend.
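For context, a hedged example of the op this adds (my own sketch):
```python
# Hedged sketch: torch.complex on the MPS backend.
import torch

if torch.backends.mps.is_available():
    real = torch.tensor([1.0, 2.0], device="mps")
    imag = torch.tensor([3.0, 4.0], device="mps")
    z = torch.complex(real, imag)    # tensor([1.+3.j, 2.+4.j], device='mps:0')
```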
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110851
Approved by: https://github.com/kulinseth
2023-12-11 13:37:55 +00:00
Li-Huai (Allan) Lin
38e1440bae
[MPS] Remove redundant topk test and move all pad tests inside a class ( #113313 )
...
Summary:
1. The removed `topk` test is essentially the same as the following test, so I removed it:
```python
def test_topk(self):
    def helper(shape):
        cpu_x = torch.randn(shape, device='cpu', dtype=torch.float, requires_grad=False)
        x = cpu_x.detach().clone().to('mps')
        for largest_val in [True, False]:
            if (type(shape) == tuple):
                for curr_dim in range(0, len(shape)):
                    dim_size = shape[curr_dim]
                    for k in range(1, dim_size + 1):
                        topk_values, topk_indices = torch.topk(x, k, dim=curr_dim, largest=largest_val)
                        topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=curr_dim, largest=largest_val)
                        self.assertEqual(topk_values, topk_values_cpu)
                        self.assertEqual(topk_indices, topk_indices_cpu)
            else:
                for k in range(1, shape):
                    topk_values, topk_indices = torch.topk(x, k, dim=0, largest=largest_val)
                    topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=0, largest=largest_val)
                    self.assertEqual(topk_values, topk_values_cpu)
                    self.assertEqual(topk_indices, topk_indices_cpu)

    helper(2)
    helper((5, 1))
    helper((1, 5))
    helper((5, 9, 7, 4))
    helper((50, 20, 7, 4))
```
297c26bb8e/test/test_mps.py (L8054-L8091)
2. Move all pad tests to one standalone class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113313
Approved by: https://github.com/kulinseth
ghstack dependencies: #113312
2023-12-01 06:52:07 +00:00
Li-Huai (Allan) Lin
88a659e752
[MPS] Move non-nll loss tests outside TestNLLLoss ( #113312 )
...
The diff looks messy, but this PR essentially does one thing: move the non-NLL loss tests in the `TestNLLLoss` class to the `TestMPS` class. After doing so, there end up being two stack tests with the same name, `test_stack`; therefore, I renamed one of them to `test_stack_storage_offset`, which is what the test actually does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113312
Approved by: https://github.com/kulinseth
2023-12-01 06:52:07 +00:00
Nikita Shulga
1b27eae65e
[MPS] Fix out-of-bounds fill to sliced tensor ( #114838 )
...
This fixes a regression introduced by https://github.com/pytorch/pytorch/pull/81951 that caused an out-of-bounds access when a sliced tensor is filled with zeros.
Remove the bogus `TORCH_INTERNAL_ASSERT(length >= offset)`, as the [NSMakeRange](https://developer.apple.com/documentation/foundation/1417188-nsmakerange?language=objc ) arguments are a location and a length rather than start and end offsets.
In `fill_mps_tensor_`:
- Pass `value` argument to `MPSStream::fill`
- Pass `self.nbytes()` rather than `self.storage().nbytes()` as the length of the buffer to fill, as the latter always results in an out-of-bounds write when the offset within the storage is non-zero
Add regression test
Fixes https://github.com/pytorch/pytorch/issues/114692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114838
Approved by: https://github.com/atalman , https://github.com/kulinseth
2023-12-01 06:24:42 +00:00
Khushi Agrawal
cff84871ce
[reland][opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops ( #114589 )
...
Previous PR: #113885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114589
Approved by: https://github.com/lezcano
2023-11-27 14:45:44 +00:00
PyTorch MergeBot
150aaf46ca
Revert "[opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops ( #113885 )"
...
This reverts commit 4fa1ff8404 .
Reverted https://github.com/pytorch/pytorch/pull/113885 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but its TestCommonCUDA::test_compare_cpu_nn_functional_conv3d test is failing in trunk 4fa1ff8404 ([comment](https://github.com/pytorch/pytorch/pull/113885#issuecomment-1827268473 ))
2023-11-27 07:33:00 +00:00
Khushi Agrawal
4fa1ff8404
[opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops ( #113885 )
...
Previous PR: https://github.com/pytorch/pytorch/pull/85202
Also, cc'ing @lezcano @kshitij12345 @zou3519, who reviewed my previous PR. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113885
Approved by: https://github.com/lezcano
2023-11-26 13:44:30 +00:00
Nikita Shulga
324cde59b2
[MPS] Fix test_copy_cast_no_leak ( #114313 )
...
When running on macOS 13.2, the test always fails on the first run but succeeds on the second, presumably because some memory is reserved to cache the f32->f16 graph. Make the test resilient against such failures by adding a warmup step in which one conversion is performed before recording driver memory utilization.
Fixes https://github.com/pytorch/pytorch/issues/114305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114313
Approved by: https://github.com/huydhn
2023-11-22 14:48:24 +00:00
Nikita Shulga
b5dd37f23e
[MPS] Fix memory leak in copy_from_mps_ ( #114197 )
...
By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
Leak was introduced by https://github.com/pytorch/pytorch/pull/84928
Add regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL
======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
method(*args, **kwargs)
File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory
To execute this test, run the following from the base repo dir:
python test/test_mps.py -k test_copy_cast_no_leak
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 1.102s
FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s
OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s
OK
...
```
Fixes https://github.com/pytorch/pytorch/issues/114096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114197
Approved by: https://github.com/kit1980
2023-11-21 14:52:55 +00:00
Li-Huai (Allan) Lin
538114db65
[MPS] Fix and refactor unary/binary ops with non-zero offset or non-contiguous output ( #97085 )
...
Fixes #100764
This PR fixes the unary ops implementation and refactors the binary ops implementation a bit.
For unary ops:
Previously we didn't take into account unary ops that have a non-contiguous/storage-offset output, causing incorrect results (because the MPS graph kernel always writes the buffer contiguously). Therefore, this PR first creates a temporary output tensor for the graph and then copies the result back to the original output tensor. We currently do not have a better fix than this, I think.
For binary ops, see https://github.com/pytorch/pytorch/pull/97085#discussion_r1140999125
See the added test for repro.
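A hedged sketch of the failure mode being fixed (illustrative, not the PR's actual test):
```python
# Hedged repro sketch: a unary op writing to a storage-offset, non-contiguous output.
import torch

if torch.backends.mps.is_available():
    x = torch.randn(4, device="mps")
    buf = torch.zeros(9, device="mps")
    out = buf[1:][::2]                 # storage offset 1, stride 2
    torch.exp(x, out=out)              # previously the kernel wrote the buffer contiguously
    torch.testing.assert_close(out.cpu(), torch.exp(x.cpu()))
```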
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97085
Approved by: https://github.com/malfet
2023-11-14 22:03:21 +00:00
Nikita Shulga
265d6aac0b
[MPS] Fix crashes during Conv backward pass ( #113398 )
...
By adding the weights tensor to the MPSGraph cache key.
Add a regression test to validate that the collision no longer happens.
Fixes https://github.com/pytorch/pytorch/issues/112998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113398
Approved by: https://github.com/kulinseth
2023-11-10 04:29:33 +00:00
Li-Huai (Allan) Lin
740137df6f
[MPS] Add bucketize op ( #112830 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112830
Approved by: https://github.com/kulinseth , https://github.com/malfet
ghstack dependencies: #112829
2023-11-07 17:22:08 +00:00
Li-Huai (Allan) Lin
c4bb77323d
[MPS] Add searchsorted op ( #112829 )
...
The Metal kernels implemented here closely follow `Bucketization.cu`.
Benchmark:
```
[----------------------------- searchsorted ----------------------------]
| cpu | mps
1 threads: --------------------------------------------------------------
Batch size: 8; In features: 64; Sorter: True | 44 | 530
Batch size: 8; In features: 64; Sorter: False | 31 | 12
Batch size: 8; In features: 256; Sorter: True | 131 | 520
Batch size: 8; In features: 256; Sorter: False | 107 | 12
Batch size: 8; In features: 1024; Sorter: True | 499 | 590
Batch size: 8; In features: 1024; Sorter: False | 398 | 12
Batch size: 16; In features: 64; Sorter: True | 71 | 540
Batch size: 16; In features: 64; Sorter: False | 57 | 12
Batch size: 16; In features: 256; Sorter: True | 242 | 610
Batch size: 16; In features: 256; Sorter: False | 200 | 12
Batch size: 16; In features: 1024; Sorter: True | 999 | 720
Batch size: 16; In features: 1024; Sorter: False | 842 | 12
Batch size: 32; In features: 64; Sorter: True | 124 | 509
Batch size: 32; In features: 64; Sorter: False | 103 | 12
Batch size: 32; In features: 256; Sorter: True | 477 | 650
Batch size: 32; In features: 256; Sorter: False | 407 | 12
Batch size: 32; In features: 1024; Sorter: True | 1940 | 833
Batch size: 32; In features: 1024; Sorter: False | 1710 | 12
Batch size: 64; In features: 64; Sorter: True | 231 | 590
Batch size: 64; In features: 64; Sorter: False | 194 | 12
Batch size: 64; In features: 256; Sorter: True | 937 | 710
Batch size: 64; In features: 256; Sorter: False | 800 | 13
Batch size: 64; In features: 1024; Sorter: True | 3980 | 1290
Batch size: 64; In features: 1024; Sorter: False | 3330 | 12
Batch size: 128; In features: 64; Sorter: True | 448 | 650
Batch size: 128; In features: 64; Sorter: False | 390 | 13
Batch size: 128; In features: 256; Sorter: True | 1830 | 850
Batch size: 128; In features: 256; Sorter: False | 1590 | 12
Batch size: 128; In features: 1024; Sorter: True | 7790 | 2850
Batch size: 128; In features: 1024; Sorter: False | 6670 | 13
```
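A brief usage sketch matching the benchmarked configurations (my own example):
```python
# Hedged sketch: torch.searchsorted on MPS, with and without a sorter.
import torch

if torch.backends.mps.is_available():
    seq = torch.tensor([[1., 3., 5., 7.], [2., 4., 6., 8.]], device="mps")
    vals = torch.tensor([[3., 6.], [3., 6.]], device="mps")
    print(torch.searchsorted(seq, vals))       # tensor([[1, 3], [1, 2]])

    unsorted = torch.tensor([7., 1., 5., 3.], device="mps")
    sorter = torch.argsort(unsorted)
    print(torch.searchsorted(unsorted, torch.tensor([4.], device="mps"), sorter=sorter))  # tensor([2])
```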
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112829
Approved by: https://github.com/malfet
2023-11-07 17:22:08 +00:00
CaoE
455241bbd3
Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU ( #112138 )
...
Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU.
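A quick sketch exercising the newly supported dtype (my own example):
```python
# Hedged sketch: Half inputs to the listed binary ops on CPU.
import torch

a = torch.randn(4, dtype=torch.half)
b = torch.randn(4, dtype=torch.half)
for op in (torch.atan2, torch.logaddexp, torch.logaddexp2, torch.hypot, torch.nextafter):
    print(op.__name__, op(a, b))
```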
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112138
Approved by: https://github.com/cpuhrsch
2023-11-06 06:01:29 +00:00
CaoE
26b5e27ace
Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU ( #112132 )
...
Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112132
Approved by: https://github.com/cpuhrsch
2023-11-05 12:31:38 +00:00
Li-Huai (Allan) Lin
30237aaeec
[MPS] Fix bug when value is of complex ( #111937 )
...
When the value passed to `fill` is complex, the line `value.toDouble() == 0.0` errors out, saying that converting complex to double will cause overflow. So we should handle the complex value first and only then enter this condition.
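A hedged repro sketch of the failing call (my own example):
```python
# Hedged sketch: fill_ with a complex value on MPS (previously raised an overflow error).
import torch

if torch.backends.mps.is_available():
    t = torch.empty(3, dtype=torch.cfloat, device="mps")
    t.fill_(1 + 2j)
    print(t)    # tensor([1.+2.j, 1.+2.j, 1.+2.j], device='mps:0')
```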
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
2023-10-31 17:50:56 +00:00
CaoE
a310cc8968
Add Half support for kthvalue, cross, hist, and logit on CPU ( #112135 )
...
Add Half support for kthvalue, cross, hist, and logit on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112135
Approved by: https://github.com/cpuhrsch
2023-10-31 09:12:47 +00:00
Peter Bell
bbd5b935e4
Use pytree.tree_leaves everywhere ( #112324 )
...
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.
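The replaced pattern, roughly (my own sketch, using the private `torch.utils._pytree` module):
```python
# Hedged sketch of the refactor: tree_leaves replaces tree_flatten(...)[0].
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": 3}
leaves, _ = pytree.tree_flatten(tree)   # old pattern
leaves2 = pytree.tree_leaves(tree)      # new equivalent
assert leaves == leaves2 == [1, 2, 3]
```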
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327 , #112323
2023-10-30 03:39:04 +00:00
Cao E
1c89ea7f72
Add Half support for softmax and log_softmax on CPU ( #103315 )
...
Add Half support for softmax and log_softmax on CPU.
Note: This introduces a correctness issue with MPS https://github.com/pytorch/pytorch/issues/111416 and https://github.com/pytorch/pytorch/issues/111479 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103315
Approved by: https://github.com/jgong5 , https://github.com/mikaylagawarecki , https://github.com/malfet
2023-10-26 08:38:54 +00:00
Peter Bell
46e80ce58a
[ATen] Support multi dim any and all reductions ( #110310 )
...
This adds a new overload to `all` and `any` with support for multiple reduction dims.
```
all.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
any.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
```
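In Python this surfaces as tuple-of-dims support (my own example):
```python
# Hedged sketch: multi-dim any/all reductions.
import torch

x = torch.randn(2, 3, 4) > 0
print(x.any(dim=(0, 2)))                 # reduces over dims 0 and 2 -> shape (3,)
print(x.all(dim=(0, 1), keepdim=True))   # shape (1, 1, 4)
```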
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110310
Approved by: https://github.com/lezcano , https://github.com/albanD , https://github.com/justinchuby
2023-10-24 21:33:53 +00:00
Li-Huai (Allan) Lin
4b804dac33
[MPS] Add complex support for fill ( #111885 )
...
Fixes #110537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111885
Approved by: https://github.com/malfet
2023-10-24 06:41:10 +00:00
CaoE
4b324a8717
Add Half support for aminmax on CPU ( #106853 )
...
Add Half support for aminmax on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106853
Approved by: https://github.com/cpuhrsch
2023-10-23 17:43:47 +00:00
CaoE
d1afb7d43d
add Half support for multinomial on CPU ( #104178 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5 , https://github.com/kulinseth , https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
CaoE
2a40b7efcb
Add Half support for addcmul, addcdiv, cumsum, and topk on CPU ( #103319 )
...
Add Half support for addcmul, addcdiv, cumsum, and topk on CPU.
Note: This PR will introduce the issue https://github.com/pytorch/pytorch/issues/111454 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103319
Approved by: https://github.com/jgong5 , https://github.com/cpuhrsch
2023-10-19 17:47:45 +00:00
CaoE
8713a1a363
add Half support for bernoulli on CPU ( #104176 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima , https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00
Kurt Mohler
5292a92e03
Add torch.unravel_index ( #110580 )
...
Fixes #35674
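A minimal example of the new API (my own sketch):
```python
# Hedged sketch: torch.unravel_index converts flat indices to per-dim coordinates.
import torch

flat = torch.tensor([3, 7])
coords = torch.unravel_index(flat, (2, 4))
print(coords)    # (tensor([0, 1]), tensor([3, 3]))
```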
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110580
Approved by: https://github.com/lezcano , https://github.com/kulinseth
2023-10-12 00:55:51 +00:00
igm503
95ff51d8ed
[MPS] Add support for Softshrink to MPS Backend ( #110814 )
...
Adds the Softshrink activation function to the MPS backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110814
Approved by: https://github.com/kulinseth
2023-10-11 07:55:39 +00:00
igm503
4b881b0da3
[MPS] add support for sgn to MPS backend ( #110829 )
...
Fixes #86805
Adds support for sgn to the MPS backend.
Notes:
1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.
2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet's recent contributions, however, @Berzeg's implementation works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for MPS so that the backward function works. A usage sketch follows these notes.
3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the PR (he seems to be AWOL since last working on this).
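A small usage sketch (my own example):
```python
# Hedged sketch: sgn on a complex MPS tensor (z / |z|, and 0 for z == 0).
import torch

if torch.backends.mps.is_available():
    z = torch.tensor([3 + 4j, 0j], device="mps")
    print(torch.sgn(z))    # tensor([0.6000+0.8000j, 0.0000+0.0000j], device='mps:0')
```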
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
2023-10-09 16:53:25 +00:00
vfdev-5
d2a2a67fa4
Added new test sample to interpolate op in OpInfo ( #104181 )
...
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier , https://github.com/lezcano
2023-10-09 10:55:56 +00:00
igm503
a389181f2e
[MPS] add support for aten::nextafter ( #109685 )
...
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1722515591
Adds support for aten::nextafter to the MPS backend. Supports float and half types.
Notes:
- I've added nextafter to the output_grad_check XFAILLIST, since neither this nor the CPU implementation has a grad function
- Metal Shading Language 3.1 seems to have a native nextafter() function, so once that's available, this kernel can just call that.
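A usage sketch (my own example):
```python
# Hedged sketch: aten::nextafter on MPS.
import torch

if torch.backends.mps.is_available():
    a = torch.tensor([1.0], device="mps")
    b = torch.tensor([2.0], device="mps")
    print(torch.nextafter(a, b))    # smallest representable float greater than 1.0
```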
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109685
Approved by: https://github.com/kulinseth
2023-10-03 19:20:22 +00:00
PyTorch MergeBot
df3ab70dde
Revert "Added new test sample to interpolate op in OpInfo ( #104181 )"
...
This reverts commit 87f8bc65f8 .
Reverted https://github.com/pytorch/pytorch/pull/104181 on behalf of https://github.com/peterbell10 due to Causing OOM in slow-gradcheck ([comment](https://github.com/pytorch/pytorch/pull/104181#issuecomment-1745472323 ))
2023-10-03 18:07:02 +00:00
vfdev-5
87f8bc65f8
Added new test sample to interpolate op in OpInfo ( #104181 )
...
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier , https://github.com/lezcano
2023-10-02 15:35:48 +00:00
CaoE
9399e0b1ff
add fp16 support for gemm ( #99498 )
...
### Testing
Native matmul vs. mkldnn matmul on SPR (with avx512_fp16 support)
Single core:
Input | Naïve impl / ms | oneDNN / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401
56 cores:
Input | Naïve impl / ms | oneDNN / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 | 18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849 | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 | 7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128 | 2032.523 | 0.705 | 2882.23
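For reference, a hedged sketch of the path being benchmarked (whether the oneDNN-backed kernel is used depends on the build and CPU; this simply exercises fp16 matmul on CPU):
```python
# Hedged sketch: fp16 gemm on CPU.
import torch

a = torch.randn(128, 128, dtype=torch.half)
b = torch.randn(128, 256, dtype=torch.half)
c = a @ b
print(c.dtype, c.shape)    # torch.float16 torch.Size([128, 256])
```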
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99498
Approved by: https://github.com/jgong5 , https://github.com/malfet
2023-09-28 01:03:50 +00:00
Li-Huai (Allan) Lin
ac1e85161e
[MPS] Fix nll_loss with default ignore_index ( #109574 )
...
`-100` should be a valid `ignore_index` as indicated in the linked issue. This PR also cleans up some unnecessary MPSTensor copies.
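A hedged repro sketch of the fixed behavior (my own example):
```python
# Hedged sketch: nll_loss with the default ignore_index (-100) on MPS.
import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    logits = torch.randn(3, 5, device="mps")
    target = torch.tensor([1, -100, 4], device="mps")   # -100 entries are ignored
    loss = F.nll_loss(F.log_softmax(logits, dim=1), target)
```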
Fixes #108148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109574
Approved by: https://github.com/kulinseth
ghstack dependencies: #109557
2023-09-26 04:13:09 +00:00
Li-Huai (Allan) Lin
0087118997
[MPS] Fix mps to cpu copy with storage offset ( #109557 )
...
Fix #108978
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109557
Approved by: https://github.com/DenisVieriu97
2023-09-26 04:13:08 +00:00
CaoE
7c9052165a
add fp16 support for native conv and deconv on CPU ( #99497 )
...
### Testing
Native conv vs. mkldnn conv on SPR (with avx512_fp16 support)
Single core:
Input | Naïve impl / us | oneDNN / us | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 34676789 | 524199.8 | 66.15185
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 33454125 | 349844.4 | 95.62573
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 317650.1 | 2317.677 | 137.0554
IC: 128, OC: 256, kernel: 3, stride: 1, N: 1, L: 64 | 15334.68 | 167.264 | 91.67952
56 cores:
Input | Naïve impl / us | oneDNN / us | Speed up
-- | -- | -- | --
IC: 64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 1032064 | 11073.58 | 93.20061
IC: 128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 1000097 | 16371.19 | 61.08883
IC: 256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 981813.4 | 9008.908 | 108.9825
IC: 1024, OC: 256, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 1082606 | 10150.47 | 106.6558
IC: 256, OC: 256, kernel: 3, stride: 1, N: 1, H: 16, W: 16, G: 1, pad: 0 | 319980.6 | 181.598 | 1762.027
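A brief sketch of the exercised path (my own example; the oneDNN vs. native dispatch depends on the build and CPU):
```python
# Hedged sketch: fp16 convolution on CPU.
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3).to(torch.half)
x = torch.randn(1, 3, 16, 16, dtype=torch.half)
y = conv(x)
print(y.dtype, y.shape)    # torch.float16 torch.Size([1, 8, 14, 14])
```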
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99497
Approved by: https://github.com/jgong5 , https://github.com/cpuhrsch
2023-09-25 01:31:26 +00:00
igm503
255d1a776a
[MPS] Add support for Mish to MPS backend ( #109786 )
...
Fixes [#77764 (comment)](https://github.com/pytorch/pytorch/issues/77764#issuecomment-1712894444 )
Adds the Mish activation function to the MPS backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109786
Approved by: https://github.com/kulinseth
2023-09-21 21:01:20 +00:00
igm503
0317626df5
[MPS] adding weight_norm_interface support for mps ( #108008 )
...
Fixes #104513
Adds support for aten::_weight_norm_interface to the MPS backend.
Also adds a consistency test for the output and the grad.
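A usage sketch via the public wrapper, which I assume dispatches to `_weight_norm_interface` here (my own example):
```python
# Hedged sketch: weight_norm on MPS, forward plus backward.
import torch

if torch.backends.mps.is_available():
    lin = torch.nn.utils.weight_norm(torch.nn.Linear(4, 4)).to("mps")
    y = lin(torch.randn(2, 4, device="mps"))
    y.sum().backward()
```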
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108008
Approved by: https://github.com/kulinseth
2023-09-20 02:18:28 +00:00
CaoE
54c28c564f
add Half support for BatchNorm on CPU ( #102070 )
...
Fixes #106543
### Testing
Single core:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882
28 cores:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 | 0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343
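A short sketch of the exercised path (my own example):
```python
# Hedged sketch: fp16 BatchNorm on CPU, forward and backward.
import torch

bn = torch.nn.BatchNorm2d(4).to(torch.half)
x = torch.randn(1, 4, 32, 32, dtype=torch.half, requires_grad=True)
y = bn(x)
y.sum().backward()
```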
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5 , https://github.com/mikaylagawarecki , https://github.com/mingfeima
2023-09-19 10:43:33 +00:00
PyTorch MergeBot
be9f73f031
Revert "Add meta and OpInfo for _embedding_bag_dense_backward ( #109211 )"
...
This reverts commit fe14e43d14 .
Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing 492a93d185 https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918 ))
2023-09-14 22:29:12 +00:00
Edward Z. Yang
fe14e43d14
Add meta and OpInfo for _embedding_bag_dense_backward ( #109211 )
...
The sample inputs are a bit involved because there are a lot of shenanigans in the derivative formula. Check the comments.
This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD , https://github.com/zou3519
2023-09-14 18:49:32 +00:00
PyTorch MergeBot
b226373d16
Revert "add Half support for BatchNorm on CPU ( #102070 )"
...
This reverts commit b6a1d3fb97 .
Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to I'm very sorry but it looks like #106543 was not fixed, I still see it failing on main b6a1d3fb97 https://github.com/pytorch/pytorch/actions/runs/6185704949/job/16793975677 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1719747065 ))
2023-09-14 16:13:34 +00:00
CaoE
b6a1d3fb97
add Half support for BatchNorm on CPU ( #102070 )
...
Fixes #106543
### Testing
Single core:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882
28 cores:
shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 | 0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5 , https://github.com/mikaylagawarecki
2023-09-14 12:23:59 +00:00
PyTorch MergeBot
04a765f95d
Revert "add Half support for BatchNorm on CPU ( #102070 )"
...
This reverts commit 6065e7a97c .
Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to sorry it looks like this is causing an unexpected success for `test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_nn_functional_batch_norm_cpu_float16` 6065e7a97c https://github.com/pytorch/pytorch/actions/runs/6178069462/job/16770849782 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1718402208 ))
2023-09-13 22:38:42 +00:00
Nikita Shulga
916183a012
[MPS] Fix crash if nonzero is called concurrently ( #108996 )
...
Surrounds the `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a no-op for a single-threaded program but serializes calls to synchronize across threads using the same stream.
Prevents the `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` non-recoverable exception, which is triggered every time one uses PyCharm to inspect tensors on an MPS device.
Fixes https://github.com/pytorch/pytorch/issues/100285
### <samp>🤖 Generated by Copilot at 1662ce2</samp>
> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
2023-09-13 19:28:47 +00:00