Commit Graph

680 Commits

Author SHA1 Message Date
PyTorch MergeBot
d97d962082 Revert "Add decompositions for copy variants of view ops (#128416)"
This reverts commit 68751799b8.

Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))
2024-07-11 22:09:23 +00:00
PyTorch MergeBot
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb21098.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` needs only one bytecode, `BUILD_MAP`, while the factory call `dict()` needs three: `PUSH_NULL`, `LOAD_NAME`, and `CALL`. The factory call is also not safe if users override the `dict` name in `locals` or `globals` (see the example above, where `dict` is rebound to `OrderedDict`).
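
For a rough sense of the runtime difference, here is a quick micro-benchmark sketch (numbers vary by machine and Python version):

```python
import timeit

# The literal compiles to a single BUILD_MAP; the call needs a name lookup plus a call.
print(timeit.timeit("{}", number=10_000_000))      # empty-dict literal
print(timeit.timeit("dict()", number=10_000_000))  # factory call, noticeably slower
```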

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Jiang, Yanbing
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being re-generated on each platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8` of its original size.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was therefore strongly coupled to platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a weight generated on one ISA or platform on another, because the compute format differs when loading the weight onto a device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)
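
For intuition, a minimal sketch of the serialized packing (two int4 values per byte; the nibble order here is illustrative, not necessarily what the kernels use):

```python
import torch

def pack_int4_rowwise(w_int32: torch.Tensor) -> torch.Tensor:
    # [n][k] int32 (values 0..15) -> [n][k // 2] uint8, two 4-bit values per byte
    assert w_int32.dtype == torch.int32 and w_int32.size(-1) % 2 == 0
    w = (w_int32 & 0xF).to(torch.uint8)        # keep only the low 4 bits
    return (w[..., ::2] << 4) | w[..., 1::2]   # pack adjacent pairs into one byte

w = torch.randint(0, 16, (4, 8), dtype=torch.int32)
print(pack_int4_rowwise(w).shape)              # torch.Size([4, 4])
```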

### Performance
Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores).
There is no obvious regression from this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
Tom Ritchford
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00
Li-Huai (Allan) Lin
99967e1119 [MPS][TYPE_PROMOTION] Fix Clamp (#130226)
Summary:
1. Fixed #130201 by adding type promotion.
2. Added proper tests.
3. Found that torch's type promotion differs from numpy's, as follows:

```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype  # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype  # torch.float32
```

~Not sure of the proper way to handle it; it causes the numpy ref tests to fail.~
The reason is here, so I think I'm going to xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)
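
For reference, the torch behavior above follows `torch.result_type` of the operands (a small illustration, not the MPS implementation):

```python
import torch

x = torch.tensor([1.0], dtype=torch.float32)
lo = torch.tensor([1], dtype=torch.int32)
print(torch.result_type(x, lo))       # torch.float32
print(torch.clamp(x, min=lo).dtype)   # torch.float32, unlike numpy's float64
```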

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
2024-07-10 14:27:39 +00:00
PyTorch MergeBot
637cc8d27f Revert "update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)"
This reverts commit 6367f02a0e.

Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main 6367f02a0e ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))
2024-07-10 13:48:32 +00:00
Jiang, Yanbing
6367f02a0e update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being re-generated on each platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8` of its original size.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was therefore strongly coupled to platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a weight generated on one ISA or platform on another, because the compute format differs when loading the weight onto a device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)

### Performance
Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores).
There is no obvious regression from this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-10 07:38:42 +00:00
Tom Ritchford
68751799b8 Add decompositions for copy variants of view ops (#128416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 01:39:09 +00:00
Joel Schlosser
c8ab2e8b63 Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238)
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test-name-specific seed before each iterator call (the default is to set the seed).
    * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`:
* Uncovered a bunch of test issues:
    * Test breakdown (>100 total)
        * A lot of tolerance issues (tweaked tolerance values to fix)
        * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
        * 3 actually broken semantics (for masked tensor; added xfails)
        * 4 Jacobian mismatches (added xfails)
        * 2 nan results (skip for now, need fixing)
        * 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.

**before (no seed setting):**
```
real	0m21.306s
user	0m19.053s
sys	0m5.192s
```

**after (with seed setting):**
```
real	0m21.905s
user	0m19.578s
sys	0m5.390s
```

* Utilizing the above for reproducible sample input generation, this PR adds support for restricting the iterator to a single sample input. This is done via the env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX`, and its usage is included in the repro command.

```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
    self.assertFalse(True)
AssertionError: True is not false

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
    fn(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
    raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.037s

FAILED (errors=1)
```
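A hypothetical sketch of the mechanism (illustrative only; the real logic lives in `TrackedInputIter`, and the names below are made up):

```python
import hashlib
import torch

def seeded_samples(make_sample_iter, test_name, restrict_index=None):
    # Derive a deterministic per-test seed so that sample index N is reproducible across runs.
    base_seed = int(hashlib.sha256(test_name.encode()).hexdigest(), 16) % (2**31)
    it = iter(make_sample_iter())
    idx = 0
    while True:
        torch.manual_seed(base_seed)  # reseed before generating each sample input
        try:
            sample = next(it)
        except StopIteration:
            return
        if restrict_index is None or idx == restrict_index:
            yield idx, sample
        idx += 1
```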
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
2024-07-08 16:06:38 +00:00
PyTorch MergeBot
07450e9713 Revert "[MPS] Add support for autocast in MPS (#99272)"
This reverts commit 6240cfd5c7.

Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/jeanschmidt due to introduced breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2203033719))
2024-07-02 12:29:51 +00:00
Kulin Seth
6240cfd5c7 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet
2024-07-02 01:49:52 +00:00
Huy Do
fdd0a7f9b4 Run test_mps_allocator_module serially (#129340)
Not sure why this test started to fail (maybe a runner update) 8a2fed7e6a/1 or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-07-01 18:44:48 +00:00
Joona Havukainen
5b96a552df Add a check and error message for no support on MPS for conv with output_channels > 2^16 (#129484)
Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on the MPS device with an unsupported value.

The fix for the missing support will come later, as it requires work on the kernel side and will take more time.
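
A repro sketch of the guarded case (assumes an MPS device; that a `RuntimeError` is raised and its exact message are assumptions):

```python
import torch

# More than 2**16 output channels should now be rejected up front on MPS
# instead of silently producing incorrect results.
conv = torch.nn.Conv2d(in_channels=3, out_channels=2**16 + 1, kernel_size=1).to("mps")
x = torch.randn(1, 3, 8, 8, device="mps")
try:
    conv(x)
except RuntimeError as e:
    print("rejected as expected:", e)
```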
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484
Approved by: https://github.com/kulinseth
2024-06-28 20:57:40 +00:00
Manuel Candales
eabe6574c0 [metal] Parameterize group_size in int4_mm test, fix int4mm shader for group_size > 128 (#129628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129628
Approved by: https://github.com/kimishpatel
2024-06-28 15:01:30 +00:00
Nikita Shulga
bc68907caa [EZ][BE] Replace assertTrue with more appropriate checks (#129569)
Based on this comment: https://github.com/pytorch/pytorch/pull/129340#issuecomment-2191228046, i.e.:
- `assertTrue(x == y)` -> `assertEqual(x, y)`
- `assertTrue(not x)` -> `assertFalse(x)`
- `assertTrue(x > y)` -> `assertGreater(x, y)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129569
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
2024-06-26 16:29:59 +00:00
PyTorch MergeBot
b045878f81 Revert "Remove test_mps_allocator_module XFAIL (#129340)"
This reverts commit c888ee3632.

Reverted https://github.com/pytorch/pytorch/pull/129340 on behalf of https://github.com/huydhn due to The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](https://github.com/pytorch/pytorch/pull/129340#issuecomment-2189701706))
2024-06-25 18:37:54 +00:00
Isuru Fernando
e6bfa2958b Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that produce masked loads in the generated Triton code.
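
A conceptual Python sketch of the intended semantics (the actual ATen signature may differ):

```python
import torch

def masked_index_reference(x, mask, index, fill=0.0):
    # Masked gather that tolerates out-of-range indices wherever mask is False,
    # so a compiler can emit a masked load instead of a bounds-checked one.
    safe_index = index.clamp(0, x.numel() - 1)
    gathered = x.flatten()[safe_index]
    return torch.where(mask, gathered, torch.full_like(gathered, fill))

x = torch.arange(4.0)
idx = torch.tensor([0, 2, 100])              # 100 is out of range
mask = torch.tensor([True, True, False])
print(masked_index_reference(x, mask, idx))  # tensor([0., 2., 0.])
```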

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-25 02:45:02 +00:00
Isuru Fernando
5f912f480c Fix max_pool2d decomposition for empty list and integer limits (#129106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129106
Approved by: https://github.com/peterbell10, https://github.com/lezcano, https://github.com/malfet
ghstack dependencies: #129096, #129097
2024-06-24 22:19:42 +00:00
Huy Do
c888ee3632 Remove test_mps_allocator_module XFAIL (#129340)
Not sure why this test starts to fail (maybe runner update) 8a2fed7e6a/1 or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340
Approved by: https://github.com/kit1980
2024-06-24 16:26:38 +00:00
Manuel Candales
749c03406c [metal] Add int4mm weight packing mps kernel, and improved int4mm shader (#128965)
Adds a `_convert_weight_to_int4pack` MPS kernel.
Replaces the previous int4mm Metal shader with a shader authored by @kimishpatel, which improves perf by ~40%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128965
Approved by: https://github.com/malfet
2024-06-23 02:10:46 +00:00
Li-Huai (Allan) Lin
799acd31b4 [MPS] Add lu_factor (#99269)
### 🤖 Generated by Copilot at d75cde1

Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`.
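
Usage sketch (assumes an MPS-capable build; shapes are illustrative):

```python
import torch

A = torch.randn(4, 4, device="mps")
LU, pivots = torch.linalg.lu_factor(A)   # now dispatched to the MPS backend
b = torch.randn(4, 1, device="mps")
x = torch.linalg.lu_solve(LU, pivots, b) # reuse the factorization to solve Ax = b
print(torch.allclose(A @ x, b, atol=1e-4))
```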

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269
Approved by: https://github.com/kulinseth, https://github.com/lezcano
2024-06-20 07:35:29 +00:00
Li-Huai (Allan) Lin
9a7e2519d3 [MPS] Fused Adam & AdamW (#127242)
Summary:

This PR adds fused Adam and AdamW implementations.

Benchmark on Macbook Pro with M1 Max chip and 64GB unified memory:
**Fast math enabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        89
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        90
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        83
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       12      |        94
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       11      |        88
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       12      |        90
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |       100
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       27      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       23      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       27      |       100
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       23      |        98
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       82      |       480
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       72      |       450
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       82      |       450
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       73      |       420
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       91      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       83      |       400
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |       94      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       78      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      170      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      140      |       600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      170      |       600
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      140      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      250      |       890
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      220      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      250      |       830
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      220      |       770
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      270      |       870
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      230      |       840
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      270      |       810
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      240      |       800
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      400      |      1000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      360      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      430      |      2000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      360      |      1300

Times are in milliseconds (ms).
```

**Fast math disabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        79
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       11      |        93
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       10      |        90
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       11      |        91
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |        81
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       34      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       31      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       34      |        95
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       31      |       100
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       94      |       500
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       82      |       430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       92      |       430
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       81      |       390
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       98      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       88      |       430
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |      100      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       88      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      210      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      190      |       610
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      210      |       510
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      190      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      300      |       900
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      260      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      295      |       900
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      260      |       800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      320      |       910
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      280      |       900
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      320      |       900
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      300      |       900
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      500      |      2000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      480      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      540      |      1500
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      480      |      1200

Times are in milliseconds (ms).
```

```python
def profile_fused_adam():
    from torch.optim import adam, adamw
    import torch.utils.benchmark as benchmark

    import itertools

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
        max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else []
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                    label='Fused Adam',
                    sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                    globals=locals(),
                    description= f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242
Approved by: https://github.com/kulinseth, https://github.com/janeyx99
2024-06-18 19:59:50 +00:00
Joona Havukainen
d9eaa224f2 Fixes #128429: NaN in triu op on MPS (#128575)
Fixes the triu op when k > 0 and the lower triangle of the input tensor contains inf, which led to NaNs when computing the result through the complement. Fixed by using the select API instead.

Fixes #128429
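
A repro sketch of the failure mode being fixed (assumes an MPS device):

```python
import torch

# With k > 0 and inf in the lower triangle, the old complement-based triu produced NaNs.
x = torch.full((3, 3), float("inf"), device="mps")
print(torch.triu(x, diagonal=1))  # expected: inf above the diagonal, 0 elsewhere, no NaNs
```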

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128575
Approved by: https://github.com/kulinseth
2024-06-18 03:44:42 +00:00
Nikita Shulga
9035fff2de [BE] Do not test deprecated torch.nn.utils.weight_norm (#128727)
Test `torch.nn.utils.parametrizations.weight_norm` instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128727
Approved by: https://github.com/kit1980
ghstack dependencies: #128726
2024-06-14 19:14:44 +00:00
Nikita Shulga
27458cc097 [BE] Refactor repeated code in test_weight_norm (#128726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128726
Approved by: https://github.com/kit1980
2024-06-14 19:14:44 +00:00
Tom Ritchford
edb45dce85 Add OpInfo entry for as_strided_copy (#127231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231
Approved by: https://github.com/lezcano
2024-06-13 13:58:47 +00:00
Nikita Shulga
0678742924 [MPS] Add Metal implementation of exp op (#128421)
To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for the complex flavor).
Reuse `test_exp1` to check that the accuracy of the `exp` ops is sometimes closer to CPU.

Also fix a bug in the handling of non-contiguous tensors.

Fixes https://github.com/pytorch/pytorch/issues/84936
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128421
Approved by: https://github.com/kulinseth
ghstack dependencies: #128373, #128375
2024-06-13 06:53:17 +00:00
Kulin Seth
8df56afc20 Add support in Python API for the recommended max working set size. (#128289)
Adds a way for users to query the recommended max working set size for Metal on Mac. It plumbs through
https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc

Can be used like
```
max_memory = torch.mps.recommended_max_memory()
print("Recommended Max Memory:", max_memory / (1024 * 1024 * 1024), "GB")
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289
Approved by: https://github.com/malfet
2024-06-12 16:03:57 +00:00
Tom Ritchford
2386045e4f Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-12 09:39:58 +00:00
Joona Havukainen
a5ba9b2858 Fix for addcdiv contiguous problem (#124442)
Fixes issue #118115
Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124442
Approved by: https://github.com/kulinseth
2024-06-06 16:09:18 +00:00
Huy Do
8992141dba Restore MPS testing on MacOS 13 and m2 metal (#127853)
The runners are ready now (https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13); we want to keep some MacOS 13 runners for MPS coverage until MacOS 15 is out.

This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582.

The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853
Approved by: https://github.com/malfet
2024-06-05 14:44:00 +00:00
PyTorch MergeBot
d1fad416a8 Revert "Add aten._unsafe_masked_index (#116491)"
This reverts commit f03f8bc901.

Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))
2024-06-03 15:51:50 +00:00
Isuru Fernando
f03f8bc901 Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that produce masked loads in the generated Triton code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-03 14:44:03 +00:00
Nikita Shulga
045309aa35 [MPS] Enable toch.mm and friends for complex dtypes (#127241)
- Add `supportedFloatingOrComplexType`
- Change the dtype checks to use it
- Extend the low-precision fp32 list to complex types
- Mark conv2d as supported now, as it was previously failing due to tighter accuracy constraints than the same op for the float32 dtype

Fixes https://github.com/pytorch/pytorch/issues/127178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127241
Approved by: https://github.com/janeyx99
2024-05-28 17:56:13 +00:00
Nikita Shulga
4ff9113e3d [MPS] Add _weight_int8pack_mm tests (#127041)
As well as extending the test to cover MV cases (where the A matrix is 1xM). Limit int8 op testing to 32x32 matrix sizes for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041
Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales
2024-05-24 16:08:06 +00:00
jhavukainen
6a539e80dd Update descriptor fields to resolve fft precision issue (#125328)
Fixes #124096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125328
Approved by: https://github.com/kulinseth, https://github.com/malfet
2024-05-22 21:48:49 +00:00
jhavukainen
d28868c7e8 Change skipIfs to xfails in test_mps.py for test_isin (#125412)
Follow-up to #124896 to move the added test to use expectedFailure instead of skip.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125412
Approved by: https://github.com/kulinseth
2024-05-20 20:23:53 +00:00
Nikita Shulga
b8a706a321 [EZ][BE] Use untyped_storage in tests (#125838)
Gets rid of the following warning:
```
/Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if base.storage().data_ptr() != other.storage().data_ptr():
```

(noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 )

Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414
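
The non-deprecated spelling of the check quoted in the warning looks roughly like this:

```python
import torch

base = torch.zeros(4)
other = base.view(2, 2)
# untyped_storage() replaces the deprecated storage() for identity checks like the one above
if base.untyped_storage().data_ptr() == other.untyped_storage().data_ptr():
    print("base and other share storage")
```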

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838
Approved by: https://github.com/albanD
2024-05-09 14:04:21 +00:00
Nikita Shulga
4e29e80bf0 Run MPS tests on MacOS Sonoma (#125801)
Those runners are running 14.4.1, so I wonder if they actually pass CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125801
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-05-09 13:43:12 +00:00
Denis Vieriu
58e045d03c [MPS] Fix strided ELU op (#125692)
Fixes https://github.com/pytorch/pytorch/issues/124834

Summary of changes:

In the case of non-contiguous input, the output would be non-contiguous too. Saving the result to a non-contiguous buffer is not currently supported, so we need two steps: one to allocate a contiguous buffer and a second to scatter the result back to the original output.
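
A repro sketch of the non-contiguous case this fixes (assumes an MPS device):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8, device="mps").t()  # transpose makes the input non-contiguous
print(x.is_contiguous())                  # False
print(F.elu(x)[:2, :2])                   # should now match the CPU result
```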

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125692
Approved by: https://github.com/kulinseth
2024-05-08 01:34:40 +00:00
Denis Vieriu
ba27548679 [MPS] Remove in place views (causes too many crashes) (#124895)
Fixes https://github.com/pytorch/pytorch/issues/96153

Remove in-place views, as they are a general cause of many crashes.
A proper fix to handle views without copies will come in a different PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124895
Approved by: https://github.com/kulinseth
2024-05-08 01:00:37 +00:00
Denis Vieriu
3fb53bb6a7 [MPS] Fix strided mse_loss (#125696)
Fixes https://github.com/pytorch/pytorch/issues/124621

Summary of changes:
- In the case of non-contiguous input, the output would be non-contiguous too. Saving the result to a non-contiguous buffer is not currently supported, so we need two steps: one to allocate a contiguous buffer and a second to scatter the result back to the original output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125696
Approved by: https://github.com/kulinseth
2024-05-08 00:52:26 +00:00
Nikita Shulga
0fd1fc17c3 [MPS] Fix abs for complex types (#125662)
By calling `realPartOfTensor:` if the input type is complex on Sonoma, and falling back to the `at::view_as_real` trick on Ventura.

Split the `unary_op` template into `unary_op` and `unary_op_noresize`, where the latter skips the resize and empty checks.

Marked the `abs`, `isclose` and `nn.functional.softsign` OpInfo tests as supported for complex types.

Fixes https://github.com/pytorch/pytorch/issues/125135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125662
Approved by: https://github.com/kulinseth
2024-05-07 22:15:20 +00:00
Nikita Shulga
30610251ec [MPS] Add naive quantized intmm and .gputrace capture hooks (#125163)
- Implement a very straightforward Metal copy of CPU int4mm kernel
- Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm
- Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfiler`, which can be used to help debug/profile Metal kernels by wrapping the calls with the following
  ```cpp
   if (getMPSProfiler().isCaptureEnabled()) {
     getMPSProfiler().startCapture(__func__, mpsStream);
   }
   ...
   if (getMPSProfiler().isCapturing()) {
     getMPSProfiler().stopCapture(mpsStream);
   }
  ```
  that, if invoked with the `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files in the current working directory, which can later be loaded and used to debug or profile the kernel
<img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a">

- Added `test_int4mm` to TestLinalgMPS, which is mostly a copy-and-paste of the test from `test_linalg`

TODOs:
 - Add weight pack
 - Perf-tune both kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163
Approved by: https://github.com/mikekgfb
2024-05-03 15:20:39 +00:00
Denis Vieriu
a40d6df448 [MPS] Native nonzero implementation (#125355)
Fixes https://github.com/pytorch/pytorch/issues/124850

Replace the previous MPSGraph nonzero construction with the native nonzero op. For older OSes, fall back to CPU (the previous implementation was not reliable and was comparable to CPU in speed).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125355
Approved by: https://github.com/kulinseth
2024-05-03 03:50:58 +00:00
Roy Hvaara
e15da7856c [MPS] Fix overflow in cumsum when dtype is bool (#125318)
`cumsum` and `cumprod` were (are?) buggy for MPS: c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)

A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.

It turns out this issue also affects boolean types:

```python
import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```

In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
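
A workaround sketch mirroring the fix (assumes an MPS device):

```python
import torch

x = torch.ones(128, dtype=torch.bool, device="mps")
# Casting to int32 before cumsum avoids the int8-style overflow shown above.
print(x.to(torch.int32).cumsum(0)[-1])  # tensor(128, ...) instead of -128
```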

Fixes #96614 #106112 #109166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318
Approved by: https://github.com/malfet
2024-05-03 01:19:24 +00:00
Joona Havukainen
c451d108da Implemented isin_Tensor_Tensor_out for MPS backend (#124896)
Addresses issue #124518, adds isin_Tensor_Tensor_out.

Tests added to test_mps.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124896
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-05-01 23:14:05 +00:00
Nikita Shulga
5944a53555 [MPS] Fix nextafter for negative values (#125029)
By changing the logic on older MacOS to:
```cpp
bits += ((input > 0) ^ (input > other)) ? 1 : -1;
```
And use native `nextafter` on MacOS Sonoma (i.e. if Metal 3.1 is available)

TODO:
  - Add tests for infs and denorms

Fixes https://github.com/pytorch/pytorch/issues/124985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125029
Approved by: https://github.com/Skylion007
2024-04-27 02:58:05 +00:00
Nikita Shulga
db3a2d751c [MPS][BE] Error-check linear (#124952)
Validate that all arguments are on the MPS device and that their dtypes are as expected.

Fixes cryptic messages like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))"
RuntimeError: Placeholder storage has not been allocated on MPS device!
```
And hard crashes like
```
% python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))"
```

Fixes https://github.com/pytorch/pytorch/issues/123995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952
Approved by: https://github.com/Skylion007
2024-04-25 23:25:20 +00:00
Nikita Shulga
abf3f90781 [MPS] Fix large copy (#124635)
By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2GB chunks.

Add a regression test, but limit it to machines with 12GB of RAM or more and MacOS 14+, as on MacOS 13 an attempt to allocate a 4GB tensor fails with:
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```

Fixes https://github.com/pytorch/pytorch/issues/124335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635
Approved by: https://github.com/kulinseth
2024-04-22 23:43:11 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before-and-after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Joël Tang
a6a3f2e06b [MPS] Fixes GELU, LeakyRELU and MISH on non-contiguous tensors (#123049)
Fixes the GELU, LeakyReLU and Mish activation functions on non-contiguous tensors (for instance, when a transpose operation was applied to the tensors prior to the MPS operator), for both forward and backward passes.

I also extended tests on the 3 activation functions to check: full-precision and half-precision, contiguous and non-contiguous, and several dims of tensors: scalars, 1D, empty, 2D, > 3D.

I had issues with Mish and GELU activations when asserting the gradients vs. CPU with sum() in some cases, so I reverted to the previous setup of passing a gradient parameter to .backward().
This PR also fixes an issue with LeakyReLU on empty tensors.

Fixes #98212 huggingface/transformers#22468 huggingface/transformers#19353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049
Approved by: https://github.com/kulinseth
2024-04-21 00:12:32 +00:00
Nikita Shulga
5677128cb8 [MPS] Fix crash with binary_cross_entropy is invoked for half dtypes (#124258)
By creating the constants using the input tensor's dtype

One line reproducer:
```
python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))"
```

Before the change
```
loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
After
```
tensor(-33.7812, device='mps:0', dtype=torch.float16)
```

Fixes https://github.com/pytorch/pytorch/issues/124252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258
Approved by: https://github.com/kulinseth
2024-04-18 15:21:01 +00:00
xinan.lin
6fcbeb3489 [ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256)
Add CPU FP16 support for nll_loss and cross_entropy_loss.
Resolve issue #123328.
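
A usage sketch of what this enables (the printed dtype is an assumption):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10, dtype=torch.float16)  # CPU float16 inputs
target = torch.randint(0, 10, (8,))
loss = F.cross_entropy(logits, target)            # supported on CPU with fp16 after this change
print(loss.dtype)
```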

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-04-18 11:44:38 +00:00
Pearu Peterson
d2b0c0a34e Fix index_reduce sampler filter when op_info.variant_test_name is specified (#123375)
As in the title: `index_reduce` sample must correspond to reduction type specified by `variant_test_name`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123375
Approved by: https://github.com/zou3519, https://github.com/peterbell10
2024-04-17 15:31:28 +00:00
FFFrog
acc466751b Add bfloat16 support to binary_cross_entropy for CPU (#123823)
Fixes #123715

As the title states.

But maybe we should pay attention to https://github.com/pytorch/pytorch/pull/33206, which removed half support for CPU about 4 years ago.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123823
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-04-17 09:44:07 +00:00
Joona Havukainen
05289a278c Fix for MPS regression in #122016 and #123178 (#123234)
Fixes #122016 and #123178. This regression is related to an OS-side change that requires a slight adjustment on the PyTorch side to restore the previous behavior. Additionally, we cleared out pre-MacOS 13 workarounds.

Before the fix on MacOS 14.4:

```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 3., 3.], device='mps:0')
```

After the fix:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 1., 3.], device='mps:0')
```

This also fixes complex number initialization and as such makes `nn.functional.rms_norm` pass on MacOS-14+

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-04-03 23:00:57 +00:00
PyTorch MergeBot
feabb645a7 Revert "Handle transposes in second batch of matrices in bmm (#122194)"
This reverts commit 251ad1232b.

Reverted https://github.com/pytorch/pytorch/pull/122194 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/122194#issuecomment-2032806360))
2024-04-02 18:49:28 +00:00
Kulin Seth
251ad1232b Handle transposes in second batch of matrices in bmm (#122194)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122194
Approved by: https://github.com/DenisVieriu97
2024-04-02 17:48:35 +00:00
Nikita Shulga
4c70ab26ef [MPS] Enable index_select for complex types (#122590)
Surprisingly, as of MacOS 14.4, MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them using the `at::view_as_real` trick
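
A sketch of the `view_as_real` emulation described above (illustrative, not the MPS kernel code):

```python
import torch

x = torch.randn(4, dtype=torch.complex64)
idx = torch.tensor([2, 0])
# Select on the real view, then reinterpret the result as complex again.
out = torch.view_as_complex(torch.view_as_real(x).index_select(0, idx))
print(torch.equal(out, x.index_select(0, idx)))  # True
```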

Fixes https://github.com/pytorch/pytorch/issues/122427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590
Approved by: https://github.com/Skylion007
2024-03-25 16:57:35 +00:00
andrewor14
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
Roger Lam
40acc84aaf Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri
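
A repro sketch of the expected (fixed) behavior, assuming an MPS device:

```python
import torch

x = torch.tensor([float("nan"), 2.0], device="mps")
# NaN should propagate through clamp rather than being clamped to a finite bound.
print(torch.clamp(x, min=0.0, max=1.0))  # expected: tensor([nan, 1.], device='mps:0')
```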

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-18 19:38:15 +00:00
PyTorch MergeBot
0cc60a05da Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381)"
This reverts commit ca80d07ac7.

Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR.  Please take a look a reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))
2024-03-15 23:53:05 +00:00
Roger Lam
ca80d07ac7 Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-15 21:54:50 +00:00
Nikita Shulga
5498804ec2 [MPS] Fix naive matmul for BFloat16 (#121731)
Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate

Fixes https://github.com/pytorch/pytorch/issues/121583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731
Approved by: https://github.com/albanD
2024-03-13 14:34:03 +00:00
Nikita Shulga
07330ff7b6 [MPS][BE] Define _compute_tolerances (#121754)
Right now, the logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`.
So move the tolerance definition logic into a shared `_compute_tolerances` function and
only keep the differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.

Also, increase the tolerances for `pow` and `__rpow__` only on MacOS 13.3 or older, and remove the GRAD xfail list entries for those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
2024-03-13 04:08:06 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
Boyuan Feng
35d3adb4b0 Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
# Motivation

In the backward pass of per-parameter-sharding FSDP, each rank performs a reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0th dimension and concatenates all slices along the 1st dimension. Gradient tensors are padded before concatenation when `tensor.size(0) % world_size != 0`.

### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):

Input tensors:
```
AAAA   BBB   CC
AAAA   BBB
       BBB
```

Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```

### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):

Input tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB        DD
       BBB        DD
                  DD
```

Reduce-scatter-copy-in first pad:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Then chunk and cat along dim as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```

The performance of reduce-scatter-copy-in is critical to per-parameter-sharding FSDP. However, implementing reduce-scatter-copy-in by composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.

# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:

```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```

This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and a basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with baselines composing existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.
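
A rough Python reference of the semantics illustrated by the examples above (a sketch composing standard ops for 2-D inputs, not the actual ATen implementation):

```python
import torch

def chunk_cat_reference(tensors, num_chunks):
    # Pad each tensor along dim 0 to a multiple of num_chunks, split it into
    # num_chunks pieces, flatten each piece, and concatenate piece i of every
    # tensor to form row i of the output.
    rows = []
    for i in range(num_chunks):
        pieces = []
        for t in tensors:
            padded_rows = -(-t.size(0) // num_chunks) * num_chunks  # ceil to a multiple
            pad = padded_rows - t.size(0)
            padded = torch.nn.functional.pad(t, (0, 0, 0, pad))     # zero-pad the bottom rows
            pieces.append(padded.chunk(num_chunks, dim=0)[i].reshape(-1))
        rows.append(torch.cat(pieces))
    return torch.stack(rows)

A, B, C = torch.ones(2, 4), torch.ones(3, 3), torch.ones(1, 2)
print(chunk_cat_reference([A, B, C], num_chunks=3).shape)  # torch.Size([3, 9]), as in Example 1
```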

## Requirements on input

1. If input tensors have different ndims, `dim` should be non-negative and less than the ndim of every input tensor. If all input tensors have the same ndim, both negative and non-negative `dim` are supported.
2. For `wrapped_dim`, all tensors should have the same size for dimensions 0, ..., wrapped_dim-1. There is no requirement on the (wrapped_dim, ...) dimensions.
3. `num_chunks` must be positive.
4. The input tensor list must be non-empty, and each input tensor must have at least 1 element.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
2024-03-08 21:48:12 +00:00
Nikita Shulga
9b03a06288 [BE] [MPS] Fix out resize logic in torch.where (#121476)
By deleting `where_mps` and registering an MPS dispatch for `where_kernel`.
As a result of this change, the resizing and type-checking logic is shared between the MPS, CPU and CUDA backends.

Add a test case to `TestMPS.test_where` (it should eventually be removed once `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
2024-03-08 18:59:37 +00:00
andrewor14
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
Tugsbayasgalan Manlaibaatar
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack are planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
Kai
c59b14163b Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2024-02-26 23:04:52 +00:00
Nikita Shulga
53bfae2c06 [MPS] Add torch.fft. support (#119670)
Increase tolerance for `fft` ops; this warrants further investigation, as the error grows with larger matrix dimensions (see https://github.com/pytorch/pytorch/issues/120237 )

When compiling on MacOS13, implement `+[FakeMPSGraphFFTDescriptor descriptor]` as a redispatch to a real thing.
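
A minimal usage sketch, assuming an MPS-enabled build with the FFT support this commit adds:

```python
import torch

# Real-input FFT on the MPS device; the result is a complex tensor of the same length.
x = torch.rand(8, device="mps")
y = torch.fft.fft(x)
print(y.dtype, y.shape)  # torch.complex64 torch.Size([8])
```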

Fixes https://github.com/pytorch/pytorch/issues/78044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119670
Approved by: https://github.com/kulinseth, https://github.com/albanD
2024-02-20 18:23:06 +00:00
Nikita Shulga
eb9a3383c2 [MPS] Add naive std_mean implementation (#119777)
By just calling `std_mps` and `mean` in sequence
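
A small sketch of what the naive composition amounts to, shown on CPU purely for illustration:

```python
import torch

x = torch.rand(4, 8)
std, mean = torch.std_mean(x, dim=1)
# Equivalent to running the two reductions separately, which is what the naive MPS path does.
assert torch.allclose(std, x.std(dim=1))
assert torch.allclose(mean, x.mean(dim=1))
```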

Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer

def bench_var_mean(
    m, n, k,
    dtype = torch.float32,
    device:str = "cpu",
) -> Measurement:
    setup = f"""
     x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
    """

    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec and after 668 and 684 (which probably means that the GPU is not saturated, but the overhead from switching between native and interpreted runtimes is shorter).

Fixes https://github.com/pytorch/pytorch/issues/119663

TODOs:
 - Refactor the codebase and implement a proper composite function (which should be faster)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
2024-02-13 21:51:29 +00:00
Nikita Shulga
15ef52a015 [MPS] Enable conj and conj_physical (#119669)
The former is only available on MacOS 14+, but at least on older MacOS versions it now raises an exception rather than returning a non-conjugated tensor.

Preliminary step for enabling FFT ops (without it `ifft` would never work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
2024-02-13 02:27:51 +00:00
Nikita Shulga
8d8fb9783c [MPS][EZ] Fix cfloat->chalf conversion on MacOS13 (#119681)
By using `view_as_real` when type casting between two complex types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119681
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-12 19:09:10 +00:00
Pearu Peterson
2c91e13afc Add lowerings to special functions (#119187)
As in the title.

In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187
Approved by: https://github.com/peterbell10
2024-02-11 16:35:40 +00:00
Nikita Shulga
4ee8aac432 [MPS] Enable bfloat16 support on MacOS 14 (#119641)
Per the [MPSDataType](https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype/mpsdatatypebfloat16?changes=_11&language=objc) documentation, bfloat16 is supported in MacOS Sonoma or later.

Added missing `MPSDataTypeBFloat16` and `MTLLanguageVersion3_1` enums to `MPSGraphSonomaOps.h`

TODO: Enable more testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119641
Approved by: https://github.com/Skylion007
2024-02-11 16:25:29 +00:00
Nikita Shulga
1d61011c11 [MPS] Add support for complex scalars (#119318)
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS does not yet support double precision types, so downcast them to floats
- Also add `cf`(for complex float)  and `ch`(for complex half) to MPSScalar value union
- Fix complex-scalar-to-view promotion by introducing the `legacy_complex_as_view` helper function, which casts non-complex types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`

Fixes https://github.com/pytorch/pytorch/issues/119088

Test plan: CI (have quite a lot of tests, see new unexpected successes) +  `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
2024-02-08 18:10:59 +00:00
watarungurunnn
d444a3b443 [MPS] fix float32 error on mps, in linalg.matrix_rank and linalg.pinv (#114771)
Fixes #114285

(However, the following NotImplementedError remains:
```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
2024-02-05 15:36:55 +00:00
lancerts
26a2743162 Fix placeholder tensor is empty for relu in mps (#118965)
Fixes #118845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118965
Approved by: https://github.com/malfet
2024-02-03 23:50:35 +00:00
Nikita Shulga
24dd9f42ce [MPS] Fix use_metal_mm condition (#118830)
One should not only look at the stride size but at the dimensions as well, as the strides of `torch.rand(65536, 1)` are `(1, 1)`

Extend test to account for this situation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118830
Approved by: https://github.com/huydhn
2024-02-01 17:53:42 +00:00
Yifu Wang
a1280f0cc6 Add an OpInfo test for split_with_sizes_copy (#118512)
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, adding the `OpInfo` test in a separate PR to establish a healthy baseline.

Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
2024-02-01 07:09:27 +00:00
Sun, Jiayi
2dd4a254a0 add Half support for interpolate operators on CPU (#105648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105648
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 09:07:16 +00:00
Nikita Shulga
1872834247 [MPS] Fix torch.mm correctness for large matrices (#117549)
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K
Solve it by providing a very naive matrix-multiplication Metal shader and calling it if the stride size is greater than 32768 elements. Slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well if tiling is done as follows:
```objc
  NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
  for (int64_t i = 0; i < M; i += tile_size) {
    const auto i_end = std::min(i + tile_size, M);
    NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
    for (int64_t j = 0; j < K; j += tile_size) {
      const auto j_end = std::min(j + tile_size, K);
      MPSGraphTensor* tile = nil;
      for (int64_t k = 0; k < N; k += tile_size) {
        const auto k_end = std::min(k + tile_size, N);
        auto selfChunk = [graph sliceTensor:selfTensor
                                     starts:@[ @(i), @(k) ]
                                       ends:@[ @(i_end), @(k_end) ]
                                    strides:@[ @(1), @(1) ]
                                       name:nil];
        auto otherChunk = [graph sliceTensor:otherTensor
                                      starts:@[ @(k), @(j) ]
                                        ends:@[ @(k_end), @(j_end) ]
                                     strides:@[ @(1), @(1) ]
                                        name:nil];
        auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];

        tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
      }
      [row_chunks addObject:tile];
    }
    auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
    [rows addObject:row];
  }
  return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```

One can always use the Metal MM kernel by defining the `PYTORCH_MPS_PREFER_METAL` environment variable
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
2024-01-17 01:33:08 +00:00
Nikita Shulga
6784030df4 [MPS] Add support for 64-bit index operations (#116942)
But enable it only if `iter.can_use_32bit_indexing()` is False. Add a test for index_select, but enable it only on Sonoma, as all attempts to create a 4Gb+ tensor on Ventura and older fail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116942
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940
2024-01-09 16:56:49 +00:00
Nikita Shulga
ff0f79d3c7 [MPS] Mark torch.[all|any] as working with complex on MacOS14 (#116907)
It was enabled by https://github.com/pytorch/pytorch/pulls/116457, but at the time that PR landed, Sonoma testing was not yet enabled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116907
Approved by: https://github.com/osalpekar, https://github.com/kit1980
2024-01-06 01:10:11 +00:00
Nikita Shulga
b0393ebe9b [MPS] Make test_mps.py passable on Sonoma (#116764)
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as not yet supporting Complex, as attempts to call those crash with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix a regression that re-surfaced in `index_fill` (it used to be broken in Monterey, then fixed in Ventura, and broken again in Sonoma)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
2024-01-05 00:25:47 +00:00
Gao Tianlin
6793b99107 [BugFix] Fix SegFault when torch.all/any dispatched to mps or other backends (#116457)
The old implementation would result in an infinite recursive loop, leading to a stack overflow and a segfault.

If TORCH_SHOW_DISPATCH_TRACE is on with a debug build of PyTorch, we can see the following endless output in the terminal:
```
[call] op=[aten::quantize_per_tensor], key=[AutogradCPU]
  [redispatch] op=[aten::quantize_per_tensor], key=[CPU]
 [call] op=[aten::any.dims], key=[AutogradCPU]
  [redispatch] op=[aten::any.dims], key=[QuantizedCPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::any.dims_out], key=[QuantizedCPU]
    [call] op=[aten::any.dims], key=[QuantizedCPU]
     [call] op=[aten::empty.memory_format], key=[BackendSelect]
      [redispatch] op=[aten::empty.memory_format], key=[CPU]
     [call] op=[aten::any.dims_out], key=[QuantizedCPU]
      [call] op=[aten::any.dims], key=[QuantizedCPU]
       [call] op=[aten::empty.memory_format], key=[BackendSelect]
        [redispatch] op=[aten::empty.memory_format], key=[CPU]
       [call] op=[aten::any.dims_out], key=[QuantizedCPU]
        [call] op=[aten::any.dims], key=[QuantizedCPU]
         [call] op=[aten::empty.memory_format], key=[BackendSelect]
          [redispatch] op=[aten::empty.memory_format], key=[CPU]
         [call] op=[aten::any.dims_out], key=[QuantizedCPU]
          [call] op=[aten::any.dims], key=[QuantizedCPU]
           [call] op=[aten::empty.memory_format], key=[BackendSelect]
            [redispatch] op=[aten::empty.memory_format], key=[CPU]
           [call] op=[aten::any.dims_out], key=[QuantizedCPU]
            [call] op=[aten::any.dims], key=[QuantizedCPU]
             [call] op=[aten::empty.memory_format], key=[BackendSelect]
              [redispatch] op=[aten::empty.memory_format], key=[CPU]
             [call] op=[aten::any.dims_out], key=[QuantizedCPU]
              [call] op=[aten::any.dims], key=[QuantizedCPU]
               [call] op=[aten::empty.memory_format], key=[BackendSelect]
                [redispatch] op=[aten::empty.memory_format], key=[CPU]
               [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                [call] op=[aten::any.dims], key=[QuantizedCPU]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                  [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                  [call] op=[aten::any.dims], key=[QuantizedCPU]
.....
.....
.....
```

Fixes #116452
Fixes #116451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116457
Approved by: https://github.com/malfet
2024-01-04 17:37:17 +00:00
Aaron Gokaslan
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parenthesis i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
Nikita Shulga
09ee96b69d [MPS] Fix CrossEntropyLoss for float16 (#116597)
Looks like neither [`divisionNoNaNWithPrimaryTensor:`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3675593-divisionnonanwithprimarytensor) nor `oneHotWithIndicesTensor:` works for `MPSDataTypeFloat16`, so provide an explicit cast for the one-hot tensor and an alternative implementation using the formula from the official doc, i.e.
> `resultTensor = select(secondaryTensor, primaryTensor / secondaryTensor, 0)`
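
Expressed in plain PyTorch purely for illustration (this is not the actual MPSGraph code, just the same select-based formula):

```python
import torch

num = torch.tensor([1.0, 2.0, 3.0], dtype=torch.half)
den = torch.tensor([2.0, 0.0, 4.0], dtype=torch.half)
# select(den, num / den, 0): return 0 wherever the denominator is 0, avoiding NaN/Inf.
out = torch.where(den != 0, num / den, torch.zeros_like(num))
print(out)  # tensor([0.5000, 0.0000, 0.7500], dtype=torch.float16)
```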

Alas, at the moment it cannot be tested via `test_modules.py`, as it runs only `torch.float32` and `torch.float64` tests (and the `torch.half` implementation is not available for CPU)

Fixes https://github.com/pytorch/pytorch/issues/116095

TODO: Enable testing via TestModules, but will do in separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116597
Approved by: https://github.com/kulinseth
2024-01-03 05:58:26 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could; for a few, I could not figure out the proper fix, so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
PyTorch MergeBot
0978482afa Revert "Implement aten::upsample_linear1d on mps (#115031)"
This reverts commit c6969cb8a9.

Reverted https://github.com/pytorch/pytorch/pull/115031 on behalf of https://github.com/malfet due to Broke lint, will fwd fix and re-land ([comment](https://github.com/pytorch/pytorch/pull/115031#issuecomment-1869693081))
2023-12-26 18:01:49 +00:00
Kai
c6969cb8a9 Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2023-12-26 15:44:21 +00:00
Aaron Gokaslan
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
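
A small, hypothetical before/after illustrating the kind of rewrite this applies:

```python
import operator

pairs = [(2, "b"), (1, "a"), (3, "c")]
by_first_old = sorted(pairs, key=lambda p: p[0])          # unnecessary lambda
by_first_new = sorted(pairs, key=operator.itemgetter(0))  # operator equivalent
assert by_first_old == by_first_new
```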

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00
Sun, Jiayi
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8 ,8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8 ,8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
Nikita Shulga
9dda4b20a0 [MPS] Enable select/[broad]cast ops for complex dtypes (#115727)
By representing `torch.cfloat`/`torch.chalf` as `float2`/`half2` metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument, a fully specialized `cast` function, which is a no-op for regular types but special-cased for float->complex and complex->float conversions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
2023-12-19 02:25:28 +00:00
Peter Pham
74dfdc567b [MPS] aten::erfinv bug fix: add storage offset buffers to handle slicing (#105801)
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706

The following test would fail without this bug fix:

```
import torch
def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()

        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()

        print(y)
        print(y2)

        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
2023-12-15 23:14:03 +00:00
Lucas Steuernagel
2e517b20d9 [MPS] Add Conv3D support for MPS (#114183)
Fixes #77818

I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul's work. Quoting the description here:

> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)

Please, let me know if there are any other changes needed and I'll be happy to implement them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
2023-12-15 23:05:01 +00:00
mingfeima
a8acd6c410 Add Half support for AvgPool2d on CPU (#109578)
Add Half support for AvgPool2d (both channels last and channels first) on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109578
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-12-12 12:59:47 +00:00
igm503
f017a1af3f [MPS] add complex_out to MPS backend (#110851)
Adds support for at::complex_out to the MPS backend

Implemented in a binary kernel using the view_as_real pattern for handling complex dtypes in the mps backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110851
Approved by: https://github.com/kulinseth
2023-12-11 13:37:55 +00:00
Li-Huai (Allan) Lin
38e1440bae [MPS] Remove redundant topk test and move all pad tests inside a class (#113313)
Summary:
1. The removed `topk` test is essentially very similar to the following test, so I removed it:
```python
def test_topk(self):
        def helper(shape):
            cpu_x = torch.randn(shape, device='cpu', dtype=torch.float, requires_grad=False)
            x = cpu_x.detach().clone().to('mps')
            for largest_val in [True, False]:
                if (type(shape) == tuple):
                    for curr_dim in range(0, len(shape)):
                        dim_size = shape[curr_dim]
                        for k in range(1, dim_size + 1):
                            topk_values, topk_indices = torch.topk(x, k, dim=curr_dim, largest=largest_val)
                            topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=curr_dim, largest=largest_val)
                            self.assertEqual(topk_values, topk_values_cpu)
                            self.assertEqual(topk_indices, topk_indices_cpu)
                else:
                    for k in range(1, shape):
                        topk_values, topk_indices = torch.topk(x, k, dim=0, largest=largest_val)
                        topk_values_cpu, topk_indices_cpu = torch.topk(cpu_x, k, dim=0, largest=largest_val)
                        self.assertEqual(topk_values, topk_values_cpu)
                        self.assertEqual(topk_indices, topk_indices_cpu)

        helper(2)
        helper((5, 1))
        helper((1, 5))
        helper((5, 9, 7, 4))
        helper((50, 20, 7, 4))
```
297c26bb8e/test/test_mps.py (L8054-L8091)

2. Move all pad tests to one standalone class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113313
Approved by: https://github.com/kulinseth
ghstack dependencies: #113312
2023-12-01 06:52:07 +00:00
Li-Huai (Allan) Lin
88a659e752 [MPS] Move non-nll loss tests outside TestNLLLoss (#113312)
The diff looks messy, but this PR essentially does one thing: move the non-NLL-loss tests in the `TestNLLLoss` class to the `TestMPS` class. After doing so, there end up being two stack tests with the same name `test_stack`; therefore, I renamed one of them to `test_stack_storage_offset`, which is what the test actually does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113312
Approved by: https://github.com/kulinseth
2023-12-01 06:52:07 +00:00
Nikita Shulga
1b27eae65e [MPS] Fix out-of-bounds fill to sliced tensor (#114838)
This fixes regression introduced by https://github.com/pytorch/pytorch/pull/81951 that caused out-of-bounds access when sliced tensor is filled with zeros

Remove bogus `TORCH_INTERNAL_ASSERT(length >= offset)` as [NSMakeRange](https://developer.apple.com/documentation/foundation/1417188-nsmakerange?language=objc) arguments are location and length rather than start and end offset.

In `fill_mps_tensor_`:
- Pass `value` argument to `MPSStream::fill`
- Pass `self.nbytes()` rather than `self.storage().nbytes()` as the length of the buffer to fill, as the latter always results in an out-of-bounds write if the offset within the storage is non-zero

Add regression test

Fixes https://github.com/pytorch/pytorch/issues/114692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114838
Approved by: https://github.com/atalman, https://github.com/kulinseth
2023-12-01 06:24:42 +00:00
Khushi Agrawal
cff84871ce [reland][opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#114589)
Previous PR: #113885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114589
Approved by: https://github.com/lezcano
2023-11-27 14:45:44 +00:00
PyTorch MergeBot
150aaf46ca Revert "[opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#113885)"
This reverts commit 4fa1ff8404.

Reverted https://github.com/pytorch/pytorch/pull/113885 on behalf of https://github.com/huydhn due to Sorry for reverting you change but its TestCommonCUDA::test_compare_cpu_nn_functional_conv3d test failing in trunk 4fa1ff8404 ([comment](https://github.com/pytorch/pytorch/pull/113885#issuecomment-1827268473))
2023-11-27 07:33:00 +00:00
Khushi Agrawal
4fa1ff8404 [opinfo][fix] conv3d & fix conv{1, 2}d for neg dilation|groups & add ErrorInputs for conv ops (#113885)
Previous PR: https://github.com/pytorch/pytorch/pull/85202

Also, cc'ing @lezcano @kshitij12345 @zou3519, who reviewed my previous PR. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113885
Approved by: https://github.com/lezcano
2023-11-26 13:44:30 +00:00
Nikita Shulga
324cde59b2 [MPS] Fix test_copy_cast_no_leak (#114313)
When running on MacOS-13.2, the test always fails on the first run but succeeds on the second, presumably because some memory is reserved to cache the f32->f16 graph. Make it resilient against such failures by adding a warmup step in which one conversion is performed before recording driver memory utilization.

Fixes https://github.com/pytorch/pytorch/issues/114305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114313
Approved by: https://github.com/huydhn
2023-11-22 14:48:24 +00:00
Nikita Shulga
b5dd37f23e [MPS] Fix memory leak in copy_from_mps_ (#114197)
By always calling `[destBuffer release]` before leaving the scope in which it was allocated.
Leak was introduced by https://github.com/pytorch/pytorch/pull/84928
Add regression test.
Before the change:
```
% python ../test/test_mps.py -v -k test_copy_cast_no_leak --repeat 10
test_copy_cast_no_leak (__main__.TestMemoryLeak) ... FAIL

======================================================================
FAIL: test_copy_cast_no_leak (__main__.TestMemoryLeak)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2554, in wrapper
    method(*args, **kwargs)
  File "/Users/nshulga/git/pytorch/pytorch/build/../test/test_mps.py", line 1064, in test_copy_cast_no_leak
    self.assertTrue(driver_before == driver_after, f"Detected {driver_after-driver_before} bytes leak of GPU memory")
AssertionError: False is not true : Detected 65536 bytes leak of GPU memory

To execute this test, run the following from the base repo dir:
     python test/test_mps.py -k test_copy_cast_no_leak

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 1.102s

FAILED (failures=1)
```
After:
```
% python ../test/test_mps.py -k test_copy_cast_no_leak --repeat 10
.
----------------------------------------------------------------------
Ran 1 test in 0.819s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK
...
```

Fixes https://github.com/pytorch/pytorch/issues/114096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114197
Approved by: https://github.com/kit1980
2023-11-21 14:52:55 +00:00
Li-Huai (Allan) Lin
538114db65 [MPS] Fix and refactor unary/binary ops with non-zero offset or non-contiguous output (#97085)
Fixes #100764

This PR fixes the unary ops implementation and refactors the binary ops implementation a bit.

For unary ops:
Previously we didn't take into account unary ops that have a non-contiguous/storage-offset output, causing an incorrect result (because the MPS graph kernel always writes the buffer contiguously). Therefore, this PR creates a temporary output tensor for the graph first and then copies the result back to the original output tensor. We currently do not have a better fix than this, I think.

For binary ops, see https://github.com/pytorch/pytorch/pull/97085#discussion_r1140999125

See the added test for repro.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97085
Approved by: https://github.com/malfet
2023-11-14 22:03:21 +00:00
Nikita Shulga
265d6aac0b [MPS] Fix crashes during Conv backward pass (#113398)
By adding weights tensor to the MPSGraph cache key.
Add regression test to validate that collision no longer happens

Fixes https://github.com/pytorch/pytorch/issues/112998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113398
Approved by: https://github.com/kulinseth
2023-11-10 04:29:33 +00:00
Li-Huai (Allan) Lin
740137df6f [MPS] Add bucketize op (#112830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112830
Approved by: https://github.com/kulinseth, https://github.com/malfet
ghstack dependencies: #112829
2023-11-07 17:22:08 +00:00
Li-Huai (Allan) Lin
c4bb77323d [MPS] Add searchsorted op (#112829)
The Metal kernels implemented closely follow `Bucketization.cu`.
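
Basic usage of the op, shown on CPU for illustration (the commit adds MPS kernels for the same call):

```python
import torch

boundaries = torch.tensor([1.0, 3.0, 5.0, 7.0])
values = torch.tensor([2.0, 6.0])
print(torch.searchsorted(boundaries, values))  # tensor([1, 3])
```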

Benchmark:
```
[----------------------------- searchsorted ----------------------------]
                                                         |  cpu   |  mps
1 threads: --------------------------------------------------------------
      Batch size: 8; In features: 64; Sorter: True       |    44  |   530
      Batch size: 8; In features: 64; Sorter: False      |    31  |    12
      Batch size: 8; In features: 256; Sorter: True      |   131  |   520
      Batch size: 8; In features: 256; Sorter: False     |   107  |    12
      Batch size: 8; In features: 1024; Sorter: True     |   499  |   590
      Batch size: 8; In features: 1024; Sorter: False    |   398  |    12
      Batch size: 16; In features: 64; Sorter: True      |    71  |   540
      Batch size: 16; In features: 64; Sorter: False     |    57  |    12
      Batch size: 16; In features: 256; Sorter: True     |   242  |   610
      Batch size: 16; In features: 256; Sorter: False    |   200  |    12
      Batch size: 16; In features: 1024; Sorter: True    |   999  |   720
      Batch size: 16; In features: 1024; Sorter: False   |   842  |    12
      Batch size: 32; In features: 64; Sorter: True      |   124  |   509
      Batch size: 32; In features: 64; Sorter: False     |   103  |    12
      Batch size: 32; In features: 256; Sorter: True     |   477  |   650
      Batch size: 32; In features: 256; Sorter: False    |   407  |    12
      Batch size: 32; In features: 1024; Sorter: True    |  1940  |   833
      Batch size: 32; In features: 1024; Sorter: False   |  1710  |    12
      Batch size: 64; In features: 64; Sorter: True      |   231  |   590
      Batch size: 64; In features: 64; Sorter: False     |   194  |    12
      Batch size: 64; In features: 256; Sorter: True     |   937  |   710
      Batch size: 64; In features: 256; Sorter: False    |   800  |    13
      Batch size: 64; In features: 1024; Sorter: True    |  3980  |  1290
      Batch size: 64; In features: 1024; Sorter: False   |  3330  |    12
      Batch size: 128; In features: 64; Sorter: True     |   448  |   650
      Batch size: 128; In features: 64; Sorter: False    |   390  |    13
      Batch size: 128; In features: 256; Sorter: True    |  1830  |   850
      Batch size: 128; In features: 256; Sorter: False   |  1590  |    12
      Batch size: 128; In features: 1024; Sorter: True   |  7790  |  2850
      Batch size: 128; In features: 1024; Sorter: False  |  6670  |    13
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112829
Approved by: https://github.com/malfet
2023-11-07 17:22:08 +00:00
CaoE
455241bbd3 Add Half for aten2, logaddexp, logaddexp2, hypot, and nextafter on CPU (#112138)
Add Half support for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112138
Approved by: https://github.com/cpuhrsch
2023-11-06 06:01:29 +00:00
CaoE
26b5e27ace Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU (#112132)
Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112132
Approved by: https://github.com/cpuhrsch
2023-11-05 12:31:38 +00:00
Li-Huai (Allan) Lin
30237aaeec [MPS] Fix bug when value is of complex (#111937)
When the value passed to `fill` is complex, the line `value.toDouble() == 0.0` errors out, saying that converting complex to double would cause an overflow. So we should handle the complex value first and only then enter this condition.
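
A repro-style sketch of the case being handled (illustrative assumption; requires an MPS device):

```python
import torch

x = torch.empty(3, dtype=torch.cfloat, device="mps")
x.fill_(1 + 2j)  # previously hit the complex-to-double conversion error described above
print(x)
```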

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111937
Approved by: https://github.com/malfet
ghstack dependencies: #111885
2023-10-31 17:50:56 +00:00
CaoE
a310cc8968 Add Half support for kthvalue, cross, hist, and logit on CPU (#112135)
Add Half support for kthvalue, cross, hist, and logit on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112135
Approved by: https://github.com/cpuhrsch
2023-10-31 09:12:47 +00:00
Peter Bell
bbd5b935e4 Use pytree.tree_leaves everywhere (#112324)
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.
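
A hypothetical before/after of the pattern being rewritten:

```python
import torch.utils._pytree as pytree

data = {"a": [1, 2], "b": (3,)}
leaves_old, _ = pytree.tree_flatten(data)  # old pattern: flatten, then discard the spec
leaves_new = pytree.tree_leaves(data)      # new pattern: ask for the leaves directly
assert leaves_old == leaves_new == [1, 2, 3]
```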

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327, #112323
2023-10-30 03:39:04 +00:00
Cao E
1c89ea7f72 Add Half support for softmax and log_softmax on CPU (#103315)
Add Half support for softmax and log_softmax on CPU.
Note: This introduces a correctness issue with MPS https://github.com/pytorch/pytorch/issues/111416 and https://github.com/pytorch/pytorch/issues/111479.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103315
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/malfet
2023-10-26 08:38:54 +00:00
Peter Bell
46e80ce58a [ATen] Support multi dim any and all reductions (#110310)
This adds a new overload to `all` and `any` with support for multiple reduction dims.
```
all.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
any.dims(Tensor self, int[1]? dim=None, bool keepdim=False) -> Tensor
```
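
A minimal usage sketch of the new overload (assuming a build that includes it):

```python
import torch

x = torch.tensor([[True, False], [True, True]])
print(torch.any(x, dim=(0, 1)))  # tensor(True)  -- reduces over both dims at once
print(torch.all(x, dim=(0, 1)))  # tensor(False)
```
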
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110310
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/justinchuby
2023-10-24 21:33:53 +00:00
Li-Huai (Allan) Lin
4b804dac33 [MPS] Add complex support for fill (#111885)
Fixes #110537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111885
Approved by: https://github.com/malfet
2023-10-24 06:41:10 +00:00
CaoE
4b324a8717 Add Half support for aminmax on CPU (#106853)
Add Half support for aminmax on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106853
Approved by: https://github.com/cpuhrsch
2023-10-23 17:43:47 +00:00
CaoE
d1afb7d43d add Half support for multinomial on CPU (#104178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5, https://github.com/kulinseth, https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
CaoE
2a40b7efcb Add Half support for addcmul, addcdiv, cumsum, and topk on CPU (#103319)
Add Half support for addcmul, addcdiv, cumsum, and topk on CPU.
Note: This PR will introduce the issue  https://github.com/pytorch/pytorch/issues/111454.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103319
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-10-19 17:47:45 +00:00
CaoE
8713a1a363 add Half support for bernoulli on CPU (#104176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00
Kurt Mohler
5292a92e03 Add torch.unravel_index (#110580)
Fixes #35674
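
A quick illustration of the new op (flat indices mapped back to per-dimension coordinates):

```python
import torch

flat = torch.tensor([1, 5, 7])
rows, cols = torch.unravel_index(flat, (2, 4))
print(rows, cols)  # tensor([0, 1, 1]) tensor([1, 1, 3])
```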

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110580
Approved by: https://github.com/lezcano, https://github.com/kulinseth
2023-10-12 00:55:51 +00:00
igm503
95ff51d8ed [MPS] Add support for Softshrink to MPS Backend (#110814)
Adds the softshrink activation function to the mps backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110814
Approved by: https://github.com/kulinseth
2023-10-11 07:55:39 +00:00
igm503
4b881b0da3 [MPS] add support for sgn to MPS backend (#110829)
Fixes #86805

Adds support for sgn to MPS backend.

Notes:

1. @malfet self-assigned this when he was working on implementing polar, but from what I can tell, he didn't end up needing to implement it.

2. @Berzeg implemented this last year, before view_as_complex was supported. Because of @malfet's recent contributions, however, @Berzeg's implementation works. I've removed the part of his implementation that dealt with non-complex dtypes (since these can just be passed to at::sign), matched the more recent pattern we've been using in UnaryOps.mm, and thrown in a simple implementation of _efficientzerotensor for MPS, so that the backward function works.
3. @Berzeg deserves a good bit of credit for this, so let me know if there's a way to assign him some without jamming up the PR (he seems to be AWOL since last working on this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110829
Approved by: https://github.com/malfet
2023-10-09 16:53:25 +00:00
vfdev-5
d2a2a67fa4 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-09 10:55:56 +00:00
igm503
a389181f2e [MPS] add support for aten::nextafter (#109685)
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1722515591

Adds support for aten::nextafter to the MPS backend. Supports float and half types.

Notes:
- I've added nextafter to the output_grad_check XFAILLIST since neither this nor the cpu implementations have grad functions
- Metal Shading Language 3.1 seems to have a native nextafter() function, so once that's available, this kernel can just call that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109685
Approved by: https://github.com/kulinseth
2023-10-03 19:20:22 +00:00
PyTorch MergeBot
df3ab70dde Revert "Added new test sample to interpolate op in OpInfo (#104181)"
This reverts commit 87f8bc65f8.

Reverted https://github.com/pytorch/pytorch/pull/104181 on behalf of https://github.com/peterbell10 due to Causing OOM in slow-gradcheck ([comment](https://github.com/pytorch/pytorch/pull/104181#issuecomment-1745472323))
2023-10-03 18:07:02 +00:00
vfdev-5
87f8bc65f8 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-02 15:35:48 +00:00
CaoE
9399e0b1ff add fp16 support for gemm (#99498)
### Testing

Native matmul vs. mkldnn matmul  on SPR (with avx512_fp16 support)

single core:

Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401

56 cores:
Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 |  18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849  | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 |  7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128  | 2032.523 | 0.705 | 2882.23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99498
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-09-28 01:03:50 +00:00
Li-Huai (Allan) Lin
ac1e85161e [MPS] Fix nll_loss with default ignore_index (#109574)
`-100` should be a valid `ignore_index` as indicated in the linked issue. This PR also cleans up some unnecessary MPSTensor copies.
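
For illustration, the default `ignore_index=-100` case covered by the fix, shown on CPU:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)
target = torch.tensor([1, -100, 4])  # the -100 entry is ignored by default
loss = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(loss)
```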

Fixes #108148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109574
Approved by: https://github.com/kulinseth
ghstack dependencies: #109557
2023-09-26 04:13:09 +00:00
Li-Huai (Allan) Lin
0087118997 [MPS] Fix mps to cpu copy with storage offset (#109557)
Fix #108978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109557
Approved by: https://github.com/DenisVieriu97
2023-09-26 04:13:08 +00:00
CaoE
7c9052165a add fp16 support for native conv and deconv on CPU (#99497)
### Testing

Native conv vs. mkldnn conv on SPR (with avx512_fp16 support)

Single core:

Input | Naïve impl   / us | oneDNN /   us | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 34676789 | 524199.8 | 66.15185
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 33454125 | 349844.4 | 95.62573
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 317650.1 | 2317.677 | 137.0554
IC: 128, OC: 256, kernel: 3, stride: 1,   N: 1, L: 64 | 15334.68 | 167.264 | 91.67952

56 cores:
Input | Naïve impl   / us | oneDNN /   us | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 1032064 | 11073.58 | 93.20061
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 1000097 | 16371.19 | 61.08883
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 981813.4 | 9008.908 | 108.9825
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 1082606 | 10150.47 | 106.6558
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 319980.6 | 181.598 | 1762.027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99497
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-09-25 01:31:26 +00:00
igm503
255d1a776a [MPS] Add support for Mish to MPS backend (#109786)
Fixes https://github.com/pytorch/pytorch/issues/77764#issuecomment-1712894444

Adds the mish activation function to the mps backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109786
Approved by: https://github.com/kulinseth
2023-09-21 21:01:20 +00:00
igm503
0317626df5 [MPS] adding weight_norm_interface support for mps (#108008)
Fixes #104513

Adds support for aten::_weight_norm_interface to the mps backend.

Also adds a consistency test for the output and the grad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108008
Approved by: https://github.com/kulinseth
2023-09-20 02:18:28 +00:00
CaoE
54c28c564f add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki, https://github.com/mingfeima
2023-09-19 10:43:33 +00:00
PyTorch MergeBot
be9f73f031 Revert "Add meta and OpInfo for _embedding_bag_dense_backward (#109211)"
This reverts commit fe14e43d14.

Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing 492a93d185 https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918))
2023-09-14 22:29:12 +00:00
Edward Z. Yang
fe14e43d14 Add meta and OpInfo for _embedding_bag_dense_backward (#109211)
The sample inputs are a bit involved because there are a lot of
shenanigans in the derivative formula. Check the comments.

This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-09-14 18:49:32 +00:00
PyTorch MergeBot
b226373d16 Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit b6a1d3fb97.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to I'm very sorry but it looks like #106543 was not fixed, I still see it failing on main b6a1d3fb97 https://github.com/pytorch/pytorch/actions/runs/6185704949/job/16793975677 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1719747065))
2023-09-14 16:13:34 +00:00
CaoE
b6a1d3fb97 add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-14 12:23:59 +00:00
PyTorch MergeBot
04a765f95d Revert "add Half support for BatchNorm on CPU (#102070)"
This reverts commit 6065e7a97c.

Reverted https://github.com/pytorch/pytorch/pull/102070 on behalf of https://github.com/clee2000 due to sorry it looks like this is causing an unexpected success for `test_jit_fuser_te.py::TestNNCOpInfoCPU::test_nnc_correctness_nn_functional_batch_norm_cpu_float16` 6065e7a97c https://github.com/pytorch/pytorch/actions/runs/6178069462/job/16770849782 ([comment](https://github.com/pytorch/pytorch/pull/102070#issuecomment-1718402208))
2023-09-13 22:38:42 +00:00
Nikita Shulga
916183a012 [MPS] Fix crash if nonzero is called concurrently (#108996)
Surrounds the `stream->synchronize()` call with `dispatch_sync(stream->queue(), ^{});`, which is a no-op for a single-threaded program but serializes calls to synchronize across the threads using the same stream.

Prevents the non-recoverable `[IOGPUMetalCommandBuffer validate]:215: failed assertion 'commit an already committed command buffer'` exception, which is triggered every time one uses PyCharm to inspect tensors on the MPS device

Fixes https://github.com/pytorch/pytorch/issues/100285
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 1662ce2</samp>

> _Sing, O Muse, of the swift and skillful coders_
> _Who fixed the dreadful deadlock of the stream_
> _That crashed the mighty tensors of the MPS_
> _When they sought out the nonzero elements._

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108996
Approved by: https://github.com/kulinseth
2023-09-13 19:28:47 +00:00
CaoE
6065e7a97c add Half support for BatchNorm on CPU (#102070)
Fixes #106543

### Testing

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.7116 | 0.1427 | 0.1744 | 0.2638 | 0.2002 | 0.2556
(1, 32, 100, 100) | 0.8579 | 0.1725 | 0.2077 | 0.3023 | 0.2399 | 0.2995
(32, 16, 200, 200) | 57.3466 | 12.2179 | 13.1320 | 45.9524 | 24.1526 | 24.9882

28 cores:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
(1, 4, 256, 256) | 0.2571 | 0.0713 | 0.0846 | 0.1140 | 0.0883 |  0.1043
(1, 32, 100, 100) | 0.1077 | 0.0510 | 0.0548 | 0.0700 | 0.0645 | 0.0713
(32, 16, 200, 200) | 5.5060 | 1.4195 | 1.4663 | 6.773 | 3.0886 | 3.1343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102070
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-09-13 17:30:16 +00:00
igm503
1b9b3a2d15 [MPS] Adding lgamma, digamma, and polygamma implementations (#106292)
Fixes issue mentioned in #77764

e.g. https://github.com/pytorch/pytorch/issues/77764#issuecomment-1654111744

Adds MPS support for the following ops:

- lgamma
- mvlgamma
- digamma
- polygamma

The lgamma function does not yet have an MPS backend implementation. I've added one using a custom Metal kernel (following John D. Cook's C++ implementation of the log gamma function: https://www.johndcook.com/blog/cpp_gamma/). For the backward pass op, I've added a digamma kernel that follows the cpu+cuda digamma implementation, and for the backward pass of the digamma op, I've added a polygamma + trigamma kernel following, again, the cpu+cuda implementations.

NOTE:

The CPU implementation of the polygamma function incorrectly (as far as I can tell) outputs a finite number for order = 1 and x in the negative integers. The MPS implementation correctly outputs infinity. (see https://github.com/pytorch/pytorch/issues/106692)

The polygamma tests currently don't pass because of the error in the CPU+CUDA kernels, but also because there are small discrepancies near the negative integers between the CPU+CUDA and the MPS polygamma and trigamma kernels. I'm not sure exactly why this is, but let me know if the discrepancies are too big.
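
A quick user-level cross-check of the new ops against the CPU reference (sketch only; assumes an MPS device is available, and the tolerances are illustrative):

```python
import torch

x = torch.tensor([0.5, 1.5, 2.5, 4.0])
for fn in (torch.lgamma, torch.digamma, lambda t: torch.polygamma(1, t)):
    cpu_out = fn(x)
    mps_out = fn(x.to("mps")).cpu()
    # Small numerical differences between the Metal and CPU kernels are expected.
    print(torch.allclose(cpu_out, mps_out, atol=1e-5, rtol=1e-4))
```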

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106292
Approved by: https://github.com/kulinseth
2023-09-12 16:43:37 +00:00
Li-Huai (Allan) Lin
293d3b89d8 Add Opinfos for the Tensor overload of linspace/logspace (#107958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107958
Approved by: https://github.com/zou3519
2023-09-11 22:30:19 +00:00
Nikita Shulga
9b12a28d89 [MPS] Implement mul operation for complex types (#108395)
Using existing BinaryKernel template

Add `mul` as well as `kron` and `outer` to the list of MPS ops that support complex types

This should add all the missing ops mentioned in https://github.com/pytorch/pytorch/issues/105665
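
User-level sketch of the ops this enables for complex dtypes on MPS (assumes an MPS-enabled build; the values are arbitrary):

```python
import torch

a = torch.tensor([1 + 2j, 3 - 1j], dtype=torch.complex64, device="mps")
b = torch.tensor([2 - 1j, 0 + 4j], dtype=torch.complex64, device="mps")

print((a * b).cpu())            # elementwise complex multiplication
print(torch.outer(a, b).cpu())  # 2x2 complex outer product
print(torch.kron(a, b).cpu())   # length-4 complex Kronecker product
```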
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108395
Approved by: https://github.com/albanD
ghstack dependencies: #108393, #108394
2023-09-10 05:39:12 +00:00
Nikita Shulga
c7bb842d35 [MPS] Add complex add/sub (#108394)
Using `view_as_real` and running elementwise ops on the resulting tensors.
Add `add` and `sub` to the list of complex ops that should work on MPS.
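
The same decomposition can be checked from Python: complex add/sub act componentwise, so adding the `view_as_real` views matches complex addition exactly (device-agnostic illustrative sketch):

```python
import torch

a = torch.randn(4, dtype=torch.complex64)
b = torch.randn(4, dtype=torch.complex64)

# view_as_real exposes a trailing [real, imag] dimension of size 2.
via_real = torch.view_as_complex(torch.view_as_real(a) + torch.view_as_real(b))
assert torch.equal(a + b, via_real)
```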
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108394
Approved by: https://github.com/albanD
ghstack dependencies: #108393
2023-09-10 05:39:12 +00:00
Nikita Shulga
53a4ca4b58 [MPS][BE] Add dispatch_sync_with_rethrow (#108393)
And enable testing for match_output for complex types.
Most of them should throw an "unsupported XYZ" error, rather than crash.
This fixed several crashes when linalg ops were invoked with complex inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108393
Approved by: https://github.com/kit1980, https://github.com/kulinseth
2023-09-10 02:07:12 +00:00
alexdremov
b60273b88a [MPS] Pixel shuffle unshuffle support (#99306)
Fixes #83196

Now, the MPS implementation is blazingly fast.

Though, I have several questions about improving this PR:

1. I copied code from `test_nn.py`. Is there a better way to test this?
2. I decided to use `usepixelshuffleorder:YES`. Am I right performance-wise? According to the docs:
```
`usePixelShuffleOrder` can be
used to control how the data within spatial blocks is ordered in the
`depthAxis` dimension: with `usePixelShuffleOrder=YES` the values within the
spatial blocks are stored contiguosly within the `depthAxis` dimension whereas
otherwise they are stored interleaved with existing values in the `depthAxis` dimension.
```
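
For reference, the user-visible round trip the tests exercise looks roughly like this (sketch assuming an MPS device; the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 8, 8, device="mps")
y = F.pixel_shuffle(x, upscale_factor=2)      # (1, 4, 16, 16)
z = F.pixel_unshuffle(y, downscale_factor=2)  # back to (1, 16, 8, 8)
assert torch.equal(x, z)                      # the round trip is exact
```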

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99306
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-09-06 09:11:39 +00:00
CaoE
42f94d7e9f add Half support for maxpool on CPU (#98819)
### Testing
Single socket (28 cores):

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 4.12895 | 6.9669 | 5.30297 | 0.55775 | 1.98917 | 0.72233
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 0.85093 | 1.88813 | 1.38063 | 5.5742 | 36.5086 | 10.58552
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 22.37212 | 37.90383 | 30.94482 | 6.85868 | 10.6116 | 3.9993
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 5.41658 | 4.71098 | 4.66578 | 6.69875 | 14.7171 | 5.1167
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 10.69831 | 18.0468 | 13.71657 | 2.61192 | 4.96172 | 1.68635
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.52637 | 2.0096 | 2.0055 | 2.60314 | 7.2093 | 2.49843
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.47605 | 0.88398 | 0.65326 | 0.06525 | 0.115489 | 0.0674
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.10902 | 0.25293 | 0.157475 | 0.11386 | 0.53319 | 0.17836

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 90.9809 | 163.473 | 126.1276 | 6.57721 | 41.40833 | 11.82505
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 9.88405 | 38.39137 | 29.62069 | 7.10636 | 36.97535 | 11.0525
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 476.782 | 855.4769 | 648.2248 | 46.6488 | 219.2586 | 67.10599
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 80.29271 | 91.33854 | 87.80345 | 48.81692 | 203.9974 | 63.39004
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 235.2113 | 419.0799 | 315.4284 | 20.6049 | 107.1524 | 32.39169
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 29.47653 | 33.54905 | 32.82823 | 22.59674 | 98.5586 | 30.05763
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 7.90684 | 13.9208 | 10.03272 | 0.23725 | 1.35269 | 0.41728
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.33638 | 3.36894 | 2.64635 | 0.26535 | 1.244 | 0.38895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98819
Approved by: https://github.com/mingfeima, https://github.com/mikaylagawarecki
2023-09-05 18:23:41 +00:00
Nikita Shulga
bae409388c [MPS] Fix .item() for multi-dim scalar (#107913)
By refactoring `_local_scalar_dense_mps` to use `_empty_like` to allocate the CPU tensor.
Also, print a more reasonable error message when the dst dim is less than the src in `mps_copy_`.

This fixes regression introduced by https://github.com/pytorch/pytorch/pull/105617 and adds regression test.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at abd06e6</samp>

> _Sing, O Muse, of the valiant deeds of the PyTorch developers_
> _Who strive to improve the performance and usability of tensors_
> _And who, with skill and wisdom, fixed a bug in the MPS backend_
> _That caused confusion and dismay to many a user of `item()`_

Fixes https://github.com/pytorch/pytorch/issues/107867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107913
Approved by: https://github.com/albanD
2023-08-31 21:08:29 +00:00
vfdev
b7624fc91e Cleaned up test_mps.py::test_output*_match (#108092)
Description:
- cleaned up test_mps.py::test_output_match and test_mps.py::test_output_grad_match tests
  - removed unused variables and useless brackets
  - simplified atol/rtol setup if/else code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108092
Approved by: https://github.com/kulinseth
2023-08-29 10:46:02 +00:00
Nikita Shulga
6e85a68829 [MPS] Implement polar via metal shader (#107324)
Use `view_as_real` to cast complex into a pair of floats; then it becomes just another binary operator.

Enable `polar` and `view_as_complex` consistency tests, but skip `test_output_grad_match_polar_cpu` as the `mul` operator is not yet supported.

Remove redundant `#ifdef __OBJC__`, and re-throw exceptions captured during the `createCacheBlock` block.
Fixes https://github.com/pytorch/pytorch/issues/78503

TODOs(in followup PRs):
  - Implement backwards (requires complex mul and sgn)
  - Measure the perf impact of computing the strides on the fly rather than ahead of time (unrelated to this PR)

Partially addresses https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107324
Approved by: https://github.com/albanD
2023-08-25 03:16:23 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems like there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems like there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
arunppsg
4bfc55ba8b [MPS] Enable forward test for renorm (#106666)
Enabled forward test for renorm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106666
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-08-17 16:46:06 +00:00
Jason Lu
bc88028e8e Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743)
Summary:
Original commit changeset: 81319beb97f3

Original Phabricator Diff: D47961182

Test Plan: revert to maintain backward compat with legacy ads_dper3 production package. Read details in: S357822

Reviewed By: atuljangra

Differential Revision: D48131623

@diff-train-skip-merge
(D48131623 landed internally)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106743
Approved by: https://github.com/malfet
2023-08-08 15:27:34 +00:00
Ramin Azarmehr
cdfd0ea162 [MPS] Introduce torch.mps.Event() APIs (#102121)
- Implement `MPSEventPool` to recycle events.
- Implement python bindings with `torch.mps.Event` class using the MPSEventPool backend. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()`.
- Add API to measure elapsed time between two event recordings.
- Added documentation for Event class to `mps.rst`.
- Added test case to `test_mps.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-08-08 03:45:45 +00:00
Li-Huai (Allan) Lin
d4d086ce7b [MPS] Fix Clamp with strided outputs/inputs (#97858)
Fixes #94396
Fixes #87348

1. If output is strided, we don't gather input tensors.
2. If output is not strided but min_t or max_t is strided, we make min_t or max_t contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97858
Approved by: https://github.com/kulinseth
2023-08-04 09:32:12 +00:00
Peter Stefek
c9c2b14c53 Fix copy_ broadcast behavior on mps (#105617)
Fixes #105277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105617
Approved by: https://github.com/malfet
2023-08-03 04:03:32 +00:00
PyTorch MergeBot
d83b887f2a Revert "Add error checking for padding modules (#106147)"
This reverts commit 0547b6279d.

Reverted https://github.com/pytorch/pytorch/pull/106147 on behalf of https://github.com/jeanschmidt due to sadly it is breaking internal builds, and I can't coordinate a FF due to timezone differences ([comment](https://github.com/pytorch/pytorch/pull/106147#issuecomment-1661870970))
2023-08-02 09:37:40 +00:00
Denis Vieriu
d1a2aa1909 [MPS] Fix MPS clamp issue with different dtypes between input and min/max tensors (#105747)
- Fix the FP16 clamp issue (FP32 and FP16 are not broadcast compatible)
- Fix clamp (cached graph nodes were previously replaced with the cast version)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105747
Approved by: https://github.com/kulinseth
2023-08-02 02:51:34 +00:00
Peter Stefek
97e5055a69 Add cumprod support for device mps (#104688)
Related to #77764

Add support for the cumprod operation (which in turn allows its gradient). This also allows us to compute the gradient of prod since it was blocked behind cumprod in the case where exactly one element of the tensor was 0.
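
A small sketch of the case mentioned above, where the gradient of `prod` needs `cumprod` because exactly one element is zero (assumes an MPS device):

```python
import torch

x = torch.tensor([2.0, 0.0, 3.0], device="mps", requires_grad=True)
print(torch.cumprod(x.detach(), dim=0).cpu())  # now supported natively: [2., 0., 0.]

x.prod().backward()   # gradient of prod w.r.t. each element = product of the others
print(x.grad.cpu())   # expected: [0., 6., 0.]
```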

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104688
Approved by: https://github.com/kulinseth
2023-08-01 21:51:20 +00:00
Mikayla Gawarecki
0547b6279d Add error checking for padding modules (#106147)
Fixes https://github.com/pytorch/pytorch/issues/105627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106147
Approved by: https://github.com/albanD
ghstack dependencies: #106325
2023-08-01 12:49:58 +00:00
Mikayla Gawarecki
d8e5f2aa6d Reland "Make adding buffers more like adding parameters (#104069)" (#106224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106224
Approved by: https://github.com/atalman, https://github.com/albanD
2023-07-31 17:18:56 +00:00
cyy
b8eb827d93 use UBSAN on some tests (#103655)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103655
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-07-24 14:24:49 +00:00
Peter Pham
bba06ad751 [MPS] aten::erfinv metal kernel ops (#101507)
I've added an implementation of erfinv using the algorithm from 4154c8ea15/aten/src/ATen/native/Math.h (L152) so that the MPS-based algorithm matches the CPU automatic test. This PR uses the new Metal API calls from https://github.com/pytorch/pytorch/pull/100661

Testing shows MPS has a decent speedup (~270x) compared to CPU on a tensor of ~200 million elements.
```
import torch
x = torch.arange(-1, 1, 1e-8) # default cpu tensor
#measure CPU compute time by calling torch.erfinv
time = %timeit -o -q -r 5 torch.erfinv(x)
cpu_time = time.average
print("CPU torch.erfinv time: ", cpu_time)
x = x.to("mps")
# measure MPS compute time
time = %timeit -o -q -r 5 torch.erfinv(x)
mps_time = time.average
print("MPS torch.erfinv time: ", mps_time)
print(f"MPS torch.erfinv is {cpu_time/mps_time*100} percent faster than CPU torch.erfinv")

# compute MSE between MPS and CPU torch.erfinv
x = x.to("cpu")
y_cpu = torch.erfinv(x)
x = x.to("mps")
y_mps = torch.erfinv(x)
y_mps = y_mps.to("cpu")
mask = torch.isfinite(y_cpu) & torch.isfinite(y_mps.to("cpu"))
y_mps = y_mps[mask]
y_cpu = y_cpu[mask]
x = x[mask]
print(f"length of y_mps: {len(y_mps)}, length of y_cpu: {len(y_cpu)}, length of x: {len(x)}")
mse = torch.square(y_cpu - y_mps).mean()
print("MSE between MPS and CPU torch.erfinv: ", mse)
diff = torch.abs(y_cpu - y_mps)
print("Largest difference")
print(f"x:  {x[torch.argmax(diff)]}, y_cpu: {y_cpu[torch.argmax(diff)]}, y_mps: {y_mps[torch.argmax(diff)]} , diff = {y_cpu[torch.argmax(diff)] - y_mps[torch.argmax(diff)]}")
```
CPU torch.erfinv time:  2.654937833400254
MPS torch.erfinv time:  0.009831255332002912
MPS torch.erfinv is 27005.07456822776 percent faster than CPU torch.erfinv
length of y_mps: 199999992, length of y_cpu: 199999992, length of x: 199999992
MSE between MPS and CPU torch.erfinv:  tensor(4.2339e-14)
Largest difference
x:  -0.9999980330467224, y_cpu: -3.363569736480713, y_mps: -3.3635685443878174 , diff = -1.1920928955078125e-06

Fixes https://github.com/pytorch/pytorch/issues/86808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101507
Approved by: https://github.com/kulinseth
2023-07-23 01:36:43 +00:00
Jane Xu
803d42e457 add lerp cpu support for half (#105607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105607
Approved by: https://github.com/albanD
2023-07-21 20:29:05 +00:00
Andrey Talman
c6653b65d8 Back out "Make adding buffers more like adding parameters (#104069)" (#105581)
Summary:
D47537831 is breaking pyper tests: https://fb.workplace.com/groups/802176577445480/posts/1018902842439518/

with `TypeError: register_buffer() takes 3 positional arguments but 4 were given`

Original commit changeset: d4b4069fbd38

Original Phabricator Diff: D47537831

Test Plan:
```
buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_inline_cvr_infer_pyper_pyper__canary_offline_training-launcher -- --run-harness-in-tupperware --build-fbpkg ads_dper3 --build-fbpkg training_platform
```

Reviewed By: atalman

Differential Revision: D47600140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105581
Approved by: https://github.com/mikaylagawarecki
2023-07-20 03:39:53 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Peter Stefek
d2c24eca8a Fix mps unary op issue on non densely stored tensors (#105512)
This PR fixes a bug where non-densely stored tensors were not converted to dense tensors of the correct scalar type in the MPS `unary_op` helper function

Fixes https://github.com/pytorch/pytorch/issues/105284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105512
Approved by: https://github.com/malfet
2023-07-19 03:56:38 +00:00
Nikita Shulga
8cd94e1eab [MPS] Add lerp implementation (#105470)
lerp.Scalar fits very well into binary op template
Add a very naive implementation for `lerp.Tensor` as `add_out(self, weights.mul(end.sub(self)))`

Enable `lerp` testing in `test_mps`

Fixes https://github.com/pytorch/pytorch/issues/105382
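
The naive decomposition mentioned above can be checked directly against `torch.lerp` (device-agnostic sketch):

```python
import torch

start = torch.randn(5)
end = torch.randn(5)
weight = torch.rand(5)

# self + weight * (end - self), i.e. add_out(self, weight.mul(end.sub(self)))
naive = start.add(weight.mul(end.sub(start)))
assert torch.allclose(torch.lerp(start, end, weight), naive)
```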

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105470
Approved by: https://github.com/albanD
2023-07-18 20:01:04 +00:00
ekamiti
32d422f335 Make adding buffers more like adding parameters (#104069)
Add semantics for creating a buffer object similar to those for creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, as it just leads to `register_buffer` being called. The addition of this new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.

Fixes #35735
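
A hypothetical usage sketch of the `Buffer` type described above (the name and the `persistent` keyword follow the description; treat this as illustrative rather than the final API):

```python
import torch
import torch.nn as nn

class RunningStats(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(4))        # learned parameter
        self.running_mean = nn.Buffer(torch.zeros(4))    # buffer: state, not learned
        # behaves like: self.register_buffer("running_mean", torch.zeros(4))
        self.tmp = nn.Buffer(torch.zeros(4), persistent=False)  # excluded from state_dict

m = RunningStats()
print(list(m.named_buffers()))
print("tmp" in m.state_dict())  # False for the non-persistent buffer
```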

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:59:05 +00:00
David Radley
17250976f3 correct empty tensor mps all operation (#105218)
Fixes #104694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105218
Approved by: https://github.com/ezyang, https://github.com/kulinseth
2023-07-14 17:42:54 +00:00
albanD
08cbfb2a58 Avoid tensor creation and use scalar overload (#104264)
I would expect this preserves the behavior but there might be weird edge cases?
@mruberry might know?

The aim is to fix https://github.com/pytorch/pytorch/pull/104254 (and make `1 ** t` capturable via cudagraph)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104264
Approved by: https://github.com/zou3519
2023-07-12 18:11:27 +00:00
Nikita Shulga
5e4ee15e85 [MPS] Fix unique flatten logic (#104938)
The tensor must be flattened if `dim` is None, before checking whether or not the `dim` dimension is already None

Fixes https://github.com/pytorch/pytorch/issues/104879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104938
Approved by: https://github.com/albanD
2023-07-11 19:55:56 +00:00
soulitzer
91dcc3b272 Fix activation checkpoint for mps (#104787)
Fixes https://github.com/pytorch/pytorch/issues/104478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104787
Approved by: https://github.com/albanD
2023-07-08 14:57:05 +00:00
Jerry Zhang
611febf6cf [quant] Support integer implementations for max_pool2d (#104225)
Summary:
This is needed for representing quantized model in pt2 export quantization flow

Test Plan:
tested by opinfo, python test/test_ops.py

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104225
Approved by: https://github.com/kimishpatel
2023-07-05 23:54:07 +00:00
Nikita Shulga
01e6d64dd2 [MPS] Fix unary ops over sparse-mapped tensors (#100765)
If the input tensor is backed by a sparse view, create a dense copy before running the unary op; otherwise the op will be applied to the wrong elements.
Introduce `is_dense_in_storage`, which returns true if the tensor/view is mapped to a dense area in the tensor storage.
Add unit test to validate the fix.
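
A user-level sketch of the kind of non-dense view the fix targets (assumes an MPS device; the strided slice means the view does not map to a dense region of storage):

```python
import torch

base = torch.arange(12.0, device="mps").reshape(3, 4)
view = base[:, ::2]   # strided view: every other column, not dense in storage
out = view.exp()      # unary op must densify the view before applying the kernel
assert torch.allclose(out.cpu(), base.cpu()[:, ::2].exp())
```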

Fixes https://github.com/pytorch/pytorch/issues/98074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100765
Approved by: https://github.com/albanD
2023-07-05 23:17:43 +00:00
Denis Vieriu
28720ad585 Fix argmax and argmin clamp value on MPS (#104374)
Replace the `LLONG_MAX` clamp value with the largest integer value that can be stored in a double. `constantWithScalar` takes a `double` value as input, and `LLONG_MAX` does not fit in a double, resulting in failures on x86.

Fixes https://github.com/pytorch/pytorch/issues/98191, https://github.com/pytorch/pytorch/issues/92311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104374
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-06-30 18:11:49 +00:00
cyy
54cb61f7d9 enable ASAN on some tests (#103647)
Enable more tests on ASAN; meanwhile we disable float-divide-by-zero and float-cast-overflow, both of which are also disabled by default in the latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
 but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
2023-06-28 02:17:14 +00:00
magic-akari
e56cdfd74b [MPS] Handle deserialization more permissively (#98834)
MPS deserialization should handle `mps:0`.
It can be generated by code like the following:

```python
torch.rand(size=(3, 4)).to("mps")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98834
Approved by: https://github.com/kulinseth, https://github.com/kit1980, https://github.com/malfet
2023-06-15 15:51:03 +00:00
Pearu Peterson
45401ef745 Enable float16 and complex32 support for sparse CSR elementwise multiplication operation. (#100394)
As in the title. In addition, the PR adds float16 addcmul support for the CPU device.
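
A rough sketch of what this enables (assuming a build where float16 CSR elementwise multiplication and float16 CPU addcmul are supported):

```python
import torch

dense = torch.tensor([[1.0, 0.0], [0.0, 2.0]], dtype=torch.float16)
a = dense.to_sparse_csr()
b = (dense * 3).to_sparse_csr()

c = a * b              # elementwise multiply of two float16 CSR tensors
print(c.to_dense())    # [[3., 0.], [0., 12.]]

# float16 addcmul on CPU, also covered by this PR:
print(torch.addcmul(dense, dense, dense, value=0.5))
```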

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100394
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-06-14 14:42:39 +00:00
Li-Huai (Allan) Lin
cce58a43c9 [MPS] Fix softplus with f16 input (#101948)
Fixes #101946
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101948
Approved by: https://github.com/malfet
2023-05-31 00:40:10 +00:00
ecao
3f4fee735a add Half support for logsigmoid, threshold, elu, gelu, hardtanh, hardsigmoid, hardswish, hardshrink, softshrink, leakyrelu, softplus, glu, silu, mish, and prelu on CPU (#98745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98745
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/ngimel
2023-05-27 16:20:21 +00:00
Li-Huai (Allan) Lin
0db704d240 [OpInfo] Add multi_head_attention_forward (#100153)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 8f8d620</samp>

This pull request improves the testing of the `nn.functional.multi_head_attention_forward` function by adding it to the `OpInfo` framework, adjusting the tolerance and skipping criteria for some test cases, and restricting the dtype for the `MetaProgrammingSystem` tests. These changes aim to address the randomness and numerical precision issues of the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100153
Approved by: https://github.com/drisspg
2023-05-26 01:58:17 +00:00
Denis Vieriu
de7ec2ddd7 [MPS] Allow saved models to be loaded directly to MPS through torch.jit.load (#102204)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 94eed69</samp>

This pull request adds support for serializing and deserializing tensors on the `mps` device using JIT. It includes a test case in `test/test_mps.py` and a device handling logic in `torch/csrc/jit/serialization/unpickler.cpp`.

Fixes https://github.com/pytorch/pytorch/issues/88820, https://github.com/pytorch/pytorch/issues/87504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102204
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-25 23:32:29 +00:00
Li-Huai (Allan) Lin
02a7318a5b [MPS] Add aminmax op (#101691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101691
Approved by: https://github.com/malfet
2023-05-23 18:01:34 +00:00
Li-Huai (Allan) Lin
330c907301 [MPS] Fix embedding cache key (#101857)
Fixes #101198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101857
Approved by: https://github.com/kulinseth
2023-05-21 06:11:25 +00:00
Aaron Gokaslan
3e2ea32dab [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
Removes useless try statements and unreachable code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874
Approved by: https://github.com/malfet
2023-05-19 17:30:52 +00:00
Khushi
1aaf0396eb [reland][opinfo] empty_strided (#101782)
Follows #100223

Previous PR: #100890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101782
Approved by: https://github.com/ezyang
2023-05-19 03:06:29 +00:00
PyTorch MergeBot
dfac4364c4 Revert "[opinfo] empty_strided (#100890)"
This reverts commit 01c7106580.

Reverted https://github.com/pytorch/pytorch/pull/100890 on behalf of https://github.com/PaliC due to broke test_ops.py slow test ([comment](https://github.com/pytorch/pytorch/pull/100890#issuecomment-1551903975))
2023-05-17 19:00:15 +00:00
Li-Huai (Allan) Lin
bb3558961f [MPS] Add histogram ops (#96652)
Adds `torch.histc`, `torch.histogram`, `torch.histogramdd`
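
Quick usage sketch of the three ops (assumes an MPS-enabled build; bin counts are arbitrary):

```python
import torch

x = torch.randn(1000, device="mps")
print(torch.histc(x, bins=10, min=-3.0, max=3.0).cpu())

hist, edges = torch.histogram(x, bins=10)
print(hist.cpu(), edges.cpu())

pts = torch.rand(1000, 2, device="mps")
hist2d, edges2d = torch.histogramdd(pts, bins=[5, 5])
print(hist2d.shape)  # (5, 5)
```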

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96652
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-17 01:25:43 +00:00
Khushi
01c7106580 [opinfo] empty_strided (#100890)
Follows: #100223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100890
Approved by: https://github.com/ezyang
2023-05-15 23:39:39 +00:00
Nikita Shulga
9e089db32e [MPS] Enable arange for int8 and uint8 dtypes (#101303)
Not sure why it was not enabled previously.
Sort types in `AT_DISPATCH_MPS_TYPES` by group (floats first, then integers) and size.
Tested implicitly in `test_bernoulli`.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 80c7ed7</samp>

> _`Char` and `Byte` types_
> _MPS can dispatch them now_
> _Winter of tensors_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101303
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/kulinseth
2023-05-13 01:19:08 +00:00
Ramin Azarmehr
0be53d83fc [MPS] Add support for MPSProfiler Python bindings (#101002)
- Added torch.mps.profiler.[start() and stop()] APIs with RST documentation
- Added test case in test_mps
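
Rough usage sketch of the new bindings (assumes an MPS device; the work wrapped between `start()` and `stop()` is arbitrary):

```python
import torch

torch.mps.profiler.start()
a = torch.randn(1024, 1024, device="mps")
b = torch.randn(1024, 1024, device="mps")
c = a @ b
torch.mps.synchronize()   # make sure the traced work has actually executed
torch.mps.profiler.stop()
```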
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101002
Approved by: https://github.com/malfet
2023-05-12 21:55:34 +00:00
Sun, Jiayi
d56e1b2f67 add Half support for unary ops on CPU (#98493)
Add Half support for log_sigmoid and some unary ops on CPU, including sinc, acosh, asinh, atanh, digamma, trigamma, rsqrt, acos, asin, atan, ceil, cos, erf, erfc, erfinv, exp, expm1, floor, log, log10, log1p, log2, i0, round, sin, sqrt, tan, tanh, trunc, lgamma.
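
Small sketch exercising a few of the newly supported half-precision ops on CPU (values are arbitrary):

```python
import torch
import torch.nn.functional as F

x = (torch.arange(1, 9, dtype=torch.float32) / 4).to(torch.float16)  # CPU half tensor
print(torch.log(x))
print(torch.erf(x))
print(torch.rsqrt(x))
print(F.logsigmoid(x))   # log_sigmoid, also covered by this PR
```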

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98493
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/ngimel
2023-05-12 04:52:34 +00:00
Nikita Shulga
b7bf953bbc [MPS] Fix bernoulli for int types (#100946)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 069fd23</samp>

This pull request enhances the MPS implementation of random operations in `Distributions.mm` and adds more dtype tests for the bernoulli distribution in `test_mps.py`. This improves the performance, correctness, and usability of the MPS backend for PyTorch.

Fixes https://github.com/pytorch/pytorch/issues/100717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100946
Approved by: https://github.com/kulinseth
2023-05-11 23:52:38 +00:00
Nikita Shulga
87084643e5 [CI][MPS] Actually make grid_sampler_2d available (#101108)
In CI, an older macOS SDK can be used to compile the binary, so add a guard for the availability of the `MPSGraphResizeNearestRoundingModeRoundToEven` enum value.
MPS feature availability checks are deliberately done at runtime (by using `is_macos_13_or_newer` and forward-declaring methods in `MPSGraphVenturaOps.h`) rather than at compile time (by using `#ifdef`s).

Modify the error message and XFAIL condition in `test_mps.py` to fail the test due to the missing conditional on macOS 13.2 or newer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101108
Approved by: https://github.com/kulinseth
2023-05-11 10:35:09 +00:00
Khushi
51fe53e619 [opinfo] item (#100313)
Follows #100223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100313
Approved by: https://github.com/ezyang
2023-05-10 11:32:45 +00:00
Ramin Azarmehr
cecfcf1e17 [MPS] Handle MPS failures of test_modules.py in common_modules.py (#95334)
- Also cleaned up the skipMPS code in `test_modules.py`.
- Added `skipMPS` for unsupported or failing tests on the MPS backend in common_modules.py.
   (We'll remove `skipMPS` from those tests once a fix is available for them.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95334
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-05-09 03:55:16 +00:00
Li-Huai (Allan) Lin
3b6a7f4d51 [MPS] Fix index_put with deterministic algorithm enabled (#97660)
Prevent using parallel computing when deterministic algorithm is set.

Fixes #97574

Benchmark:
```
[--------------- index_put_ Deterministic Algorithm Enabled ---------------]
                                                              |  cpu  |  mps
1 threads: -----------------------------------------------------------------
      Dtype: torch.float32 Features: 1024; Num Indices: 512   |   37  |   49
      Dtype: torch.float32 Features: 1024; Num Indices: 1024  |   54  |   50
      Dtype: torch.float32 Features: 1024; Num Indices: 2048  |   86  |   50
      Dtype: torch.float32 Features: 1024; Num Indices: 4096  |  150  |   49

Times are in microseconds (us).

[-------------- index_put_ Deterministic Algorithm Disabled ---------------]
                                                              |  cpu  |  mps
1 threads: -----------------------------------------------------------------
      DType: torch.float32 Features: 1024; Num Indices: 512   |   37  |   49
      DType: torch.float32 Features: 1024; Num Indices: 1024  |   53  |   49
      DType: torch.float32 Features: 1024; Num Indices: 2048  |   86  |   49
      DType: torch.float32 Features: 1024; Num Indices: 4096  |  147  |   50

Times are in microseconds (us).
```
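
A small sketch of the behavior being made deterministic (assumes an MPS device; with repeated indices and `accumulate=True`, the serial path guarantees a reproducible result):

```python
import torch

torch.use_deterministic_algorithms(True)

x = torch.zeros(8, device="mps")
idx = torch.tensor([1, 3, 3, 5], device="mps")
vals = torch.tensor([1.0, 2.0, 4.0, 8.0], device="mps")
x.index_put_((idx,), vals, accumulate=True)
print(x.cpu())  # index 3 accumulates 2.0 + 4.0; same result on every run
```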

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at ebf2ff3</samp>

Added a deterministic version of `index_put` for MPS tensors that runs on a single thread and can be enabled by a global context flag. Refactored the existing `index_put` function and the kernel selection logic to support both parallel and serial modes. Added a test function to verify the deterministic behavior of `index_put` under different conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97660
Approved by: https://github.com/kulinseth
2023-05-08 00:57:29 +00:00
Kulin Seth
e20c94bda9 [MPS] Add the test for 5D in test_mps which is skipped. (#99271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99271
Approved by: https://github.com/DenisVieriu97
2023-05-05 22:57:06 +00:00
Li-Huai (Allan) Lin
13da6585b6 [MPS] Skip all empty ops tests (#100368)
Fixes #100175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100368
Approved by: https://github.com/kulinseth
2023-05-02 00:43:58 +00:00
Li-Huai (Allan) Lin
a50fb50c51 [MPS] Fix exception regex not compared (#100367)
Previously, when using `self.assertRaisesRegex` to test a raised exception and its regex, the regex wasn't actually compared because mps was not in `NATIVE_DEVICES`. This PR fixes that by enabling exception regex comparisons for the mps device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100367
Approved by: https://github.com/albanD
2023-05-02 00:43:58 +00:00
Nikita Shulga
2442858f52 [MPS] Fix layer_norm_backward_mps key (#100295)
Followup after https://github.com/pytorch/pytorch/pull/98794
See report in https://github.com/pytorch/pytorch/issues/98602#issuecomment-1527312211 and reproducer in https://github.com/pytorch/pytorch/issues/98602#issuecomment-1528214175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100295
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb
2023-04-29 03:37:35 +00:00
Li-Huai (Allan) Lin
81978120ec [MPS] Fix trace exceptions not raised for error inputs (#99239)
Also rename `trace_mps_out` to `trace_mps` as it is not an out version.

Remove `index_add` from the XFAILLIST as it seems to be working as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99239
Approved by: https://github.com/kulinseth
2023-04-26 14:41:50 +00:00
Li-Huai (Allan) Lin
f4a37c9a5d [MPS] Fix max_pool2d exceptions not raised for error inputs (#99238)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99238
Approved by: https://github.com/kulinseth
2023-04-26 14:41:50 +00:00
Li-Huai (Allan) Lin
f4cf744380 [MPS] Fix gelu exceptions not raised for error inputs (#99237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99237
Approved by: https://github.com/kulinseth
2023-04-26 14:41:46 +00:00
Li-Huai (Allan) Lin
1fcf40da63 [MPS] Add linear inputs check (#99228)
Fixes #98211

https://github.com/pytorch/pytorch/issues/98211#issuecomment-1496005668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99228
Approved by: https://github.com/kit1980
2023-04-26 04:44:23 +00:00
Denis Vieriu
89baa1a74c [MPS] Add support for linalg.vector_norm (#99811)
Summary of changes:

- Add support for linalg.vector_norm
- Fix zero norm; the correct formula is: sum(x != 0)
- Add additional tests in test_mps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99811
Approved by: https://github.com/kulinseth
2023-04-26 01:34:29 +00:00
Justin Chu
79c9e82e27 Fix flake8 lint errors reported by ruff - take 2 (#99798)
Replaces #99784. This PR is pure autofix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99798
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-04-23 23:09:51 +00:00
BJ Hargrave
dc52ba2906 Fix test_mps for macos 13.3 (#98739)
The expected dtype is changed from torch.int64 to torch.int32 prior to macOS 13.3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98739
Approved by: https://github.com/kulinseth
2023-04-12 19:23:08 +00:00
Li-Huai (Allan) Lin
be8a4eb8e3 [MPS] Add index_fill op (#98694)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98694
Approved by: https://github.com/kulinseth
2023-04-12 18:13:33 +00:00
Li-Huai (Allan) Lin
71aea7f56e [MPS] Add error inputs check (#98167)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98167
Approved by: https://github.com/kulinseth
2023-04-12 17:19:13 +00:00
Nikita Shulga
583193e1d9 [MPS] Fix batch_norm_backwards key (#98794)
One needs different graphs for batch_norm_backwards depending on whether or not gradients are required for some of the params

Fixes https://github.com/pytorch/pytorch/issues/98602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98794
Approved by: https://github.com/kulinseth
2023-04-11 17:23:36 +00:00
Guang Yang
c377a8590b Add nonzero_static() op to pytorch to unblock export (#97417)
Summary: Add new experimental python op (`torch.nonzero_static`) for export. There is NO cuda impl included in this PR

Example:

Say input tensor is `x = torch.tensor([[1, 0], [3, 2]])`

Calling regular `nonzero()` on x will give you the tensor `tensor([[0, 0], [1, 0], [1, 1]])`
Calling `nonzero_static(x, size=4)` on x will give you the tensor `tensor([[0, 0], [1, 0], [1, 1], [fill_value, fill_value]])` (padded)
Calling `nonzero_static(x, size=2)` on x will give you the tensor `tensor([[0, 0], [1, 0]])` (truncated)

Test Plan:
**Unit Tests**
```
buck test @mode/dev-nosan //caffe2/test:test_dynamo -- 'caffe2/test:test_dynamo - test_export.py::ExportTests::test_export_with_nonzero_static' -- 'caffe2/test:test_dynamo - test_misc.py::MiscTests::test_nonzero_static'
```

**PT2 Export with `nonzero_static()`**
Example of `GraphModule` in the exported graph
```
def forward(self, x):
    arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
    nonzero_static_default = torch.ops.aten.nonzero_static.default(arg0, size = 4);  arg0 = None
    return pytree.tree_unflatten([nonzero_static_default], self._out_spec)
```

Differential Revision: D44324808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97417
Approved by: https://github.com/ezyang
2023-04-11 05:13:36 +00:00
Nikita Shulga
29cde00701 [MPS] Add random_ overload (#98333)
That simply calls `torch.random_(from=0, to=None)`

Also, fix the optional upper-bound calculation for all `dtypes` but int64: as one can see from https://pytorch.org/docs/stable/generated/torch.Tensor.random_.html, the `from` boundary is inclusive but `to` is exclusive, i.e. if `to` is omitted for the `torch.int8` dtype it should be set to `128`, and to `2` for torch.bool.

Add test for `torch.random_`

Fixes https://github.com/pytorch/pytorch/issues/98118
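
Sketch of the overload and the dtype-dependent default upper bound described above (assumes an MPS device):

```python
import torch

t = torch.empty(8, dtype=torch.int8, device="mps")
t.random_()      # same as random_(from=0, to=None); `to` defaults to 128 for int8
b = torch.empty(8, dtype=torch.bool, device="mps")
b.random_()      # `to` defaults to 2 for bool
print(t.cpu(), b.cpu())
```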

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98333
Approved by: https://github.com/kulinseth
2023-04-05 21:24:45 +00:00
Li-Huai (Allan) Lin
db8abde9b6 [MPS] Enable conditional indexing tests (#97871)
The tests seem to be working now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97871
Approved by: https://github.com/kulinseth
2023-04-01 16:15:08 +00:00
Li-Huai (Allan) Lin
7776653a0c Add linear gradgrad (#97151)
Fixes #92206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97151
Approved by: https://github.com/albanD
2023-03-30 07:25:02 +00:00
Philip Meier
2f6c18d1a2 improve memory footprint of torch.testing.assert_close (#96131)
Redo of #90172 out of stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96131
Approved by: https://github.com/pearu, https://github.com/mruberry
2023-03-29 23:49:56 +00:00
Li-Huai (Allan) Lin
4afef85dda [MPS] Fix index_select_scalar test (#97773)
#96408 introduced a check that prevents the index used with a scalar input from being non-singleton.

Fixes #94162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97773
Approved by: https://github.com/kulinseth
2023-03-28 19:23:59 +00:00
Li-Huai (Allan) Lin
100641aadf [MPS] Fix torch.eye unsupported bool constant on macOS 12 (#97027)
Fixes #91620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97027
Approved by: https://github.com/kulinseth
2023-03-20 18:08:36 +00:00
Ramin Azarmehr
50beab2978 [MPS] Fix the failure with ReplicatePad3D (#96988)
- Only ReflectPad needs the torch checks for input arguments, not ReplicatePad
- Added a test case
- The failure was originally found in test_modules with test `test_forward_nn_ReplicationPad3d_mps_float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96988
Approved by: https://github.com/DenisVieriu97
2023-03-17 01:41:12 +00:00
alexdremov
62eb7a2e97 [MPS] LSTM grad_y missing fix (#96601)
Fixes #96416
Added tests that do not use the LSTM output, similarly to the issue.

Seems like this fix once again introduces backward incompatibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96601
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-03-16 15:53:56 +00:00
Li-Huai (Allan) Lin
c95bcb6694 [MPS] Fix flip where no dims need to be flipped (#96605)
Fixes #96558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96605
Approved by: https://github.com/kulinseth
2023-03-14 00:34:30 +00:00
Li-Huai (Allan) Lin
a87f3f612e [MPS] Fall back multi-layer LSTM on macOS 12 (#90909)
The native implementation of LSTM has been fixed on macOS 13.

On macOS 12, the multi-layer LSTM still has a numerical correctness issue that cannot be resolved on OS's side.

Thus, we fall back the multi-layer LSTM on macOS 12 to LSTMCell iteration. It might have performance impact but will make LSTM on macOS 12 fully usable.

Fixes: #90421
Issues related: #80306, #83144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90909
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-03-10 03:10:49 +00:00
Nikita Shulga
075a49442d [MPS] Allow float16 input to float32 LayerNorm (#96430)
Only for forward pass

Subset of https://github.com/pytorch/pytorch/pull/96208

Create constant with scalar using `input_mps_dtype` and use `reciprocalWithTensor` instead of `divisionWithPrimaryTensor:1.0 secondaryTensor:`

Fixes https://github.com/pytorch/pytorch/issues/96113
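
User-level sketch of the mixed-dtype forward pass this enables (assumes an MPS device; backward is not covered by this change):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 8, device="mps", dtype=torch.float16)
weight = torch.ones(8, device="mps", dtype=torch.float32)
bias = torch.zeros(8, device="mps", dtype=torch.float32)

y = F.layer_norm(x, normalized_shape=(8,), weight=weight, bias=bias)
print(y.dtype, y.shape)
```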

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96430
Approved by: https://github.com/kulinseth
2023-03-09 22:09:10 +00:00
Kulin Seth
2bb022e902 [MPS] Adding xfaillist with all categories of failures. (#96176)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96176
Approved by: https://github.com/malfet
2023-03-08 08:41:21 +00:00
Catherine Lee
eea0733045 Reduce pytest blocklist (#96016)
`TestCase = object` or variations of it get switched to `TestCase = NoTest`.

unittest collects test based on subclassing unittest.TestCase, so setting TestCase = object removes it from unittest test collection.  pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-07 18:30:27 +00:00
Li-Huai (Allan) Lin
2f66b57a7a [MPS] Fix in-place add and sub with alpha == 0.0 (#96184)
Apart from fixing the below issue, this PR integrates the test for `sub` into the test for `add` as they are implemented using the same template.

Fixes #96065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96184
Approved by: https://github.com/kulinseth
2023-03-07 17:17:53 +00:00
Nikita Shulga
769cc8a614 [MPS] Add type promotion to torch.addcmul (#96164)
Fixes crash while running something like `python -c "import torch;x=torch.rand(3, 3, dtype=torch.float16, device='mps');y=x.addcmul(torch.ones(3, device='mps'), torch.ones(3, device='mps'));print(y)"`

Modify `castMPSTensor` to become a no-op if cast is not needed

Define `common_dtype` as `c10::promoType` between self, tensor1 and
tensor2. Cast to any output type.

Add mixed-types test to `TestMPS.test_addcmul`, though it does not cover
all the permutations

Discovered while looking at https://github.com/pytorch/pytorch/issues/96113
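
The crash repro above, written out as a runnable sketch; with the promotion in place the float16 input and float32 operands are handled consistently (assumes an MPS device):

```python
import torch

x = torch.rand(3, 3, dtype=torch.float16, device="mps")
y = x.addcmul(torch.ones(3, device="mps"), torch.ones(3, device="mps"))
print(y.dtype, y.cpu())   # x + 1.0 elementwise, no crash
```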

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96164
Approved by: https://github.com/kulinseth
2023-03-07 04:19:30 +00:00
alexdremov
78da315afd [MPS] Fix bidirectional LSTM & small one-direction LSTM fix (#95563)
Fixes #94754

With this PR I hope to finish my breathtaking journey of fixing MPS LSTM.

Here, I enable `bidirectional` on MPS. Also, I've noticed that the cache key did not account for all parameters, so there could have been problems with a one-directional LSTM when created without bias or dropout and then with one of them.
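
Minimal usage sketch (assumes an MPS device; sizes are arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1,
               bidirectional=True, batch_first=True).to("mps")
x = torch.randn(4, 10, 8, device="mps")
out, (h, c) = lstm(x)
print(out.shape)  # (4, 10, 32): hidden size doubled by the two directions
print(h.shape)    # (2, 4, 16): num_layers * num_directions
```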

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95563
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/malfet
2023-03-05 00:19:54 +00:00
Nikita Shulga
436993d52b [MPS] Error on unsupported types (#95982)
I.e., attempt to create tensors of all possible types and make sure that a structured error is raised for non-MPS types.

Also, rename `test_resize_as_all_dtypes_and_devices` to `test_resize_as_mps_dtypes` and `test_resize_all_dtypes_and_devices` to `test_resize_mps_dtypes`, and run both tests for all MPS dtypes (rather than just bool, float16 and bfloat16 as they were running before)

Fixes https://github.com/pytorch/pytorch/issues/95976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95982
Approved by: https://github.com/kulinseth
2023-03-04 01:29:07 +00:00
Denis Vieriu
304a95435d [MPS] Disallow reshape in slice (#95905)
Disallow reshapes for arrayViews.
Current code allows a base shape of `[2, 4, 256]` to be sliced into `[4, 1, 256]` (view's shape) - which is not possible. Slicing a smaller dimension into a bigger one will always error out.

Fixes https://github.com/pytorch/pytorch/issues/95883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95905
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-03-03 08:08:34 +00:00
Denis Vieriu
d0dd898943 [MPS] Remove remaining casts from 13.3 (#95870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95870
Approved by: https://github.com/kulinseth
2023-03-02 12:44:59 +00:00
Denis Vieriu
4d3352ed90 [MPS] Remove casts from reduction/cumsum/sort ops starting with macOS 13.3 (#95817)
MPS in macOS13.3 has added support for int64 in reduction ops / cumsum / sort / argsort. This change removes the hard-coded casts and error messages prior macOS 13.3, allowing the op to run natively with int64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95817
Approved by: https://github.com/kulinseth
2023-03-02 00:26:24 +00:00
Kulin Seth
5d9d8c6154 [MPS] Add fixes for div with floor and raise error for div_trunc (#95769)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95769
Approved by: https://github.com/DenisVieriu97
2023-03-01 20:52:28 +00:00
Denis Vieriu
e5a959a2d4 [MPS] Fix views with 3 or more sliced dimensions (#95762)
Fixes https://github.com/pytorch/pytorch/issues/95482
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95762
Approved by: https://github.com/razarmehr
2023-03-01 16:16:49 +00:00
Denis Vieriu
ed1957dc19 [MPS] Add support for masked_scatter (#95743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95743
Approved by: https://github.com/kulinseth
2023-03-01 01:36:36 +00:00
Li-Huai (Allan) Lin
f33180fb7f [MPS] Add pow.Scalar (#95201)
1. Adds `pow.Scalar`.
2. Modifies testing `atol` and `rtol` to get pow output match tests pass.
3. Xfails numerically incorrect dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95201
Approved by: https://github.com/kulinseth
2023-02-28 16:11:15 +00:00
Li-Huai (Allan) Lin
9e16f1281f [MPS] Add copysign op. (#95552)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95552
Approved by: https://github.com/kulinseth
2023-02-28 06:49:46 +00:00
Li-Huai (Allan) Lin
b7c2a65139 [MPS] Fix type casting copy with storage offset (#95573)
This PR handles the case where the `dst` tensor of a type-casting copy has a storage offset, by creating a temporary buffer to store the results and then copying them back to the `dst` with the offset added.

Fixes #95417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95573
Approved by: https://github.com/kulinseth
2023-02-28 05:24:31 +00:00
Li-Huai (Allan) Lin
4930ae7f82 [MPS] Add roll op (#95168)
Reuse the CPU implementation here, as currently there is no native roll implementation in the MPS API (if there is one, please let me know).

Compared to falling back to cpu using `PYTORCH_ENABLE_MPS_FALLBACK=1`, this way we keep tensors on MPS.

Did a small benchmark:

```python
for num in [10, 100, 1000, 10000]:
    for shft in [1, 5]:
        sz = num * num
        x = torch.arange(sz, device="cpu").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        cpu_e = time.time() - s
        x = torch.arange(sz, device="mps").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        mps_e = time.time() - s
        print(f"size: ({num}, {num}) shft: {shft} cpu: {cpu_e} mps: {mps_e}")
```

```
size: (10, 10) shft: 1 cpu: 0.00015163421630859375 mps: 0.003078937530517578
size: (10, 10) shft: 5 cpu: 6.794929504394531e-05 mps: 0.0014979839324951172
size: (100, 100) shft: 1 cpu: 0.0001621246337890625 mps: 0.0016200542449951172
size: (100, 100) shft: 5 cpu: 0.00016379356384277344 mps: 0.00154876708984375
size: (1000, 1000) shft: 1 cpu: 0.0022068023681640625 mps: 0.0017690658569335938
size: (1000, 1000) shft: 5 cpu: 0.009071111679077148 mps: 0.0020020008087158203
size: (10000, 10000) shft: 1 cpu: 0.16785407066345215 mps: 0.011695146560668945
size: (10000, 10000) shft: 5 cpu: 0.1160881519317627 mps: 0.011452913284301758
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95168
Approved by: https://github.com/albanD
2023-02-27 18:31:17 +00:00