pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Isalia20	b75afa2e2e	[MPS] cholesky implementation (#145701 ) Requested in #77764 Closed #144193 due to a lot of conflicts when rebasing Pull Request resolved: https://github.com/pytorch/pytorch/pull/145701 Approved by: https://github.com/malfet	2025-01-27 01:53:03 +00:00
Nikita Shulga	3cf7874ebe	[MPS][BE] Implement bilineard2d as shader (#145581 ) That significantly improves performance and addresses correctness problem(to an extend permitted by reducing precision of scale factor computation to float32). uint8 scaling algorithm mimics CPU/Pillow implementation `569b785371/src/libImaging/Resample.c (L306-L309)` I.e. using fixed precision integral arithmetic and rounding results of horizontal interpolation back to integers before performing vertical one, which results in technically less accurate results. But even with those changes, `atol`, `rtol` must be tweaked to `1, 0` when scale factor is `1/3` or `2/3` because of the difference of representation of those values as floats and doubles. Changes in the performance could be measured using the following script ```python import torch import time import subprocess def benchmark(device, dtype): # Create example inputs x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype) sf = .5 # Check output y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear") z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="bilinear") outputs_match = torch.allclose(y.cpu(), z) if not outputs_match: atol = (y.cpu() - z).abs().max() rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max() print(f"atol={atol} rtol={rtol}") # Measure time manually start_time = time.time() * 1000 for _ in range(1000): y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear") torch.mps.synchronize end_time = time.time() * 1000 manual_delta = (end_time - start_time) average_time = f"{manual_delta:6.1f}" return "True " if outputs_match else "False", average_time outputs_match_list = [] average_time_list = [] for device in ["mps", "cpu"]: for dtype in [torch.float32, torch.float16, torch.bfloat16, torch.uint8]: outputs_match, average_time = benchmark(device, dtype) outputs_match_list.append(str(outputs_match)) average_time_list.append(average_time) brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip() print(f"\nBenchmarking Results (collected on {brand_string}):") print("-"40) print("Device : MPS \| CPU") print("Dtype : FP32 \| FP16 \| BF16 \| U8 \| FP32 \| FP16 \| BF16 \| U8") print(f"Outputs Match : ", " \| ".join(outputs_match_list)) print(f"Average Time (us) :", " \|".join(average_time_list)) ``` Benchmark results before ``` Benchmarking Results (collected on Apple M4 Pro): ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| U8 \| FP32 \| FP16 \| BF16 \| U8 Outputs Match : True \| True \| True \| False \| True \| True \| True \| True Average Time (us) : 277.3 \| 197.2 \| 188.0 \| 163.5 \| 302.8 \| 248.1 \| 308.7 \| 650.9 ``` After(almost 100x* perf gain): ``` Benchmarking Results (collected on Apple M4 Pro): ---------------------------------------- Device : MPS \| CPU Dtype : FP32 \| FP16 \| BF16 \| U8 \| FP32 \| FP16 \| BF16 \| U8 Outputs Match : True \| True \| True \| True \| True \| True \| True \| True Average Time (us) : 1.7 \| 1.5 \| 1.7 \| 1.5 \| 296.5 \| 236.0 \| 310.8 \| 642.6 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145581 Approved by: https://github.com/Skylion007 ghstack dependencies: #145578	2025-01-25 21:09:46 +00:00
soulitzer	3a3e2cf90a	Remove det_singular OpInfo (#145533 ) Fixes https://github.com/pytorch/pytorch/issues/93045 https://github.com/pytorch/pytorch/issues/93044 From previous discussion https://github.com/pytorch/pytorch/issues/93045#issuecomment-1477674083 the resolution is that we're okay with removing this. Some older attempts: - https://github.com/pytorch/pytorch/pull/102581 - https://github.com/pytorch/pytorch/pull/109249 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145533 Approved by: https://github.com/lezcano, https://github.com/malfet ghstack dependencies: #145520, #145531	2025-01-25 00:58:03 +00:00
Nikita Shulga	dc9b77cc55	[MPS] Support includes in metal objects (#145087 ) Useful for code reuse for Metal shader build both for eager mode and MPSInductor, but it requires one to implement `_cpp_embed_headers` tool that, as name suggests, would preprocess and embeds the for shader to be used in dynamic compilation. Test using: - `TestMetalLibrary.test_metal_include` - Moving `i0`/`i1` implementation to `c10/util/metal_special_math.h` and call it from `SpecialOps.metal` shader, which now looks much more compact: ```metal template <typename T, typename Tout = T> void kernel i0(constant T* input, device Tout* output, uint index [[thread_position_in_grid]]) { output[index] = c10::i0(static_cast<Tout>(input[index])); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087 Approved by: https://github.com/dcci ghstack dependencies: #145023	2025-01-18 05:35:22 +00:00
Tom Ritchford	46fbd63405	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2025-01-17 18:21:22 +00:00
PyTorch MergeBot	46b92c025d	Revert "Cholesky mps implementation (#144193 )" This reverts commit `727ae13318`. Reverted https://github.com/pytorch/pytorch/pull/144193 on behalf of https://github.com/malfet due to Alas, inductor changes broke inductor tests, see `aa4a1ff027/1` ([comment](https://github.com/pytorch/pytorch/pull/144193#issuecomment-2596938163))	2025-01-16 21:37:32 +00:00
Isalia20	727ae13318	Cholesky mps implementation (#144193 ) Requested in #77764 PR is still in draft because it needs some cleanups and optimizations to get to cpu performance the least. Tasks: - [x] Make `upper=True` work, only `upper=False` works now - [x] Code cleanup - [x] Optimizations(Though might need some help on this)(tried my best, maybe there is still some more to squeeze out) - [x] Checks for positive definite input - [x] Support for (*, N, N) input, currently only supports (B, N, N) input - [x] Support other dtypes(float16, bfloat16) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144193 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-16 16:26:46 +00:00
Nikita Shulga	46eeef9130	[MPS][BE] Surface syntax errors shader compilation (#144648 ) Before this change ```python >>> import torch >>> torch.mps._compile_shader('What') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/malfet/miniconda3/envs/py311/lib/python3.11/site-packages/torch/mps/__init__.py", line 157, in _compile_shader return torch._C._mps_compileShader(source) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: Failed to create metal library, error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:1:1: error: unknown type name 'What' What ^ program_source:1:5: error: expected unqualified-id What ^ " UserInfo={NSLocalizedDescription=program_source:1:1: error: unknown type name 'What' What ^ program_source:1:5: error: expected unqualified-id What ^ } ``` After this change ```python >>> import torch >>> torch.mps._compile_shader('What') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/malfet/git/pytorch/pytorch/torch/mps/__init__.py", line 157, in _compile_shader return torch._C._mps_compileShader(source) SyntaxError: program_source:1:1: error: unknown type name 'What' What ^ program_source:1:5: error: expected unqualified-id What ^ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144648 Approved by: https://github.com/Skylion007 ghstack dependencies: #144647	2025-01-13 02:03:19 +00:00
Nikita Shulga	92ddb3d3d3	[MPS] Expose `MPSProfiler::start/stopCapture` to Python (#144561 ) I.e. when `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to invoke wrap the code with `torch.mps.profiler.capture_metal` to generate gputrace for shaders invoked inside the context manager. For example, code below: ```python import torch import os def foo(x): return x[:,::2].sin() + x[:, 1::2].cos() if __name__ == "__main__": os.environ["MTL_CAPTURE_ENABLED"] = "1" x = torch.rand(32, 1024, device="mps") with torch.mps.profiler.metal_capture("compiled_shader"): torch.compile(foo)(x) ``` should capture the execution of a `torch.compile` generated shader <img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561 Approved by: https://github.com/manuelcandales ghstack dependencies: #144559, #144560	2025-01-11 02:05:36 +00:00
Nikita Shulga	e56768f030	[MPS] Fix bitwise shifts for uint8 (#144251 ) Previosly all bitwise operations were aliased to the same type, but this is wrong for shift ops Rather than building an overly complex logic, let's just instantiate using shared `scalarToMetalTypeString` helper function Fixes https://github.com/pytorch/pytorch/issues/144190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144251 Approved by: https://github.com/Skylion007 ghstack dependencies: #144249, #144250	2025-01-06 18:27:16 +00:00
Nikita Shulga	ebeb433e73	[BE] Fix + parametrize `test_min_max_nan_propagation` (#144250 ) - `dtype` was not passed as argument to `torch.rand` before - Condition bfloat16 testing on MacOS14+ Pull Request resolved: https://github.com/pytorch/pytorch/pull/144250 Approved by: https://github.com/Skylion007 ghstack dependencies: #144249	2025-01-06 17:49:41 +00:00
Nikita Shulga	11a0663eeb	[BE] Parametrize `test_min_max` (#144249 ) It's better to have one unit test per dtype rather a combined one Pull Request resolved: https://github.com/pytorch/pytorch/pull/144249 Approved by: https://github.com/Skylion007	2025-01-06 17:49:41 +00:00
Davide Italiano	0dc1e6be19	[mps/BE] Fix linter warning/advice. (#144199 ) Two spaces before an inline comment according to E261. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144199 Approved by: https://github.com/Skylion007, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-04 20:15:41 +00:00
Joona Havukainen	811c714911	Fix nan propagation for minimum() and maximum() in MPS (#144086 ) Fixes #143976 - Moves minimum and maximum operations to use the NaN propagating call into MPSGraph instead of the default one. - Adds test for the NaN propagating case to `test_mps.py`. - Adjusts the inductor metal backend implementation for minimum and maximum to also respect the nan propagation. Additions by @malfet: - Introduce MPSGraph+PyTorchFixups interface following [Customizing existing classes](https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/ProgrammingWithObjectiveC/CustomizingExistingClasses/CustomizingExistingClasses.html) tutorial and implement `minimumWithNaNPropagationAndIntFallbackWithPrimaryTensor:` as `minimumWithNaNPropagationWithPrimaryTensor:` segfaults when called for integral types Pull Request resolved: https://github.com/pytorch/pytorch/pull/144086 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <nshulga@meta.com>	2025-01-04 18:48:24 +00:00
Steven Zeltmann	6f2451c2e9	[MPS] Add `aten::angle` (#143449 ) This adds an MPS backend implementation for `aten::angle` and `aten::angle_out` (mentioned in issue #77764), following the example #78408. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143449 Approved by: https://github.com/malfet	2025-01-04 15:38:40 +00:00
Nikita Shulga	301c457032	[MPS] Fix `nllnd_loss_backward` crash with different dtypes (#144170 ) Otherwise, invoking with torch.half inputs, but float weights will result in ``` (mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<xf32> (mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<xf32> 2025-01-03 14:13:18.747151-0800 python[87772:4027380] /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm, line 975: error 'original module failed verification' /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification' ``` Test plan: `python -mpytest test/inductor/test_torchinductor.py -k test_nll_loss_backward_mps` should not crash Pull Request resolved: https://github.com/pytorch/pytorch/pull/144170 Approved by: https://github.com/kit1980, https://github.com/Skylion007 ghstack dependencies: #144167, #144162, #144083, #144084	2025-01-04 15:24:55 +00:00
isalia20	22580f160e	Multinomial sampling fix on mps for non contiguous tensors (#141515 ) Fixes #141457 As for the tests. I looked in `test/test_mps.py` but I saw that `test_multinomial` function is disabled. Glad to add test where needed if there is some place where multinomial function is tested on metal. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141515 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-01-04 01:21:37 +00:00
Nikita Shulga	a93e75d1e2	[MPS] Handle implicit cpu-scalar-to-gpu transfer (#144055 ) Followup after https://github.com/pytorch/pytorch/pull/143934, this check is no longer necessary and fixes a subset of inductor tests Before `pytest test/inductor/test_torchinductor.py -k _mps` reports 463 failed, 291 passed, 32 skipped after 456 failed, 298 passed, 32 skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/144055 Approved by: https://github.com/Skylion007	2025-01-02 17:12:39 +00:00
Nikita Shulga	c27c788e35	[MPS] Fix `torch.add(x,y, alpha=2)` crash (#143949 ) TODO: as followup PR replace this weird logic with shaders Fixes https://github.com/pytorch/pytorch/issues/143932 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143949 Approved by: https://github.com/Skylion007 ghstack dependencies: #143948	2024-12-30 17:16:29 +00:00
Nikita Shulga	3054aae493	[MPS] Fix fmin/fmax for scalar argument (#143934 ) CPU scalar promotion to GPU is allowed for CUDA and shoudl be allowed for MPS as well (at the very least it should not crash) Fixes https://github.com/pytorch/pytorch/issues/143933 https://github.com/pytorch/pytorch/issues/142203 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143934 Approved by: https://github.com/Skylion007	2024-12-28 17:07:19 +00:00
Joona Havukainen	33c27be017	Workaround for gather_out in MPS backend (#135543 ) Avoids an underlying issue in reshape op in MPS that gets triggered when the input has multiple dimensions but the shape can be squeezed into 1D. The underlying issue is going to get fixed eventually. Fixes https://github.com/pytorch/pytorch/issues/135240 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135543 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-12-19 18:01:01 +00:00
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
Nikita Shulga	24a18d76c8	[MPS] Use metal shaders for all view ops (#143375 ) Before this PR Metal shaders were used to scatter/gather 1-5 dimensional tensors. This PR introduces generalized ones that could be used for any dimensionality and as results gets rid of 700+ lines complex and untested code that might not even work as expected. Generalized gather shader looks as follows ```metal kernel void gather_kernel_n(uint linear_index [[thread_position_in_grid]], constant void * src_ [[buffer(0)]], device void * dst_ [[buffer(1)]], constant uint32_t * size [[buffer(2)]], constant uint32_t * stride [[buffer(3)]], constant uint32_t & numel [[buffer(4)]], constant int32_t & ndim [[buffer(5)]]) {{ if (linear_index >= numel) return; constant {0} * src = (constant {0} )src_; device {1} dst = (device {1} )dst_; uint64_t src_offs = 0; auto src_idx = linear_index; for(int dim = ndim - 1; dim >= 0; --dim) {{ src_offs += stride[dim] (src_idx % size[dim]); src_idx /= size[dim]; }} dst[linear_index] = cast<{1}>(src[src_offs]); }} ``` Which, according to the following benchmark ```python from timeit import default_timer import torch import torch.utils.cpp_extension from torch.utils.benchmark import Measurement, Timer t = Timer( stmt=f"y.copy_(x);torch.mps.synchronize()", setup=f"x=torch.rand(4, 5, 16, 64, 33, 24, dtype=torch.float32, device='mps')[:,:,:,:24,:24,];y=torch.empty(x.shape, device=x.device, dtype=x.dtype)", language="python", timer=default_timer ) print(t.blocked_autorange()) ``` Is almost twice as fast as previous implementation (i.e. on Mac Book M2 Pro it returns 2.9ms for MPS version vs 1.5ms for shader one On MacOS Sequoia [`gatherWithUpdatesTensor: indicesTensor:...`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/gather(withupdatestensor:indicestensor:axis:batchdimensions:name:)?language=objc) crashes if invoked with complex data type, as one can see by running the code below ```swift import Metal import MetalPerformanceShadersGraph func gatherComplexMPS(device: MTLDevice, inp_buf: MTLBuffer, idx_buf: MTLBuffer, out_buf: MTLBuffer, inp_elem: Int, upd_elem: Int) { let graph = MPSGraph() let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .complexFloat32, name: nil) let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil) let outNode = graph.gather(withUpdatesTensor: inputPlaceholder, indicesTensor: indicesPlaceholder, axis: 0, batchDimensions: 0, name: nil) let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32) let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64) let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32) guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") } graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer, indicesPlaceholder: mpsIndicesBuffer ], targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer]) } func makeBufferWithValues<T>(device: MTLDevice, values: [T]) -> MTLBuffer { guard let buf = device.makeBuffer(length: values.count * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } let buf_data = buf.contents().assumingMemoryBound(to: T.self) for i in 0..<values.count { buf_data[i] = values[i] } return buf } guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") } print("Using device \(device.name)") let inp_buf = makeBufferWithValues(device: device, values: [1.0, 2.0 , 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]) let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3]) guard let out_buf = device.makeBuffer(length:8 * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } gatherComplexMPS(device: device, inp_buf: inp_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: 4, upd_elem: 4) ``` Fixes https://github.com/pytorch/pytorch/issues/143140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143375 Approved by: https://github.com/albanD	2024-12-18 16:15:46 +00:00
Joona Havukainen	afa313e669	Extend bmm tiling to work up to 2^32 elem in any single output dim (#143095 ) The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250). Fixes #141909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095 Approved by: https://github.com/kulinseth	2024-12-17 16:03:46 +00:00
Yu, Guangye	c1d4d9d3cf	[MPS] Support torch.accelerator.synchronize() on mps (#143171 ) # Motivation Support `torch.accelerator.synchronize()` on mps. The root cause is that MPS doesn't support lazy initialization. So we must check if the current accelerator supports device lazy initialization rather than early return. # Additional Context Add a mps UT to test code change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143171 Approved by: https://github.com/albanD	2024-12-16 02:18:32 +00:00
Nikita Shulga	8a04018329	[MPS] Fix conv backward for channels last (cont) (#143196 ) This is a continuation of https://github.com/pytorch/pytorch/issues/140902 but extends the same logic to input. Looks like existing channels-last logic just produced incorrect results on pre MacOS-15 versions and fails on MacOS-15, so removing it feels like a right idea Fixes https://github.com/pytorch/pytorch/issues/142344 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143196 Approved by: https://github.com/manuelcandales	2024-12-13 21:32:42 +00:00
George Wigley	e0c8abda76	Fix potentially undefined behaviour in index_put sample input (#143116 ) From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_: > If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements. Currently the sample inputs for `index_put` generates 2 indices. Because they are generated randomly, they could be the same leading to undefined behaviour if `accumulate=False`. This PR changes the input generation to only generate a single index if `accumulate=False` preventing duplicate indices and undefined behaviour. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116 Approved by: https://github.com/albanD	2024-12-13 17:59:01 +00:00
Nikita Shulga	95b17f6346	[MPS] Add CompileShader method (#141478 ) This allows one to do something like that ```python import torch x = torch.ones(10, device="mps") m = torch.mps._compile_shader(""" kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) { x[idx] += idx; } ") m.foo(x) ``` And in general enables writing custom operators using Metal shaders purely in Python Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478 Approved by: https://github.com/manuelcandales	2024-12-11 02:00:51 +00:00
PyTorch MergeBot	393cf46f42	Revert "[MPS] Add CompileShader method (#141478 )" This reverts commit `0478fee42d`. Reverted https://github.com/pytorch/pytorch/pull/141478 on behalf of https://github.com/malfet due to Broke doctests, by trying to run MPS example on Linux ([comment](https://github.com/pytorch/pytorch/pull/141478#issuecomment-2533351909))	2024-12-11 00:37:10 +00:00
Nikita Shulga	0478fee42d	[MPS] Add CompileShader method (#141478 ) This allows one to do something like that ```python import torch x = torch.ones(10, device="mps") m = torch.mps._compile_shader(""" kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) { x[idx] += idx; } ") m.foo(x) ``` And in general enables writing custom operators using Metal shaders purely in Python Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478 Approved by: https://github.com/manuelcandales	2024-12-10 22:43:17 +00:00
Yu, Guangye	bee445c3a3	[MPS] Support torch.Event for MPS (#142468 ) # Motivation Support `torch.Event` on mps backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142468 Approved by: https://github.com/malfet	2024-12-10 21:17:25 +00:00
Nikita Shulga	d6481333ad	[MPS] Add scatter_reduce.two (#141948 ) Which has been request 20+ times on https://github.com/pytorch/pytorch/issues/77764 is just a flavor of out-of-box scatter-reduce, so all this op does is redispatches existing implementation. Unsupported dtype/reduction type combinations: - min/max for int64 - min/max for int32 on MacOS-14 or older Following swift code demonstrates problem with scatterAlongAxis MPS call ```swift import Metal import MetalPerformanceShadersGraph func scatterMPS(device: MTLDevice, inp_buf: MTLBuffer, upd_buf: MTLBuffer, idx_buf: MTLBuffer, out_buf: MTLBuffer, inp_elem: Int, upd_elem: Int) { let graph = MPSGraph() let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .int64, name: nil) let updatesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil) let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil) let outNode = graph.scatterAlongAxis(0, data: inputPlaceholder, updates: updatesPlaceholder, indices: indicesPlaceholder, mode: .min, name: nil) let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .int64) let mpsUpdatesBuffer = MPSGraphTensorData(upd_buf, shape: [upd_elem as NSNumber], dataType: .int64) let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64) let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .int64) guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") } graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer, updatesPlaceholder: mpsUpdatesBuffer, indicesPlaceholder: mpsIndicesBuffer ], targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer]) } func makeBufferWithValues(device: MTLDevice, values: [Int64]) -> MTLBuffer { guard let buf = device.makeBuffer(length: values.count * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } let buf_data = buf.contents().assumingMemoryBound(to: Int64.self) for i in 0..<values.count { buf_data[i] = values[i] } return buf } guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") } print("Using device \(device.name)") let inp_elem = 4 let upd_elem = 4 let inp_buf = makeBufferWithValues(device: device, values: [1, 2, 3, 4]) let upd_buf = makeBufferWithValues(device: device, values: [Int64.max - 1, Int64.max - 2 , Int64.max >> 16 , 11]) let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3]) guard let out_buf = device.makeBuffer(length:inp_elem * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } scatterMPS(device: device, inp_buf: inp_buf, upd_buf: upd_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: inp_elem, upd_elem: upd_elem) let obuf_data = out_buf.contents().assumingMemoryBound(to: Int64.self) for i in 0..<inp_elem { print("out_buf[\(i)] = \(obuf_data[i])") } ``` that prints `4294967294, 4294967293, 4294967295, 4` instead of expected `1, 2, 3, 4` Where `torch.tensor([[1, 9223372036854775806], [2, 9223372036854775805], [3, 140737488355327], [4, 11]], dtype=torch.int64, device='mps').max(1)` yields an expected results Pull Request resolved: https://github.com/pytorch/pytorch/pull/141948 Approved by: https://github.com/manuelcandales	2024-12-04 04:56:43 +00:00
Roy Hvaara	90f19fee8a	[MPS] Convert `channels_last_3d` to `contiguous` for input tensor in `nn.Conv3d` (#141780 ) When the input tensor to Conv3d is in the channels_last_3d memory format the Conv3d op will generate incorrect output (see example image in #141471). This PR checks if the op is 3d, and then attempts to convert the input tensor to contiguous. Added a regression test that verifies the output by running the same op on the CPU. I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context? Fixes #141471 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780 Approved by: https://github.com/malfet	2024-12-01 18:36:53 +00:00
Roy Hvaara	4d5c096a55	[MPS] Add autocast rule for SDPA (#141776 ) Fixes #141774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141776 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-11-29 03:34:03 +00:00
Nikita Shulga	65166d86a3	[MPS] Add regression test for sync deadlock (#141296 ) See https://github.com/pytorch/pytorch/pull/140725#issuecomment-2492434870 Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]` ``` (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12 frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84 frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object, _object) + 40 frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100 frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92 frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040 frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200 frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104 frame #8: 0x0000000100fccbe4 Python`run_mod + 168 frame #9: 0x0000000100fcb518 Python`pyrun_file + 164 frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256 frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80 frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164 frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72 frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988 frame #15: 0x0000000100ff1564 Python`pymain_main + 304 frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40 frame #17: 0x000000019f630274 dyld`start + 2840 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141296 Approved by: https://github.com/huydhn	2024-11-22 00:56:33 +00:00
PyTorch MergeBot	5e54cf3687	Revert "Fix MPS synchronize by waiting for root buffer to complete (#140725 )" This reverts commit `9bc9d4cdb4`. Reverted https://github.com/pytorch/pytorch/pull/140725 on behalf of https://github.com/malfet due to It causes deadlocks when I try to run something benchmark from https://github.com/pytorch/pytorch/pull/127242 ([comment](https://github.com/pytorch/pytorch/pull/140725#issuecomment-2492416501))	2024-11-21 21:56:22 +00:00
Nikita Shulga	a8794fd7df	[MPS] Fix conv backward pass for channels last (#141009 ) Looks like a regression caused by use of strided API, but adding the test revealed (at least in CI), that on Ventura it worked but returned garbage results, so fixed by removing all the logic about channels last (as it's irrelevant for strided API case and placeholder already turns tensor into a correct one) This also allows one to remove `mem_format_key` and `ns_shape_key` (it was redundant even back then, as `mem_format_key` + `getTensorsStringKey(grad_output_t)` already uniquely identified the operation) Fixes https://github.com/pytorch/pytorch/issues/140902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141009 Approved by: https://github.com/manuelcandales	2024-11-20 19:50:31 +00:00
Siddharth Kotapati	9bc9d4cdb4	Fix MPS synchronize by waiting for root buffer to complete (#140725 ) Makes https://github.com/pytorch/pytorch/issues/139550#issuecomment-2468860559 work Pull Request resolved: https://github.com/pytorch/pytorch/pull/140725 Approved by: https://github.com/malfet, https://github.com/kulinseth	2024-11-19 23:10:24 +00:00
Nikita Shulga	9c88b08ac9	[BE] Replace `skipIfMPS` with `expectedFailureMPS` (#139940 ) Functionally two decorators are very similar, but one should rely on expectedFailure as much as possible to get signal when something is fixed. - Move `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION` - Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happens only on MacOS13 (which at this point will not get any fixes and will be deprecated soon) - Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940 Approved by: https://github.com/janeyx99, https://github.com/huydhn	2024-11-15 03:48:37 +00:00
Roy Hvaara	b0d681417c	[MPS] Reintroduce support for convolutions with output_channels > 65536 (#140726 ) This reintroduces support for high channel sizes for convs. The guard for macOS versions < 15.1 is still present to prevent reintroducing #129207. I'm unsure about the specific macOS version support, but I'm assuming this was fixed in 15.1, and I'm relying on signals from ci for verification. I'm expecting the new test will fail for macOS versions < 15.1, and the old test will start failing for > 15.0. I've added xfails for this and extended the version helpers to support 15.1+. Fixes #140722 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140726 Approved by: https://github.com/malfet	2024-11-14 20:09:01 +00:00
Nikita Shulga	cd6ace1d15	[EZ] Delete unused `xfailIfMacOS14_4Plus` (#140735 ) Issue was fixed by https://github.com/pytorch/pytorch/pull/130038 but decorator remained in place Pull Request resolved: https://github.com/pytorch/pytorch/pull/140735 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-11-14 20:08:48 +00:00
Nikita Shulga	9d93c27025	Implement unfold_backward on MPS (#135411 ) This PR adds native implementation of unfold_backward as metal shader, mostly copy-n-paste of algorithms used in CUDA and CPU implementations, i.e. considering `out = in.unfold(dim, size, step)`, then following holds true: * `out.shape[dim] == (in.shape[dim] - size) / step + 1` * `out.shape[-1] == size` * `out.ndim == in.ndim + 1` `unfold_backward` Metal kernel receives `grad_in` and returns `grad_out` such that: * `grad_in.shape == out.shape` * `grad_out.shape == in.shape` For each index in `grad_out` find the elements contributing to it and sum them up. Such algorithm requires no synchronization between threads. That is `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` is range [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to (0, `in_dim_size`) and `in_last_idx` are equal `out_dim_idx - in_dim_idx * step` . Accumulation step is skipped if `in_last_idx` is outside of [0, size] range. This operator has been requested 16 times on https://github.com/pytorch/pytorch/issues/77764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135411 Approved by: https://github.com/manuelcandales Co-authored-by: Manuel Candales <42380156+manuelcandales@users.noreply.github.com>	2024-11-13 23:04:15 +00:00
zeshengzong	cb71bcc542	Replace clone.detach with detach.clone (#140264 ) Fixes #64532 As state in issue, replace `clone.detach` by `detach.clone` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264 Approved by: https://github.com/soulitzer	2024-11-13 07:01:02 +00:00
Jiang, Yanbing	f77eb07662	Split int4wo weight packing (#139611 ) Fixes https://github.com/pytorch/ao/issues/1117. This PR is to seperate int4wo weight packing between CPU and other devices, to help implement `INT4CPULayout` in torchao based on https://github.com/pytorch/ao/issues/1117#issuecomment-2451252756. Now, for CPU, the input `weight` of `_convert_weight_to_int4pack_for_cpu` is [n, k] int32, output is [n, k / 2] uint8. The input packed weight of `_weight_int4pack_mm_for_cpu` is [n, k / 2] uint8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139611 Approved by: https://github.com/jerryzh168	2024-11-12 10:12:50 +00:00
Nikita Shulga	f5ffd55a32	[MPS] Add `torch.special.i1` op (#140196 ) By more-or-less copy-n-pasting `58b661cda2/aten/src/ATen/native/cuda/Math.cuh (L576)` Enable respective tests in test_mps.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/140196 Approved by: https://github.com/Skylion007	2024-11-11 16:57:53 +00:00
Nikita Shulga	103cbd7231	[MPS] Restrict MSELoss to floating types (#139960 ) Becuase if invoked with long type it crahses deep in MPSGraph framework and to keep parity with CPU Add test that validates that if dtype is not floating, both CPU and MPS implementations will error out Fix function name for `mse_loss_out_mps` as `__func__` for any structured op implementation is `impl` Fixes https://github.com/pytorch/pytorch/issues/139723 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139960 Approved by: https://github.com/kimishpatel ghstack dependencies: #139961, #139959	2024-11-08 00:28:54 +00:00
Sun, Jiayi	44df6522ee	add Half/BFloat16 support for grid_sample on CPU (#134812 ) Fix https://github.com/pytorch/pytorch/issues/127224. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812 Approved by: https://github.com/Skylion007, https://github.com/mingfeima	2024-11-06 14:02:08 +00:00
Mikayla Gawarecki	ca43ecd599	Flip default on weights_only (#137602 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137602 Approved by: https://github.com/malfet, https://github.com/albanD ghstack dependencies: #138936, #139221, #139433, #139541	2024-11-04 18:30:29 +00:00
Nikita Shulga	51adab0829	[MPS] Fix reduction ops outputs for empty tensors (#139446 ) By adding a switch for all reduction types, that either sets it to given value or raises runtime error. Before this change, reduction ops returned uninitialized values in many case Fixes https://github.com/pytorch/pytorch/issues/139400 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139446 Approved by: https://github.com/Skylion007	2024-11-01 17:32:12 +00:00
Yoshimasa Niwa	6e85266a47	[MPS] Fixes SiLU on non-contiguous tensors (#139006 ) Similar to #123049, however, `SiLU` also produces random values, `0.0`, or `NaN` as results if input tensor is not contiguous on prior to macOS 15.0. Orignally the problem was found at jy0205/Pyramid-Flow#113. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139006 Approved by: https://github.com/malfet	2024-10-30 15:44:59 +00:00
PyTorch MergeBot	38645e8a3e	Revert "Fix unbind_copy and add its decomposition (#134319 )" This reverts commit `8aedc649bd`. Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))	2024-10-29 04:54:37 +00:00
Nikita Shulga	652a2ab93e	[BE] Skip `print(foo)` tests (#139009 ) Skipped `test_exponential` and `test_multinomial` because simply printing the result of an operator does not constitute a test. The testing framework does not attempt to interpret the output. Modify `test_print_non_contiguous` to get tensors string representation, which is an equivalent operation Pull Request resolved: https://github.com/pytorch/pytorch/pull/139009 Approved by: https://github.com/Skylion007	2024-10-27 18:04:03 +00:00
Scott Wolchok	a3de067975	[PyTorch] Use 128-bit vectors for ARM64 (#137426 ) The correct vector length for ARM64 is 128 bits (16 bytes). We were previously using double this, apparently just because that would be the same length as AVX2. Differential Revision: [D63984039](https://our.internmc.facebook.com/intern/diff/D63984039/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137426 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #138486, #138542, #138655, #138716, #138744	2024-10-26 00:20:35 +00:00
Nikita Shulga	1b31248933	[EZ] Fix typo in test_mps.py (#138738 ) s/emedding_weight/embedding_weight/ Stolen from `074766d9b4` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138738 Approved by: https://github.com/atalman	2024-10-23 22:15:35 +00:00
Tom Ritchford	8aedc649bd	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-23 19:13:44 +00:00
Tom Ritchford	1bc73f3157	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-23 17:42:11 +00:00
Nikita Shulga	de16159e56	[MPS] Fix sliced cast (#138314 ) This fixes internal crash due to the invalid bufer size computation if sliced API is used Not sure what was the purpose of ```c++ IntArrayRef baseShape; if (src.is_view()) { baseShape = src._base().sizes(); } else { baseShape = getIMPSAllocator()->getBufferShape(src.storage().data()); } int flattenedShaped = 1; for (const auto i : c10::irange(baseShape.size())) { flattenedShaped *= baseShape[i]; } ``` As flattenShaped could be much easier computed as `[srcBuf lengh]/src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do. When someone allocated buffer to hold say uint8 and that view-casted it to float16, attempt to compute `baseShape` returned sizes of original tensor in its data type, rather than size in new dtypes Fixes https://github.com/pytorch/pytorch/issues/137800 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314 Approved by: https://github.com/albanD, https://github.com/DenisVieriu97	2024-10-19 05:17:09 +00:00
PyTorch MergeBot	7b39fb5712	Revert "Fix unbind_copy and add its decomposition (#134319 )" This reverts commit `9f81270d75`. Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))	2024-10-18 20:09:40 +00:00
Tom Ritchford	9f81270d75	Fix unbind_copy and add its decomposition (#134319 ) * Fixes https://github.com/pytorch/pytorch/issues/130829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-17 21:27:35 +00:00
PyTorch MergeBot	4b3035f2fe	Revert "Add decomposition for permute_copy (#130944 )" This reverts commit `e7a4ad3b40`. Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))	2024-10-16 23:18:53 +00:00
Tom Ritchford	e7a4ad3b40	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-10-15 13:51:20 +00:00
Nikita Shulga	0786b37260	[MPS] Add i0 op (#137849 ) More-or-less verbatim copy of `47c8aa8090/aten/src/ATen/native/Math.h (L101)` Plus a bit of a MPS boilerplate code Update test_mps.py to mark kaiser_window and i0 as passing Pull Request resolved: https://github.com/pytorch/pytorch/pull/137849 Approved by: https://github.com/Skylion007	2024-10-14 22:50:01 +00:00
Nikita Shulga	ad38bad766	[MPS] Add `tri[lu]_indices` (#137648 ) Requested in https://github.com/pytorch/pytorch/issues/77764#issuecomment-2402365980 Copy-n-paste kernel implementation from `13cf8360d8/aten/src/ATen/native/cuda/TensorFactories.cu (L92)` though use `float` instead of `double` for square root computation Pull Request resolved: https://github.com/pytorch/pytorch/pull/137648 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #137601, #137647	2024-10-10 23:41:06 +00:00
Nikita Shulga	13cf8360d8	[MPS] Fix testing for generator operators (#137601 ) Before this changes, tests for operators like `eye` or `triu_indices` were essentially a test that respective CPU operators are stable, as cpu_sample and mps_sample were the same Moved the logic to `transform_opinfo_sample_to_mps` whicih in addition to copying tensors is also tweaks `kwargs` Discovered that: - `torch.randn` and `torch.randint` fall into the same undefined category - `torch.logspace` is not implemented for MPS - Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side - `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem) - `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601 Approved by: https://github.com/albanD	2024-10-09 23:17:11 +00:00
Nikita Shulga	3d0cb81594	[MPS] Enable bfloat16 testing (#136987 ) By even further reducing precisions of imprecise FP16 ops, introducing new BF16_LOW_PRECISION_OPS category and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`. I guess the nature of low-precision results, is that MPSGraph, unlike the rest of the PyTorch does not do accumulation over fp32 for reduction operations Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987 Approved by: https://github.com/albanD ghstack dependencies: #137070	2024-10-01 17:10:07 +00:00
Tom Ritchford	b85f21fc1d	Add decomposition for squeeze_copy (#130941 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941 Approved by: https://github.com/amjames, https://github.com/eellison ghstack dependencies: #136653	2024-10-01 10:23:22 +00:00
Nikita Shulga	4af03e54b7	[MPS][BE] Use `None` as alias for all types (#137004 ) Test like `new_` and `empty_` fail the current implementation, see Pull Request resolved: https://github.com/pytorch/pytorch/pull/137004 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986, #137003	2024-09-30 19:06:13 +00:00
Nikita Shulga	c610aa80dc	Testing: Unblock `new_*` testing on MPS (#137003 ) By changing `other_dtype` to `torch.half` rather than `double` in `sample_inputs_new_fns` if MPS is available Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003 Approved by: https://github.com/Skylion007 ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986	2024-09-30 19:06:12 +00:00
Nikita Shulga	283bda01aa	[MPS] Error checking/bf16 support for `torch.normal` (#136863 ) Before that attempt to run something like ``` % python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))" ``` Resulted in hard error ``` (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<xf32> (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<xf32> /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification' ``` After the change, it raises a nice type error Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863 Approved by: https://github.com/Skylion007 ghstack dependencies: #136754, #136755, #136821, #136822	2024-09-27 21:11:59 +00:00
Nikita Shulga	9d72f7481b	[MPS] Fix AvgPool2d for float16 (#136822 ) This was a stupid cast error that caused MPSGraph to crash with the following exception ``` (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<xf32> (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results (mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<xf32> /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822 Approved by: https://github.com/Skylion007 ghstack dependencies: #136754, #136755, #136821	2024-09-27 15:32:18 +00:00
Nikita Shulga	2b6f4e9e24	[BE][MPS] Delete MacOS12 low-precision ops (#136821 ) `norm` and `masked.normalize` still have to stay in the list Pull Request resolved: https://github.com/pytorch/pytorch/pull/136821 Approved by: https://github.com/Skylion007 ghstack dependencies: #136754, #136755	2024-09-27 15:32:18 +00:00
Nikita Shulga	69bd13d12e	[EZ][BE] Add `torch.complex` to MPS_DTYPES (#136755 ) As minimal supported OS has been rasied to MacOS 13, some basic complex operations should be supported, and the rest could be `xfailed` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136755 Approved by: https://github.com/Skylion007 ghstack dependencies: #136754	2024-09-27 05:01:40 +00:00
Roy Hvaara	5789f8d5dc	[MPS] Add regression test for large inputs to `F.linear` (#136084 ) This PR adds a regression test for the issue reported in #122045. I was not able to reproduce on macOS > 13. ~Expect the first iteration of the tests to fail for macOS 13, but pass for 14 and 15.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/136084 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-26 20:46:14 +00:00
Nikita Shulga	68579ef665	[EZ][MPS] Extend `arange` to bfloat16 (#136754 ) RangeFactories class is the only one that uses `AT_DISPATCH_MPS_TYPES` Fixes https://github.com/pytorch/pytorch/issues/136624 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136754 Approved by: https://github.com/Skylion007	2024-09-26 15:33:45 +00:00
Nikita Shulga	73ec76ed50	[MPS] Implement `isposinf` and `isneginf` (#136689 ) Not sure, why `isinf` is a composite op, but those needs to be implemented by hand. Implementation is a trivial call to ```objc [mpsGraph equalWithPrimaryTensor:input secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity() dataType:input.dataType]] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689 Approved by: https://github.com/Skylion007	2024-09-26 15:33:20 +00:00
Nikita Shulga	c6192f32f1	[MPS] Add upsample_bicubic2d as Metal op (#136123 ) More or less literal copy-n-paste of `c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)` and `c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)` Missing `uint8` implementation mimics CUDA behavior Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk Later refinements: - Switch from 2D dispatch to 1D one (to match CUDA behavior) - Added batch + channel loops - Fixed scale computation to match align corners behavior - Added backward implementation Backward implementation again, mimics CUDA, so it has issues precision issue for `torch.half` as well as a somewhat slow simulation of atomic adds using atomic compare and exchange of the pair of adjacent values, i.e. ```metal emplate <typename T> static inline void atomic_add_helper( device atomic<int>* data, long offset, float value) { auto ptr = data + (offset >> 1); auto old = atomic_load_explicit(ptr, memory_order_relaxed); union { int i; T t[2]; } val; do { val.i = old; val.t[offset & 1] += static_cast<T>(value); } while (!atomic_compare_exchange_weak_explicit( ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed)); } ``` Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123 Approved by: https://github.com/albanD	2024-09-24 18:58:11 +00:00
Nikita Shulga	f6f1504d39	[MPS] Fix 5D+ reductions over negative dimentions (#136198 ) This fixes bug introduced by https://github.com/pytorch/pytorch/pull/99856 that attempts to speed-up reduction for 5D+ tensor if trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped dimensions Added regresion test case to `TestMPS.test_sum` Fixes https://github.com/pytorch/pytorch/issues/136132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198 Approved by: https://github.com/albanD	2024-09-17 21:53:31 +00:00
PyTorch MergeBot	462b727d1e	Revert "Add decomposition for permute_copy (#130944 )" This reverts commit `ab9a7eadd3`. Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))	2024-09-17 13:42:55 +00:00
PyTorch MergeBot	2c4ae81494	Revert "Add decomposition for squeeze_copy (#130941 )" This reverts commit `c33b0580e6`. Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))	2024-09-17 13:39:07 +00:00
Tom Ritchford	c33b0580e6	Add decomposition for squeeze_copy (#130941 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-16 15:46:57 +00:00
Tom Ritchford	ab9a7eadd3	Add decomposition for permute_copy (#130944 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-15 19:35:14 +00:00
CaoE	db393fb95e	Add Half support for reflection and replication padding on CPU (#135931 ) Fixes #135680 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931 Approved by: https://github.com/Skylion007	2024-09-14 14:18:55 +00:00
PyTorch MergeBot	1786a17fed	Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 )" This reverts commit `51c5206133`. Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))	2024-09-14 02:31:06 +00:00
CaoE	51c5206133	Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232 ) Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232 Approved by: https://github.com/ezyang	2024-09-14 02:20:58 +00:00
Tom Ritchford	e05ea2b179	Add decomposition for transpose_copy (#130943 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943 Approved by: https://github.com/amjames, https://github.com/eellison	2024-09-11 19:45:22 +00:00
Roy Hvaara	09287e3af4	[MPS] Add regression test for `fft.fftfreq` (#135440 ) The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it. Fixes #135223 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440 Approved by: https://github.com/ezyang	2024-09-09 17:12:36 +00:00
Kulin Seth	144fde4fd2	[MPS] Add support for autocast in MPS (#99272 ) Fixes https://github.com/pytorch/pytorch/issues/88415 Need to run inductor/test_cpu_select_algorithm Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272 Approved by: https://github.com/malfet Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Roy Hvaara <roy@lightyear.no>	2024-09-05 23:23:17 +00:00
Tobias Ringwald	758d787901	Added complex support for `torch.logsumexp` (#133187 ) Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`. Fixes #133047 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-09-03 17:28:36 +00:00
Joona Havukainen	92f282ca52	Enable batch matmul for result sizes > 232 the tensor can be split along batch axis (#133430 ) Fixes #131865. Addresses the issue seen when running llama v3.1 8B parameter model on MPS backend where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert. Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it: ``` import torch device='mps' a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device) b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device) res = torch.bmm(a, b) ``` Notably the current change only works as long as the individual output matrix in the bmm does not exceed the number of elements 232. This lets us split up the computation along the batch axis to avoid going over the limit. Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large to handle for this op until a more general workaround tiling the matmuls is available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-08-30 14:08:43 +00:00
Li-Huai (Allan) Lin	e7711d6c7d	[MPS] Fix SDP training (#134719 ) Check whether the input tensors require grad. If required, then we don't get into the fast path and fall back to composite implicit. Fixes #134678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134719 Approved by: https://github.com/malfet	2024-08-29 01:28:53 +00:00
Nikita Shulga	8de0d7690c	Use newer `toAccumulateType` signature in `Normalization.cpp` (#134540 ) Which fixes BatchNorm behavior for if called with empty tensors on MPS backed. Removed `expectedFailureMPS` in test_nn.py, deleted expected failure in `test_mps.py` and adjusted `skipIfMPS` to `expectedFailureMPS` in BatchNorm2d OpInfo decorator, but restrict it only to the memory format tests Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"` Fixes https://github.com/pytorch/pytorch/issues/134423 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-08-27 18:09:20 +00:00
Roy Hvaara	43f78bf37a	[MPS] Gather sliced inputs to batch norm (#133610 ) This PR removes the `executeGatherOp` flag from batch norm in favor of relying on the logic in `4aa66f68a8/aten/src/ATen/native/mps/OperationUtils.mm (L372)` to decide if gathering is necessary. It's not the most efficient way to solve this issue, but it assures correctness for sliced inputs. ### Performance impact #### With fix ``` python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 100 loops, best of 5: 282 usec per loop python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 100 loops, best of 5: 448 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 705 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 1.11 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 7.16 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 11.7 msec per loop ``` #### Without fix ``` python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 100 loops, best of 5: 284 usec per loop python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 100 loops, best of 5: 265 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 715 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 675 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 7.19 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 7.13 msec per loop ``` Please feel free to push back or request changes. Fixes #133520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133610 Approved by: https://github.com/malfet	2024-08-20 18:24:48 +00:00
Denis Vieriu	861bdf96f4	[MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393 ) Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors. Summary of changes (starting with macOS 15): - Add support for MPS strided API (strides/storage offsets etc): - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc) - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc) - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc) - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc) - Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW). - Add support for strided output buffers (previously we would create a contiguous buffer OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets. --- Couple performance stats collected from torchbench comparing macOS 15 vs macOS 14: ``` - test_train[functorch_maml_omniglot-mps]: 27% faster - test_train[timm_vision_transformer-mps]: 12% faster - test_train[hf_T5-mps]: 9.46% faster ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393 Approved by: https://github.com/albanD Co-authored-by: Siddharth Kotapati <skotapati@apple.com>	2024-08-16 21:07:50 +00:00
Sun, Jiayi	7be77658e9	[Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155 ) This PR supports masked vectorization for the tail_loop for torch.uint8 and torch.int8 datatype to improve performance. BTW, I fixed the UT of `byte` by setting the range of the sample inputs to [0, 255] since the range of `torch.uint8` is [0, 255]. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel ghstack dependencies: #130724	2024-08-13 01:12:05 +00:00
Li-Huai (Allan) Lin	cc1cc71c46	[MPS] Fix relu for 0-element input case (#133191 ) Fixes #133182 Should already be tested by `test/test_mps.py::MPSReluTest::testNumbersGPU`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133191 Approved by: https://github.com/albanD	2024-08-12 19:24:17 +00:00
PyTorch MergeBot	2764bee942	Revert "[MPS] Add support for autocast in MPS (#99272 )" This reverts commit `6919e8baab`. Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/clee2000 due to Broke test/inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_quantized_linear_amx_batch_size_3_in_features_128_out_features_64_bias_False_cpu on sm86 jobs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10252979157/job/28367091621) [HUD commit link](`6919e8baab`) Not caught on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2269808857))	2024-08-05 19:59:04 +00:00
Kulin Seth	6919e8baab	[MPS] Add support for autocast in MPS (#99272 ) Fixes https://github.com/pytorch/pytorch/issues/88415 Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272 Approved by: https://github.com/malfet	2024-08-05 17:02:30 +00:00
Oguz Ulgen	221350e3a4	Add None return type to init -- tests (#132352 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352 Approved by: https://github.com/ezyang ghstack dependencies: #132335, #132351	2024-08-01 15:44:51 +00:00
ekamiti	9e473fd868	Make adding Buffers more like adding Parameters (#125971 ) Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible. Fixes #35735 Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971 Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos	2024-07-31 10:32:40 +00:00
Li-Huai (Allan) Lin	964f97539f	[MPS] Correct nonzero warning and fix the test (#132127 ) #125355 lifted the natively supported macOS version to 14. Fixes #132110 Probably fixes this flaky test disabling issue: #126492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132127 Approved by: https://github.com/malfet	2024-07-30 19:46:25 +00:00
Li-Huai (Allan) Lin	a147fa577b	[MPS] Fix masked_fill_ in non_contiguous cases (#131957 ) fixes #131285 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131957 Approved by: https://github.com/DenisVieriu97	2024-07-30 01:34:48 +00:00
Tom Ritchford	bdf5a6dca9	Add decomposition for unsqueeze_copy (#130942 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942 Approved by: https://github.com/peterbell10	2024-07-29 21:13:37 +00:00
Joona Havukainen	082d0b80ca	Min and max NaN propagation fix in MPS backend (#130445 ) Partial fix to issue #130295 Moves min and max ops to use the NaN propagating API in MPS to align with the pytorch convention. Adds a regression test to validate the fix achieves parity with cpu backend. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130445 Approved by: https://github.com/malfet	2024-07-29 20:09:15 +00:00
Tom Ritchford	962f248437	Add decomposition for expand_copy (#130940 ) * Extracted from #129476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940 Approved by: https://github.com/peterbell10	2024-07-29 16:23:56 +00:00
Manuel Candales	d6115439be	[MPS] Add SDPA implentation (#131362 ) This work is based off @malfet's #119200 Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131362 Approved by: https://github.com/kimishpatel	2024-07-25 03:24:37 +00:00
Tom Ritchford	16247987a1	Add decomposition for t_copy (#130939 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939 Approved by: https://github.com/peterbell10	2024-07-23 08:29:19 +00:00
Joona Havukainen	102d8e5a63	MPS LSTM backward kernel workaround on MacOS 14.4+ (#130038 ) The bug causing the correctness problem will be fixed in future OS release. Root cause of the problem is in a bug in an optimization to MPSGraph reshape operation in MacOS 14_4 that results in a correctness issue with the shapes the LSTM gradient operation has when num_layers > 2. Solves silentness of issue #125803. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038 Approved by: https://github.com/malfet	2024-07-23 06:30:40 +00:00
Tom Ritchford	500cbb5b90	Add decomposition for view_copy (#130938 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938 Approved by: https://github.com/peterbell10 ghstack dependencies: #130937	2024-07-21 20:39:24 +00:00
Xuehai Pan	d2bd9acabd	[BE] bump `optree` version to 0.12.1 (#130139 ) 0.12.0 Major Updates: - Add context manager to temporarily set the dictionary sorting mode - Add accessor APIs - Use `stable` tag for `pybind11` for Python 3.13 support - Fix potential segmentation fault for pickling support 0.12.1 Updates: - Fix warning regression during import when launch with strict warning filters Closes #130155 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139 Approved by: https://github.com/zou3519 ghstack dependencies: #130895	2024-07-20 02:41:10 +00:00
Li-Huai (Allan) Lin	8ea03372a1	[MPS] Store philox counter as part of the RNG state (#130662 ) Fixes #130613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130662 Approved by: https://github.com/malfet	2024-07-18 15:57:28 +00:00
PyTorch MergeBot	074a5c0c9b	Revert "[BE] bump `optree` version to 0.12.1 (#130139 )" This reverts commit `8fcb156e8b`. Reverted https://github.com/pytorch/pytorch/pull/130139 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_codegen_dynamic_shapes.py and test_sympy_utils.py `8fcb156e8b` ([comment](https://github.com/pytorch/pytorch/pull/130139#issuecomment-2229248447))	2024-07-15 19:42:11 +00:00
Xuehai Pan	8fcb156e8b	[BE] bump `optree` version to 0.12.1 (#130139 ) 0.12.0 Major Updates: - Add context manager to temporarily set the dictionary sorting mode - Add accessor APIs - Use `stable` tag for `pybind11` for Python 3.13 support - Fix potential segmentation fault for pickling support 0.12.1 Updates: - Fix warning regression during import when launch with strict warning filters Closes #130155 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139 Approved by: https://github.com/zou3519	2024-07-15 17:27:07 +00:00
Nikita Shulga	febadda107	[MPS] Fix `torch.[all\|any]` for 5+D tensors (#130542 ) Workaround bug in `reductionAndWithTensor:` that kills app with the following assert if 5+D tensor as an input ``` Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76. ``` by reshaping the tensor to 2D/3D one before running the reduction. Refactored common code into `all_any_common_impl_mps` as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue Enabled `test_reduction_ops_5D` and added regression test to it Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #130541	2024-07-12 15:06:22 +00:00
PyTorch MergeBot	d97d962082	Revert "Add decompositions for copy variants of view ops (#128416 )" This reverts commit `68751799b8`. Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))	2024-07-11 22:09:23 +00:00
PyTorch MergeBot	a2f630a9a4	Revert "Decompose expand_copy and permute_copy (#129476 )" This reverts commit `7d4cb21098`. Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))	2024-07-11 22:06:15 +00:00
Xuehai Pan	973037be6a	[BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): `list()` / `tuple()` / `dict()` (#130199 ) This PR changes the empty collection factory call to Python literals: - `list()` -> `[]` - `tuple()` -> `()` - `dict()` -> `{}` The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary: ```bash $ python3 -m dis - <<EOS import collections d1 = {} d2 = dict() dict = collections.OrderedDict d3 = dict() EOS ``` ```text 0 0 RESUME 0 1 2 LOAD_CONST 0 (0) 4 LOAD_CONST 1 (None) 6 IMPORT_NAME 0 (collections) 8 STORE_NAME 0 (collections) 3 10 BUILD_MAP 0 12 STORE_NAME 1 (d1) 4 14 PUSH_NULL 16 LOAD_NAME 2 (dict) 18 CALL 0 26 STORE_NAME 3 (d2) 6 28 LOAD_NAME 0 (collections) 30 LOAD_ATTR 8 (OrderedDict) 50 STORE_NAME 2 (dict) 7 52 PUSH_NULL 54 LOAD_NAME 2 (dict) 56 CALL 0 64 STORE_NAME 5 (d3) 66 RETURN_CONST 1 (None) ``` The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199 Approved by: https://github.com/malfet	2024-07-11 17:30:28 +00:00
Jiang, Yanbing	6f662e9575	update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 ) This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`. Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different. ![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd) Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout. ![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7) ### Performance Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) There is no obvious regression of this PR. ![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940 Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima	2024-07-11 15:26:48 +00:00
Tom Ritchford	7d4cb21098	Decompose expand_copy and permute_copy (#129476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-07-10 17:12:01 +00:00
Li-Huai (Allan) Lin	99967e1119	[MPS][TYPE_PROMOTION] Fix Clamp (#130226 ) Summary: 1. Fixed #130201 by adding type promotion. 2. Added proper tests. 3. Found torch's type promotion is different from numpy as follows: ```python import torch import numpy as np np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype # dtype('float64') torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype # torch.float32 ``` ~Not sure the proper way to handle it, it causes numpy ref tests to fail.~ Reason here, so think I'm gonna xfail it: `3c1cf03fde/test/test_ops.py (L260-L264)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226 Approved by: https://github.com/malfet	2024-07-10 14:27:39 +00:00
PyTorch MergeBot	637cc8d27f	Revert "update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 )" This reverts commit `6367f02a0e`. Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main `6367f02a0e` ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))	2024-07-10 13:48:32 +00:00
Jiang, Yanbing	6367f02a0e	update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 ) This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`. Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different. ![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd) Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout. ![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7) ### Performance Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) There is no obvious regression of this PR. ![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940 Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima	2024-07-10 07:38:42 +00:00
Tom Ritchford	68751799b8	Add decompositions for copy variants of view ops (#128416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-07-10 01:39:09 +00:00
Joel Schlosser	c8ab2e8b63	Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238 ) This PR: * Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed). * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`: * Uncovered a bunch of test issues: * Test breakdown (>100 total) * A lot of tolerance issues (tweaked tolerance values to fix) * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype) * 3 actually broken semantics (for masked tensor; added xfails) * 4 Jacobian mismatches (added xfails) * 2 nan results (skip for now, need fixing) * 3 results too far from reference result (add xfails) * Skips MPS tests for now (there are so many failures!). Those will default to the old behavior. before (no seed setting): ``` real 0m21.306s user 0m19.053s sys 0m5.192s ``` after (with seed setting): ``` real 0m21.905s user 0m19.578s sys 0m5.390s ``` * Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command. ``` ====================================================================== ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper return test(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar self.assertFalse(True) AssertionError: True is not false The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper method(args, *kwargs) File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper method(args, kwargs) File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test result = test(self, param_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper fn(args, *kwargs) File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper raise new_e from e Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='') To execute this test, run the following from the base repo dir: PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- Ran 1 test in 0.037s FAILED (errors=1) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238 Approved by: https://github.com/janeyx99, https://github.com/justinchuby	2024-07-08 16:06:38 +00:00
PyTorch MergeBot	07450e9713	Revert "[MPS] Add support for autocast in MPS (#99272 )" This reverts commit `6240cfd5c7`. Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/jeanschmidt due to introduced breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2203033719))	2024-07-02 12:29:51 +00:00
Kulin Seth	6240cfd5c7	[MPS] Add support for autocast in MPS (#99272 ) Fixes https://github.com/pytorch/pytorch/issues/88415 Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272 Approved by: https://github.com/malfet	2024-07-02 01:49:52 +00:00
Huy Do	fdd0a7f9b4	Run test_mps_allocator_module serially (#129340 ) Not sure why this test starts to fail (maybe runner update) `8a2fed7e6a/1` or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-07-01 18:44:48 +00:00
Joona Havukainen	5b96a552df	Add a check and error message for no support on MPS for conv with output_channels > 2^16 (#129484 ) Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value. The fix for the missing support is coming in later as that requires work on the kernel side so it'll take some more time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484 Approved by: https://github.com/kulinseth	2024-06-28 20:57:40 +00:00
Manuel Candales	eabe6574c0	[metal] Parameterize group_size in int4_mm test, fix int4mm shader for group_size > 128 (#129628 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129628 Approved by: https://github.com/kimishpatel	2024-06-28 15:01:30 +00:00
Nikita Shulga	bc68907caa	[EZ][BE] Replace `assertTrue` with more appropriate checks (#129569 ) Based on this https://github.com/pytorch/pytorch/pull/129340#issuecomment-2191228046 I.e. - `assertTrue(x == y)` -> `assertEqual(x, y) - `assertTrue(not x)` -> assertFalse(x)` - `assertTrue(x > y)` -> assertGreater(x, y)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129569 Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007	2024-06-26 16:29:59 +00:00
PyTorch MergeBot	b045878f81	Revert "Remove test_mps_allocator_module XFAIL (#129340 )" This reverts commit `c888ee3632`. Reverted https://github.com/pytorch/pytorch/pull/129340 on behalf of https://github.com/huydhn due to The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](https://github.com/pytorch/pytorch/pull/129340#issuecomment-2189701706))	2024-06-25 18:37:54 +00:00
Isuru Fernando	e6bfa2958b	Add aten._unsafe_masked_index (#116491 ) To generate masked indexing operations that would generate masked loads in triton code Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-06-25 02:45:02 +00:00
Isuru Fernando	5f912f480c	Fix max_pool2d decomposition for empty list and integer limits (#129106 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129106 Approved by: https://github.com/peterbell10, https://github.com/lezcano, https://github.com/malfet ghstack dependencies: #129096, #129097	2024-06-24 22:19:42 +00:00
Huy Do	c888ee3632	Remove test_mps_allocator_module XFAIL (#129340 ) Not sure why this test starts to fail (maybe runner update) `8a2fed7e6a/1` or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340 Approved by: https://github.com/kit1980	2024-06-24 16:26:38 +00:00
Manuel Candales	749c03406c	[metal] Add int4mm weight packing mps kernel, and improved int4mm shader (#128965 ) Adds _convert_weight_to_int4pack MPS kernel Replaces previous int4mm Metal shader, with shader authored by @kimishpatel which improves perf by ~40% Pull Request resolved: https://github.com/pytorch/pytorch/pull/128965 Approved by: https://github.com/malfet	2024-06-23 02:10:46 +00:00
Li-Huai (Allan) Lin	799acd31b4	[MPS] Add lu_factor (#99269 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at d75cde1</samp> Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269 Approved by: https://github.com/kulinseth, https://github.com/lezcano	2024-06-20 07:35:29 +00:00
Li-Huai (Allan) Lin	9a7e2519d3	[MPS] Fused Adam & AdamW (#127242 ) Summary: This PR adds fused Adam and AdamW implementations. Benchmark on Macbook Pro with M1 Max chip and 64GB unified memory: Fast math enabled: ``` [---------------------------------------------- Fused Adam ----------------------------------------------] \| Fused: True \| Fused: False 1 threads: ----------------------------------------------------------------------------------------------- amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 \| 10 \| 100 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 \| 9 \| 89 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 \| 9 \| 90 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 \| 9 \| 83 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 \| 12 \| 94 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 \| 11 \| 88 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 \| 12 \| 90 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 \| 11 \| 100 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 \| 27 \| 100 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 \| 23 \| 100 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 \| 27 \| 100 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 \| 23 \| 98 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500 \| 82 \| 480 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500 \| 72 \| 450 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500 \| 82 \| 450 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500 \| 73 \| 420 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500 \| 91 \| 500 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500 \| 83 \| 400 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500 \| 94 \| 500 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500 \| 78 \| 400 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500 \| 170 \| 500 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500 \| 140 \| 600 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500 \| 170 \| 600 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500 \| 140 \| 500 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000 \| 250 \| 890 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000 \| 220 \| 850 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000 \| 250 \| 830 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000 \| 220 \| 770 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000 \| 270 \| 870 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000 \| 230 \| 840 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000 \| 270 \| 810 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000 \| 240 \| 800 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000 \| 400 \| 1000 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000 \| 360 \| 2000 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000 \| 430 \| 2000 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000 \| 360 \| 1300 Times are in milliseconds (ms). ``` Fast math disabled: ``` [---------------------------------------------- Fused Adam ----------------------------------------------] \| Fused: True \| Fused: False 1 threads: ----------------------------------------------------------------------------------------------- amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 \| 10 \| 100 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 \| 9 \| 84 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 \| 9 \| 84 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 \| 9 \| 79 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 \| 11 \| 93 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 \| 10 \| 90 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 \| 11 \| 91 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 \| 11 \| 81 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 \| 34 \| 100 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 \| 31 \| 100 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 \| 34 \| 95 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 \| 31 \| 100 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500 \| 94 \| 500 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500 \| 82 \| 430 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500 \| 92 \| 430 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500 \| 81 \| 390 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500 \| 98 \| 500 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500 \| 88 \| 430 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500 \| 100 \| 500 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500 \| 88 \| 400 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500 \| 210 \| 500 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500 \| 190 \| 610 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500 \| 210 \| 510 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500 \| 190 \| 500 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000 \| 300 \| 900 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000 \| 260 \| 850 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000 \| 295 \| 900 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000 \| 260 \| 800 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000 \| 320 \| 910 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000 \| 280 \| 900 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000 \| 320 \| 900 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000 \| 300 \| 900 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000 \| 500 \| 2000 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000 \| 480 \| 2000 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000 \| 540 \| 1500 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000 \| 480 \| 1200 Times are in milliseconds (ms). ``` ```python def profile_fused_adam(): from torch.optim import adam, adamw import torch.utils.benchmark as benchmark import itertools def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused): fn( params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, foreach=False, capturable=False, fused=fused, amsgrad=amsgrad, beta1=0.9, beta2=0.99, lr=1e-3, weight_decay=.0, eps=1e-5, maximize=False, grad_scale=None, found_inf=None, ) torch.mps.synchronize() device = "mps" results = [] for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]): print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}") params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)] max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else [] state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)] if adamWflag: fn = adamw.adamw else: fn = adam.adam for fused in [True, False]: t = benchmark.Timer( stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)', label='Fused Adam', sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}", globals=locals(), description= f"Fused: {fused}", ).blocked_autorange(min_run_time=5) results.append(t) compare = benchmark.Compare(results) compare.trim_significant_figures() compare.colorize(rowwise=True) compare.print() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242 Approved by: https://github.com/kulinseth, https://github.com/janeyx99	2024-06-18 19:59:50 +00:00
Joona Havukainen	d9eaa224f2	Fixes #128429 : NaN in triu op on MPS (#128575 ) Fixes triu op when k > 0 and the lower triangle of the input tensor contains inf leading to NaNs in the computation through complement. Fixed by using select API instead. Fixes #128429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128575 Approved by: https://github.com/kulinseth	2024-06-18 03:44:42 +00:00
Nikita Shulga	9035fff2de	[BE] Do not test deprecated `torch.nn.utils.weight_norm` (#128727 ) Test `torch.nn.utils.parametrizations.weight_norm` instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/128727 Approved by: https://github.com/kit1980 ghstack dependencies: #128726	2024-06-14 19:14:44 +00:00
Nikita Shulga	27458cc097	[BE] Refactor repeated code in test_weight_norm (#128726 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128726 Approved by: https://github.com/kit1980	2024-06-14 19:14:44 +00:00
Tom Ritchford	edb45dce85	Add OpInfo entry for as_strided_copy (#127231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231 Approved by: https://github.com/lezcano	2024-06-13 13:58:47 +00:00
Nikita Shulga	0678742924	[MPS] Add Metal implementation of exp op (#128421 ) To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor) Reuse `test_exp1` to check that accuracy of `exp` ops is sometimes closer to CPU Fix bug in non-contiguous tensors handling Fixes https://github.com/pytorch/pytorch/issues/84936 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128421 Approved by: https://github.com/kulinseth ghstack dependencies: #128373, #128375	2024-06-13 06:53:17 +00:00
Kulin Seth	8df56afc20	Add support in Python API for the recommended max working set size. (#128289 ) Adds ways for users to request recommended max size for Metal on Mac. It plumbs through https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc Can be used like ``` max_memory = torch.mps.recommended_max_memory() print ("Recommended Max Memory : ", (max_memory/(102410241024)), "GB") ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289 Approved by: https://github.com/malfet	2024-06-12 16:03:57 +00:00
Tom Ritchford	2386045e4f	Add OpInfo entry for alias_copy (#127232 ) (#128142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142 Approved by: https://github.com/lezcano	2024-06-12 09:39:58 +00:00
Joona Havukainen	a5ba9b2858	Fix for addcdiv contiguous problem (#124442 ) Fixes issue number #118115 Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124442 Approved by: https://github.com/kulinseth	2024-06-06 16:09:18 +00:00
Huy Do	8992141dba	Restore MPS testing on MacOS 13 and m2 metal (#127853 ) The runners are ready now https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13, we want to keep some MacOS 13 runner for mps coverage until MacOS 15 is out. This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582. The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853 Approved by: https://github.com/malfet	2024-06-05 14:44:00 +00:00
PyTorch MergeBot	d1fad416a8	Revert "Add aten._unsafe_masked_index (#116491 )" This reverts commit `f03f8bc901`. Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))	2024-06-03 15:51:50 +00:00
Isuru Fernando	f03f8bc901	Add aten._unsafe_masked_index (#116491 ) To generate masked indexing operations that would generate masked loads in triton code Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-06-03 14:44:03 +00:00
Nikita Shulga	045309aa35	[MPS] Enable toch.mm and friends for complex dtypes (#127241 ) - Add `supportedFloatingOrComplexType` - Change dtype check to those - Extend low-precision fp32 list to complex types - Mark conv2d as supported now, as it was failing due to the tighter accuracy constrains than the same op for float32 dtype Fixes https://github.com/pytorch/pytorch/issues/127178 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127241 Approved by: https://github.com/janeyx99	2024-05-28 17:56:13 +00:00
Nikita Shulga	4ff9113e3d	[MPS] Add `_weight_int8pack_mm` tests (#127041 ) As well as extend the test to cover MV cases (where A matrix is 1xM) Limit int8 op testing to 32x32 matrix sizes for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041 Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales	2024-05-24 16:08:06 +00:00
jhavukainen	6a539e80dd	Update descriptor fields to resolve fft precision issue (#125328 ) Fixes #124096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125328 Approved by: https://github.com/kulinseth, https://github.com/malfet	2024-05-22 21:48:49 +00:00
jhavukainen	d28868c7e8	Change skipIfs to xfails in test_mps.py for test_isin (#125412 ) Follow-up to #124896 to move the added test to use expectedFailure instead of skip. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125412 Approved by: https://github.com/kulinseth	2024-05-20 20:23:53 +00:00
Nikita Shulga	b8a706a321	[EZ][BE] Use `untyped_storage` in tests (#125838 ) Get's rid of the following warning: ``` /Users/shenke/workspace/pytorch/test/test_mps.py:9229: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() if base.storage().data_ptr() != other.storage().data_ptr(): ``` (noticed while looking at https://github.com/pytorch/pytorch/issues/96153#issuecomment-2101876484 ) Respective change to view ops was landed back in 2022, see https://github.com/pytorch/pytorch/pull/91414 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125838 Approved by: https://github.com/albanD	2024-05-09 14:04:21 +00:00
Nikita Shulga	4e29e80bf0	Run MPS tests on MacOS Sonoma (#125801 ) Those ones are running 14.4.1, so I wonder if they actually pass CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/125801 Approved by: https://github.com/kit1980, https://github.com/huydhn	2024-05-09 13:43:12 +00:00
Denis Vieriu	58e045d03c	[MPS] Fix strided ELU op (#125692 ) Fixes https://github.com/pytorch/pytorch/issues/124834 Summary of changes: In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps, one to allocate a contiguous buffer and the second one to scatter the result back to the original ouput. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125692 Approved by: https://github.com/kulinseth	2024-05-08 01:34:40 +00:00
Denis Vieriu	ba27548679	[MPS] Remove in place views (causes too many crashes) (#124895 ) Fixes https://github.com/pytorch/pytorch/issues/96153 Remove in place views as they are a general cause for many crashes. Proper fix to handle views without copies will come in a different PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124895 Approved by: https://github.com/kulinseth	2024-05-08 01:00:37 +00:00
Denis Vieriu	3fb53bb6a7	[MPS] Fix strided mse_loss (#125696 ) Fixes https://github.com/pytorch/pytorch/issues/124621 Summary of changes: - In case of non-contiguous input, the output would be non-contiguous too. At the moment it's not supported to save the result to a non-contiguous buffer, thus we need two steps, one to allocate a contiguous buffer and the second one to scatter the result back to the original ouput. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125696 Approved by: https://github.com/kulinseth	2024-05-08 00:52:26 +00:00
Nikita Shulga	0fd1fc17c3	[MPS] Fix `abs` for complex types (#125662 ) By calling `realPartOfTensor:` if input type is complex on Sonoma and fall back to `at::view_as_real` trick on Ventura. Split `unary_op` template into `unary_op` and `unary_op_noresize`, which skips resize and empty checks Marked `abs`, `isclose` and `nn.functional.softsign` OpInfo tests as supported by complex types Fixes https://github.com/pytorch/pytorch/issues/125135 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125662 Approved by: https://github.com/kulinseth	2024-05-07 22:15:20 +00:00
Nikita Shulga	30610251ec	[MPS] And naive quantized intmm and `.gputrace` capture hooks (#125163 ) - Implement a very straightforward Metal copy of CPU int4mm kernel - Implement int8mm kernel by constructing a graph consisting of upcast, transpose and mm - Add `isCapturing`, `isCaptureEnabled`, `startCapture` and `stopCapture` methods to `MPSProfile` which can be used to help one debug/profile Metal kernels by wrapping the calls with the following ```cpp if (getMPSProfiler().profiler.isCaptureEnabled()) { getMPSProfiler().startCapture(__func__, mpsStream); } ... if (getMPSProfiler().isCapturing()) { getMPSProfiler().stopCapture(mpsStream); } ``` that, if invoked with `MTL_CAPTURE_ENABLED` environment variable set to one, will produce .gputrace files, in the current working directory, which can later be loaded and used to debug or profiler the kernel <img width="1093" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/a2bf27e8-df8a-442c-a525-1df67b8a376a"> - Added `test_int4mm` to TestLinalgMPS, which is mostly copy-n-paste of the test from `test_linalg` TODOs: - Add weight pack - Perf-tune both kernels Pull Request resolved: https://github.com/pytorch/pytorch/pull/125163 Approved by: https://github.com/mikekgfb	2024-05-03 15:20:39 +00:00
Denis Vieriu	a40d6df448	[MPS] Native nonzero implementation (#125355 ) Fixes https://github.com/pytorch/pytorch/issues/124850 Replace previous MPSGraph nonzero construction with native nonzero op. For older OSes, fallback to CPU (previous implementation was not reliable and was comparable to CPU in speed). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125355 Approved by: https://github.com/kulinseth	2024-05-03 03:50:58 +00:00
Roy Hvaara	e15da7856c	[MPS] Fix overflow in cumsum when dtype is bool (#125318 ) `cumsum` and `cumprod` was (is?) buggy for MPS: `c8d2a55273/aten/src/ATen/native/mps/operations/UnaryOps.mm (L435-L436)` A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types. It turns out this issue also affects boolean types: ```python import torch print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1]) # tensor(-128, device='mps:0') ``` In this PR I'm adding logic to also cast bool dtypes to int32 prior to `cumsum` and `cumprod`, although output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions. Fixes #96614 #106112 #109166 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125318 Approved by: https://github.com/malfet	2024-05-03 01:19:24 +00:00
Joona Havukainen	c451d108da	Implemented isin_Tensor_Tensor_out for MPS backend (#124896 ) Addresses issue #124518, adds isin_Tensor_Tensor_out. Tests added to test_mps.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124896 Approved by: https://github.com/malfet, https://github.com/kulinseth	2024-05-01 23:14:05 +00:00
Nikita Shulga	5944a53555	[MPS] Fix nextafter for negative values (#125029 ) By changing the logic to on older MacOS: ```cpp bits += ((input > 0) ^ (input > other)) ? 1 : -1; ``` And use native `nextafter` on MacOS Sonoma (i.e. if Metal 3.1 is available) TODO: - Add tests for infs and denorms Fixes https://github.com/pytorch/pytorch/issues/124985 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125029 Approved by: https://github.com/Skylion007	2024-04-27 02:58:05 +00:00
Nikita Shulga	db3a2d751c	[MPS][BE] Error-check linear (#124952 ) Validate that all arguments are on MPS devices and dtypes are expected Fixes cryptic messages like ``` % python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32), torch.rand((32, 32), device='mps')))" RuntimeError: Placeholder storage has not been allocated on MPS device! ``` And hard crashes like ``` % python3 -c "import torch;print(torch.nn.functional.linear(torch.rand(32, 32, device='mps'), torch.randint(-10, 10, (32, 32), dtype=torch.int8, device='mps')))" ``` Fixes https://github.com/pytorch/pytorch/issues/123995 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124952 Approved by: https://github.com/Skylion007	2024-04-25 23:25:20 +00:00
Nikita Shulga	abf3f90781	[MPS] Fix large copy (#124635 ) By slicing `copyFromBuffer:sourceOffset:toBuffer:destinationOffset:size:` into 2Gb chunks Add regression test, but limit it to machines with 12Gb of RAM or more, and MacOS 14+, as on MacOS 13 attempt to alloc 4Gb tensor fails with: ``` /AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32' ``` Fixes https://github.com/pytorch/pytorch/issues/124335 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124635 Approved by: https://github.com/kulinseth	2024-04-22 23:43:11 +00:00
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549 ) Update ruff to 0.4.1 . This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
Joël Tang	a6a3f2e06b	[MPS] Fixes GELU, LeakyRELU and MISH on non-contiguous tensors (#123049 ) Fixes GELU, LeakyRELU and MISH activation functions on non-contiguous tensors (for instance, when a transpose operation was applied on the tensors prior to the MPS operator), forward and backward passes. I also extended tests on the 3 activation functions to check: full-precision and half-precision, contiguous and non-contiguous, and several dims of tensors: scalars, 1D, empty, 2D, > 3D. I had issues with Mish and GELU activations when asserting the gradients vs. CPU with sum() on some cases, so I reverted to the previous setup by setting a gradient parameter on .backwards(). This PR also fixes an issue with LeakyRELU on empty tensors. Fixes #98212 huggingface/transformers#22468 huggingface/transformers#19353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049 Approved by: https://github.com/kulinseth	2024-04-21 00:12:32 +00:00
Nikita Shulga	5677128cb8	[MPS] Fix crash with binary_cross_entropy is invoked for half dtypes (#124258 ) By creating constants using input tensors dtype One line reproducer: ``` python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))" ``` Before the change ``` loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible LLVM ERROR: Failed to infer result type(s). ``` After ``` tensor(-33.7812, device='mps:0', dtype=torch.float16) ``` Fixes https://github.com/pytorch/pytorch/issues/124252 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258 Approved by: https://github.com/kulinseth	2024-04-18 15:21:01 +00:00
xinan.lin	6fcbeb3489	[ATen] Add CPU fp16 support for nll_loss and cross_entropy_loss (#123256 ) Add CPU FP16 support for nll_loss and cross_entropy_loss. Resolve issue #123328. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123256 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet	2024-04-18 11:44:38 +00:00
Pearu Peterson	d2b0c0a34e	Fix index_reduce sampler filter when op_info.variant_test_name is specified (#123375 ) As in the title: `index_reduce` sample must correspond to reduction type specified by `variant_test_name`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123375 Approved by: https://github.com/zou3519, https://github.com/peterbell10	2024-04-17 15:31:28 +00:00
FFFrog	acc466751b	Add bfloat16 support to binary_cross_entropy for CPU (#123823 ) Fixes #123715 As the title stated. But, maybe we should pay attention to this https://github.com/pytorch/pytorch/pull/33206, which removed the half support for cpu about 4 years ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123823 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-04-17 09:44:07 +00:00
Joona Havukainen	05289a278c	Fix for MPS regression in #122016 and #123178 (#123234 ) Fixes #122016 and #123178. This regression is related to an OS side change that requires a slight adjustment from us on PyTorch side to restore the previous behavior. Additionally we cleared out pre-MacOS13 related workarounds. Before the fix on MacOS 14.4: ``` python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)" tensor([0., 3., 3.], device='mps:0') ``` After the fix: ``` python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)" tensor([0., 1., 3.], device='mps:0') ``` This also fixes complex number initialization and as such makes `nn.functional.rms_norm` pass on MacOS-14+ Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234 Approved by: https://github.com/malfet, https://github.com/kulinseth	2024-04-03 23:00:57 +00:00
PyTorch MergeBot	feabb645a7	Revert "Handle transposes in second batch of matrices in bmm (#122194 )" This reverts commit `251ad1232b`. Reverted https://github.com/pytorch/pytorch/pull/122194 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/122194#issuecomment-2032806360))	2024-04-02 18:49:28 +00:00
Kulin Seth	251ad1232b	Handle transposes in second batch of matrices in bmm (#122194 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/122194 Approved by: https://github.com/DenisVieriu97	2024-04-02 17:48:35 +00:00
Nikita Shulga	4c70ab26ef	[MPS] Enable `index_select` for complex types (#122590 ) Surprisingly, as of MacOS-14.14 MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them by using `at::view_as_real` trick Fixes https://github.com/pytorch/pytorch/issues/122427 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590 Approved by: https://github.com/Skylion007	2024-03-25 16:57:35 +00:00
andrewor14	773ae817f7	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack is planned to be removed after the 6 month BC window. Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279) Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-18 21:01:30 +00:00
Roger Lam	40acc84aaf	Fix torch.clamp in MPS to handle NaN correctly (#121381 ) Fixes #120899 So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers. https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381 Approved by: https://github.com/malfet	2024-03-18 19:38:15 +00:00
PyTorch MergeBot	0cc60a05da	Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381 )" This reverts commit `ca80d07ac7`. Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR. Please take a look a reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))	2024-03-15 23:53:05 +00:00
Roger Lam	ca80d07ac7	Fix torch.clamp in MPS to handle NaN correctly (#121381 ) Fixes #120899 So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers. https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381 Approved by: https://github.com/malfet	2024-03-15 21:54:50 +00:00
Nikita Shulga	5498804ec2	[MPS] Fix naive matmul for BFloat16 (#121731 ) Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate Fixes https://github.com/pytorch/pytorch/issues/121583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731 Approved by: https://github.com/albanD	2024-03-13 14:34:03 +00:00
Nikita Shulga	07330ff7b6	[MPS][BE] Define `_compute_tolerances` (#121754 ) Right now logic is mostly duplicated between `test_output_match` and `test_output_gradient_match` So move tolerance definition logic into a shared `_compute_tolerances` function and only keep differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions. Also, increase tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older and remove GRAD xfaillist for those Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754 Approved by: https://github.com/albanD	2024-03-13 04:08:06 +00:00
PyTorch MergeBot	fd0dbcd891	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `7b4f70eda5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))	2024-03-11 22:22:41 +00:00
Boyuan Feng	35d3adb4b0	Add ATen Op _chunk_cat and _chunk_cat.out (#121081 ) # Motivation In backward of per-parameter sharding FSDP, each rank performs reduce scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0-th dimension and concatenate all slices along the 1-th dimension. Gradient tensors will be padded before concatenation when tensor.size(0) % world_size != 0. ### Example 1 Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2): Input tensors: ``` AAAA BBB CC AAAA BBB BBB ``` Reduce-scatter-copy-in Output: ``` AAAABBBCC AAAABBB00 0000BBB00 ``` ### Example 2 Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2): Input tensors: ``` AAAA BBB CC DD AAAA BBB 00 DD BBB DD 000 DD ``` Reduce-scatter-copy-in first pad: ``` AAAA BBB CC DD AAAA BBB 00 DD BBB DD 000 DD ``` Then chunk and cat along dim as the output: ``` AAAABBBBBBCCDDDD AAAABBB00000DDDD ``` The performance of reduce-scatter-copy-in is critical to per-parameter sharding FSDP. However, reduce-scatter-copy-in via composing existing ATen ops involves `cat` and irregular `pad`, leading redundant data copies and unsatisfactory performance. # PR We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`: ``` _chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor ``` This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and basic implementation composing existing ATen ops. In the next PR, we will add the CUDA implementation. Comparing with baselines of composing existing ATen ops, `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark. ## Requirements on input 1. If input tensors have different ndims, dim should be non-negative and be less than the ndims of every input tensors. If all input tensors have the same ndims, we support both negative and non-negative dim. 2. For wrapped_dim, all tensors should have the same size for 0,...,wrapped_dim-1 dimensions. No requirements for (wrapped_dim, ...)-th dimension. 3. Expect positive num_chunks 4. Expect non-empty input tensor list and each input tensor should have at least 1 element Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081 Approved by: https://github.com/albanD	2024-03-08 21:48:12 +00:00
Nikita Shulga	9b03a06288	[BE] [MPS] Fix `out` resize logic in `torch.where` (#121476 ) By deleting `where_mps` and registering MPS dispatch for `where_kernel`. As result of this change resizing and type-checking logic is shared between MPS, CPU and CUDA backends. Add test_case to `TestMPS.test_where` (that should eventually be removed, when `out` OpInfo testing is enabled for MPS Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476 Approved by: https://github.com/albanD, https://github.com/Skylion007 ghstack dependencies: #121473, #121494	2024-03-08 18:59:37 +00:00
andrewor14	7b4f70eda5	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack is planned to be removed after the 6 month BC window. Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-08 15:07:15 +00:00
PyTorch MergeBot	b529c19bdf	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `5680f565d5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))	2024-03-06 17:10:01 +00:00
Tugsbayasgalan Manlaibaatar	5680f565d5	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack is planned to be removed after the 6 month BC window. Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-06 04:50:46 +00:00
Kai	c59b14163b	Implement aten::upsample_linear1d on mps (#115031 ) Related to #77764 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031 Approved by: https://github.com/malfet	2024-02-26 23:04:52 +00:00
Nikita Shulga	53bfae2c06	[MPS] Add `torch.fft.` support (#119670 ) Increase tolerance for `ftt` ops, that warrants a further investigation as it grows larger with larger matrix dimensions (see https://github.com/pytorch/pytorch/issues/120237 ) When compiling on MacOS13, implement `+[FakeMPSGraphFFTDescriptor descriptor]` as a redispatch to a real thing. Fixes https://github.com/pytorch/pytorch/issues/78044 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119670 Approved by: https://github.com/kulinseth, https://github.com/albanD	2024-02-20 18:23:06 +00:00
Nikita Shulga	eb9a3383c2	[MPS] Add naive std_mean implementation (#119777 ) By just calling `std_mps` and `mean` in sequence Move `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to a Python, which one can validate by running the following script: ```python from timeit import default_timer import torch from torch.utils.benchmark import Measurement, Timer def bench_var_mean( m, n, k, dtype = torch.float32, device:str = "cpu", ) -> Measurement: setup = f""" x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}") """ t = Timer( stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer ) return t.blocked_autorange() for x in [100, 1000]: rc = bench_var_mean(1000, x, 100, device="mps") print(f"{x:5} : {rc.mean*1e6:.2f} usec") ``` which before the change reports 681 and 1268 usec and after 668 and 684 (which probably means that GPU is not saturated, but overhead from switching between native and interpretable runtimes are shorter. Fixes https://github.com/pytorch/pytorch/issues/119663 TODOs: - Refactor the codebase and implement proper composite function (that must be faster) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777 Approved by: https://github.com/albanD	2024-02-13 21:51:29 +00:00
Nikita Shulga	15ef52a015	[MPS] Enable `conj` and `conj_physical` (#119669 ) Former is only on MacOS 14+, but at least on older MacOSes it would raise an exception rather than returning non-conjugated tensor Preliminary step for enabling FFT ops (without it `ifft` would never work) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669 Approved by: https://github.com/albanD ghstack dependencies: #119681	2024-02-13 02:27:51 +00:00
Nikita Shulga	8d8fb9783c	[MPS][EZ] Fix cfloat->chalf conversion on MacOS13 (#119681 ) By using `view_as_real` when type casting between two complex types Pull Request resolved: https://github.com/pytorch/pytorch/pull/119681 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-02-12 19:09:10 +00:00
Pearu Peterson	2c91e13afc	Add lowerings to special functions (#119187 ) As in the title. In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187 Approved by: https://github.com/peterbell10	2024-02-11 16:35:40 +00:00
Nikita Shulga	4ee8aac432	[MPS] Enable `bfloat16` support on MacOS 14 (#119641 ) Per [MPSDataType](https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype/mpsdatatypebfloat16?changes=_11&language=objc) documentation bfloat16 are supported in MacOS Sonoma or later Added missing `MPSDataTypeBFloat16` and `MTLLanguageVersion3_1` enums to `MPSGraphSonomaOps.h` TODO: Enable more testing Pull Request resolved: https://github.com/pytorch/pytorch/pull/119641 Approved by: https://github.com/Skylion007	2024-02-11 16:25:29 +00:00
Nikita Shulga	1d61011c11	[MPS] Add support for complex scalars (#119318 ) - Switch to native complex support if running on MacOS Monterey or newer for binary ops. - Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS yet to support double precision types, so downcast them to floats - Also add `cf`(for complex float) and `ch`(for complex half) to MPSScalar value union - Fix complex scalars to view promotion, by introducing `legacy_complex_as_view` helper function, that non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view. - Add `test_tensor_scalar_binops` Fixes https://github.com/pytorch/pytorch/issues/119088 Test plan: CI (have quite a lot of tests, see new unexpected successes) + `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318 Approved by: https://github.com/albanD	2024-02-08 18:10:59 +00:00
watarungurunnn	d444a3b443	[MPS] fix float32 error on mps, in linalg.matrix_rank and linalg.pinv (#114771 ) Fixes #114285 (However, still have NotImplementedError ```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771 Approved by: https://github.com/lezcano	2024-02-05 15:36:55 +00:00
lancerts	26a2743162	Fix placeholder tensor is empty for relu in mps (#118965 ) Fixes #118845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118965 Approved by: https://github.com/malfet	2024-02-03 23:50:35 +00:00
Nikita Shulga	24dd9f42ce	[MPS] Fix `use_metal_mm` condition (#118830 ) One should not only look at stride size, but on dimensions as well, as strides of `torch.rand(65536, 1)` are `(1, 1)` Extend test to account for this situation Pull Request resolved: https://github.com/pytorch/pytorch/pull/118830 Approved by: https://github.com/huydhn	2024-02-01 17:53:42 +00:00
Yifu Wang	a1280f0cc6	Add an OpInfo test for split_with_sizes_copy (#118512 ) Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, adding the `OpInfo` test in a separate PR to establish a healthy baseline. Changes made: - Registered a batching rule for `split_with_sizes_copy`. - Registered a decomposition for `split_with_sizes_copy`. - Registered a DTensor prop rule for `split_with_sizes_copy`. - Added required dtype and device checks to the composite impl. - Added output resize to the composite impl. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512 Approved by: https://github.com/albanD	2024-02-01 07:09:27 +00:00
Sun, Jiayi	2dd4a254a0	add Half support for interpolate operators on CPU (#105648 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105648 Approved by: https://github.com/jgong5, https://github.com/cpuhrsch	2024-01-18 09:07:16 +00:00
Nikita Shulga	1872834247	[MPS] Fix `torch.mm` correctness for large matrices (#117549 ) Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K Solve it by providing a very naive matrix multiplication metal shader and call it if stride size is greater than 32768 elements, as slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well, if tiling is done as follows: ```objc NSMutableArray<MPSGraphTensor> rows = [NSMutableArray new]; for (int64_t i = 0; i < M; i += tile_size) { const auto i_end = std::min(i + tile_size, M); NSMutableArray<MPSGraphTensor> row_chunks = [NSMutableArray new]; for (int64_t j = 0; j < K; j += tile_size) { const auto j_end = std::min(j + tile_size, K); MPSGraphTensor* tile = nil; for (int64_t k = 0; k < N; k += tile_size) { const auto k_end = std::min(k + tile_size, N); auto selfChunk = [graph sliceTensor:selfTensor starts:@[ @(i), @(k) ] ends:@[ @(i_end), @(k_end) ] strides:@[ @(1), @(1) ] name:nil]; auto otherChunk = [graph sliceTensor:otherTensor starts:@[ @(k), @(j) ] ends:@[ @(k_end), @(j_end) ] strides:@[ @(1), @(1) ] name:nil]; auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil]; tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM; } [row_chunks addObject:tile]; } auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject; [rows addObject:row]; } return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject; ``` One can always use metal MM by defining `PYTORCH_MPS_PREFER_METAL` environment variable Fixes https://github.com/pytorch/pytorch/issues/116769 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549 Approved by: https://github.com/kulinseth	2024-01-17 01:33:08 +00:00

... 2 3 4 5 6 ...

743 Commits