This accomplishes following:
- Fixes correctness problem with large integer types (though probably makes it slower, but this could not be avoided if one wants to compute accurate answer)
- Makes op faster for floating point types (as Metal kernel invocation is faster than creating MPSGraph)
- Eliminates need for several correctness workarounds
Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
Noticed some of these ops were contributing to a big chunk of the runtime for OpenLLama as well as a few other benchmarks
At the op level, moving to a TensorIterator-based Metal kernel gives a 20x speedup. Will migrate the inverse trigonometric functions & log ops in a follow-up PR, as this one is already a bit large
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152876
Approved by: https://github.com/malfet
By implementing `div_floor` and `div_trunc` . Do not mark `div_trunc` as OPMATH, to align following output with CPU(if division is performed in fp32, than result will be truncated to 25
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24., 3.]])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
As well as `.add`/`.sub` with complex alpha
Before this change `python3 -c "import torch;print(torch.rand(10, device='mps', dtype=torch.complex64).add(torch.rand(10, device='mps', dtype=torch.complex64), alpha=.5j))"` used to fail with
```
RuntimeError: value cannot be converted to type double without overflow
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152479
Approved by: https://github.com/dcci
ghstack dependencies: #152443, #152466
By instantiating it implicitly, otherwise attempts to run something like
```
% python3 -c "import torch; print(torch.special.entr(torch.testing.make_tensor(10, dtype=torch.bool, device='mps')))"
```
will fail with
```
Failed to created pipeline state object, error: Error Domain=AGXMetalG14X Code=3 "Compiler encountered an internal error"
```
Similar in spirit to https://github.com/pytorch/pytorch/pull/149123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152204
Approved by: https://github.com/dcci
By reusing `c10/metal/atomic.h`
This also fixes `GPUTests.test_index_put_fallback[12]_mps` that is unrolled by inductor, so no need for dedicated atomic_add support
TODOs:
- Get rid of indexing kernel and compute it directly when kernel is run
- Simulate atomic_add for int64 types as series of int32 atomic-add-and-fetch
- Setup tolerances correctly to pass float16/bfloat16 tests (as CPU always takes sequential strategy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869
Approved by: https://github.com/Skylion007, https://github.com/dcci