A test was failing in inductor (`test_pointwise_zeta`), and I realized the operation was also missing from eager. Implemented it for both, leveraging the same kernel. Happy to split this into two PRs (one for eager, one for inductor) if folks prefer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146465
Approved by: https://github.com/malfet
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.
Interestingly, this is a small compile-time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%
WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%
WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
- Add a `threadgroup_sum` template to `c10/metal/reduction_utils.h` that so far uses a barrier to compute the reduction (see the sketch after the TODO list below)
TODOs:
- Implement efficient reduction using cooperative functions such as `simd_shuffle_down`
- Figure out how to merge several sum reductions together
- Implement `reduction_store` that will only write results from the first thread
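For context, here is a minimal sketch of the barrier-based approach, assuming a power-of-two threadgroup size and a caller-provided scratch buffer; the name and signature are illustrative, not the actual `reduction_utils.h` code:
```metal
#include <metal_stdlib>
using namespace metal;

// Tree reduction over threadgroup memory: each step halves the number of
// active threads, and a barrier makes the partial sums visible to all.
template <typename T>
T threadgroup_sum_sketch(threadgroup T* scratch, T val, uint tid, uint size) {
    scratch[tid] = val;
    threadgroup_barrier(mem_flags::mem_threadgroup);
    for (uint stride = size / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            scratch[tid] += scratch[tid + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    return scratch[0]; // every thread observes the final sum
}
```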
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146380
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370
Using Philox4 as the PRNG.
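For reference, a compact sketch of the Philox4x32-10 round function with the standard constants from the Random123 paper; this is illustrative, not the shipped Metal kernel:
```cpp
#include <array>
#include <cstdint>

// One Philox4x32-10 draw: ten rounds of multiply/xor over a 4-word counter
// and a 2-word key, yielding four independent 32-bit outputs per counter.
std::array<uint32_t, 4> philox4x32(std::array<uint32_t, 4> ctr,
                                   std::array<uint32_t, 2> key) {
    constexpr uint32_t M0 = 0xD2511F53, M1 = 0xCD9E8D57; // round multipliers
    constexpr uint32_t W0 = 0x9E3779B9, W1 = 0xBB67AE85; // Weyl key increments
    for (int i = 0; i < 10; ++i) {
        uint64_t p0 = uint64_t(M0) * ctr[0];
        uint64_t p1 = uint64_t(M1) * ctr[2];
        ctr = {uint32_t(p1 >> 32) ^ ctr[1] ^ key[0], uint32_t(p1),
               uint32_t(p0 >> 32) ^ ctr[3] ^ key[1], uint32_t(p0)};
        key = {key[0] + W0, key[1] + W1};
    }
    return ctr;
}
```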
Test plan (other than CI):
Run
```python
import torch

def foo(x):
    return x * torch.randn_like(x)

foo_c = torch.compile(foo)
x = torch.ones(100, 100, device="mps")
y = foo_c(x)
print(y.mean().item(), y.std().item())
for i in range(25):
    print(y[i].mean(), y[i].std())
```
And observe that the printed means are close to 0 and the standard deviations close to 1.
TODO: Better `randint` algorithm for large ranges
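One standard approach for that TODO is rejection sampling, sketched below; `next_random_u64` is a hypothetical stand-in for the raw PRNG output (e.g. bits from Philox), not an existing function:
```cpp
#include <cstdint>

uint64_t next_random_u64(); // assumed: raw 64-bit PRNG output

// Unbiased integer in [0, range) for large ranges: reject draws that fall
// into the final partial cycle so every residue class is equally likely.
uint64_t randint_unbiased(uint64_t range) {
    const uint64_t limit = UINT64_MAX - (UINT64_MAX % range);
    uint64_t draw;
    do {
        draw = next_random_u64();
    } while (draw >= limit);
    return draw % range;
}
```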
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705
Approved by: https://github.com/dcci, https://github.com/jansel
It may later be reused from the eager op as well.
Also, I didn't know that Metal already has type_traits.
And use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`, as it is defined as a function equivalent to `a != a || b != b`, but I suspect it might have a better native implementation on specific architectures.
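For example, in a hypothetical NaN-propagating maximum (illustrative only; `a + b` is used here to produce a NaN whenever either operand is one):
```metal
#include <metal_stdlib>

// If either operand is NaN the pair is unordered, so return a NaN
// (a + b is NaN whenever either operand is); otherwise the larger value.
float nan_propagating_max(float a, float b) {
    return ::metal::isunordered(a, b) ? (a + b) : ::metal::max(a, b);
}
```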
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157
Approved by: https://github.com/dcci
`metal::isnan` is only defined for floats, so provide a generic wrapper
that is false for integral types
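A minimal sketch of such a wrapper, assuming Metal's `<metal_type_traits>` supplies `enable_if_t` and `is_floating_point_v` like their C++ counterparts (names are illustrative, not the exact PR code):
```metal
#include <metal_stdlib>

// Floating-point overload defers to the builtin.
template <typename T,
          ::metal::enable_if_t<::metal::is_floating_point_v<T>, bool> = true>
bool generic_isnan(T x) {
    return ::metal::isnan(x);
}

// Everything else (i.e. integral types) is never NaN.
template <typename T,
          ::metal::enable_if_t<!::metal::is_floating_point_v<T>, bool> = true>
bool generic_isnan(T) {
    return false;
}
```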
TODO: Figure out why type propagation is not working (or should it?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144665
Approved by: https://github.com/dcci
Now the error message looks as follows:
```
% python ../test/inductor/test_torchinductor.py -v -k test_cat_unbacked_2d_mps
test_cat_unbacked_2d_mps (__main__.GPUTests) ... inline_call []
stats [('calls_captured', 6)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('not_ok', 1)]
ERROR
======================================================================
ERROR: test_cat_unbacked_2d_mps (__main__.GPUTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3126, in wrapper
method(*args, **kwargs)
File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 12254, in new_test
return value(self)
File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 5885, in test_cat_unbacked_2d
self.common(
File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 620, in check_model_gpu
check_model(
File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 461, in check_model
actual = run(*example_inputs, **kwargs)
File "/Users/malfet/git/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1149, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1064, in codegen_and_compile
compiled_fn = graph.compile_to_module().call
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 1977, in compile_to_module
return self._compile_to_module()
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 2018, in _compile_to_module
mod = PyCodeCache.load_by_key_path(
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/codecache.py", line 2768, in load_by_key_path
mod = _reload_python_module(key, path)
File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 51, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 40, in <module>
File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 32, in _compile_mps_shader
torch._inductor.exc.InductorError: SyntaxError: failed to compile
kernel void generated_kernel(
device float* out_ptr0,
constant float* in_ptr0,
uint xindex [[thread_position_in_grid]]
) {
long x1 = (xindex) / (3);
auto tmp0 = x1;
auto tmp1 = static_cast<long>(tmp0);
auto tmp2 = 0;
auto tmp3 = tmp1 >= tmp2;
auto tmp4 = 2;
auto tmp5 = tmp1 < tmp4;
long x0 = (xindex) % (3);
auto tmp6 = in_ptr0[x0 + 3*(x1)];
auto tmp7 = tmp5 ? tmp6 : 0.0;
auto tmp8 = tmp1 >= tmp4;
auto tmp9 = 2 + ks0;
auto tmp10 = static_cast<long>(tmp9);
auto tmp11 = tmp1 < tmp10;
auto tmp12 = 1.0;
auto tmp13 = tmp8 ? tmp12 : 0.0;
auto tmp14 = tmp5 ? tmp7 : tmp13;
long x2 = xindex;
out_ptr0[x2] = static_cast<float>(tmp14);
}
with program_source:18:25: error: use of undeclared identifier 'ks0'
auto tmp9 = 2 + ks0;
^
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
To execute this test, run the following from the base repo dir:
python test/inductor/test_torchinductor.py GPUTests.test_cat_unbacked_2d_mps
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.472s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144649
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #144647, #144648
Just pass them (the dynamic shape sizevars, such as `ks0` in the error above) as kernel arguments.
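Illustratively (a hypothetical kernel, not the exact generated code), a dynamic size like `ks0` now arrives as an explicit argument instead of being referenced as an undeclared identifier:
```metal
#include <metal_stdlib>

// ks0 is bound from the host at dispatch time rather than appearing
// undeclared in the kernel body, as in the error above.
kernel void generated_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    constant long& ks0,
    uint xindex [[thread_position_in_grid]]
) {
    out_ptr0[xindex] = in_ptr0[xindex % (2 + ks0)];
}
```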
After this change, `pytest test/inductor/test_torchinductor.py -v -k _mps` reports 330 failed, 429 passed, compared to 335 failed, 424 passed before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144662
Approved by: https://github.com/jansel