Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(
        grid_0, grid_1, grid_2, stream, function, metadata, None,
        launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel,
    )
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
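The grid expression in the launcher above, `(xnumel + 1023) >> 10`, is a ceiling division by a block size of 1024 written as a shift. The following is only a rough sketch of what specializing the grid per block size could look like; `ceil_div` and `make_grid_fn` are hypothetical helpers for illustration, not Inductor's actual code:

```py
def ceil_div(numel: int, block: int) -> int:
    """Smallest number of blocks of size `block` that cover `numel` elements."""
    return (numel + block - 1) // block


def make_grid_fn(xblock: int):
    """Hypothetical helper: build a grid function specialized for one block size."""
    if xblock > 0 and xblock & (xblock - 1) == 0:
        # Power-of-two block size: the division can be emitted as a shift.
        shift = xblock.bit_length() - 1
        return lambda xnumel: ((xnumel + xblock - 1) >> shift, 1, 1)
    # General case: plain ceiling division.
    return lambda xnumel: (ceil_div(xnumel, xblock), 1, 1)


# Specialized for XBLOCK=1024, matching `(xnumel + 1023) >> 10` above.
grid_1024 = make_grid_fn(1024)
```

Because the block size is fixed when the function is built, each call only evaluates a small arithmetic expression instead of going through a generic grid closure.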
Differential Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(
        grid_0, grid_1, grid_2, stream, function, metadata, None,
        launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel,
    )
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Note the attached diff contains some minor fbcode-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
Following triton #4916, the generated cubin expects a `global_scratch` argument to support on-device TMA. We believe this is the source of many of the "invalid argument" failures on AOTI/cpp_wrapper tests. AFAIK, we don't use on-device TMA in Inductor as of now, so it should be safe to pass a nullptr for the scratch space.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148051
Approved by: https://github.com/YUNQIUGUO
Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and the associated block ptr that would be created for `op0` in isolation are no longer needed.
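As a toy illustration of deferring a line until it is known whether its buffer survives fusion, the sketch below uses a hypothetical `DeferredLineSketch` class and `emit` function. It is not Inductor's implementation, and the kernel lines are plain strings that are never executed:

```py
class DeferredLineSketch:
    """Toy stand-in: a code line that is emitted only if its buffer still exists."""

    def __init__(self, buffer_name, line):
        self.buffer_name = buffer_name
        self.line = line

    def resolve(self, removed_buffers):
        # Drop the line when the buffer it belongs to has been removed.
        return None if self.buffer_name in removed_buffers else self.line


def emit(lines, removed_buffers):
    """Resolve deferred lines at the end, once removed buffers are known."""
    out = []
    for line in lines:
        if isinstance(line, DeferredLineSketch):
            line = line.resolve(removed_buffers)
        if line is not None:
            out.append(line)
    return out


body = [
    DeferredLineSketch("buf0", "tl.store(block_ptr0, tmp0)"),
    # The advancement is deferred on the same buffer as the store.
    DeferredLineSketch("buf0", "block_ptr0 = tl.advance(block_ptr0, [RBLOCK])"),
    "tmp1 = tmp0 * 2",
    DeferredLineSketch("buf1", "tl.store(block_ptr1, tmp1)"),
]
```

With `buf0` removed, `emit(body, {"buf0"})` drops both the store and the advancement for `block_ptr0`, so no dangling advancement is left behind.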
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/146975. When creating the inner name of an `InplacedBuffer`, we only counted the number of unique `InplacedBuffer` or `RemovedArg` values, so the name could conflict. For example, as reported in the issue:
```
---- make inplace create, input_name is: buf22; output_name is: buf27; buf.inner_name is: in_out_ptr2
dict_values([
InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']),
InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']),
InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26']),
InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26'])])
---- make inplace create, input_name is: buf0; output_name is: buf3; buf.inner_name is: in_out_ptr2
dict_values([
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']),
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33'])
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']),
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33'])
])
```
- The first time `in_out_ptr2` is created, there are 2 unique `InplacedBuffer`s.
- The second time `in_out_ptr2` is created, there is 1 `RemovedArg` and 1 unique `InplacedBuffer`.
These are 2 different `InplacedBuffer`s with the same name, `in_out_ptr2`. This PR fixes the regression by counting the number of `RemovedArg`s.
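The collision can be reproduced with a simplified model. In this sketch the classes are toy stand-ins for the ones in `torch._inductor.codegen.common`, the two naming functions are hypothetical, and the exact counting rule in the real fix may differ:

```py
from dataclasses import dataclass, field


class RemovedArg:
    """Toy stand-in for the sentinel that marks a removed kernel argument."""


REMOVED = RemovedArg()


@dataclass(eq=False)
class InplacedBuffer:
    inner_name: str
    other_names: list = field(default_factory=list)


def unique_count(values):
    # Count distinct objects; the shared REMOVED sentinel counts only once.
    return len({id(v) for v in values})


def name_by_unique_values(inplace_buffers):
    """Old behavior (simplified): index by the number of unique dict values."""
    return f"in_out_ptr{unique_count(inplace_buffers.values())}"


def name_counting_removed(inplace_buffers):
    """Sketch of the fix: count every RemovedArg entry, not just the sentinel once."""
    values = list(inplace_buffers.values())
    alive = [v for v in values if not isinstance(v, RemovedArg)]
    removed = [v for v in values if isinstance(v, RemovedArg)]
    return f"in_out_ptr{unique_count(alive) + len(removed)}"


# First creation: two live in-place buffers, each registered under two names.
b0 = InplacedBuffer("in_out_ptr0", ["buf6", "buf11"])
b1 = InplacedBuffer("in_out_ptr1", ["buf24", "buf26"])
state1 = {"buf6": b0, "buf11": b0, "buf24": b1, "buf26": b1}

# Second creation: the first two buffers were removed, one newer buffer is live.
b2 = InplacedBuffer("in_out_ptr2", ["buf22", "buf27"])
state2 = {"buf6": REMOVED, "buf11": REMOVED, "buf24": REMOVED, "buf26": REMOVED,
          "buf22": b2, "buf27": b2}
```

In this model, `name_by_unique_values` yields `in_out_ptr2` for both `state1` and `state2`, while `name_counting_removed` yields a different name for `state2`.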
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147199
Approved by: https://github.com/jansel
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.
Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```
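To make the difference between the two patterns concrete, here is a minimal sketch. `GetattrHandler`, `DefaultHandlerSketch`, `PrintingHandler`, and the `OPS` tuple are hypothetical names for illustration, and the real `DefaultHandler` may be structured differently:

```py
class GetattrHandler:
    """Old-style pattern: every op lookup goes through __getattr__."""

    def __getattr__(self, name):
        # A fresh closure is built on each lookup, and any name is accepted.
        def inner(*args, **kwargs):
            return f"{name}({', '.join(map(str, args))})"
        return inner


OPS = ("add", "mul", "neg")  # illustrative op list, not the real one


class DefaultHandlerSketch:
    """Sketch: each op is a real method that forwards to one `_default` hook."""

    def _default(self, name, args, kwargs):
        raise NotImplementedError


def _make_op(name):
    def method(self, *args, **kwargs):
        return self._default(name, args, kwargs)
    method.__name__ = name
    return method


# Install one concrete method per op on the class, once, at import time.
for _op_name in OPS:
    setattr(DefaultHandlerSketch, _op_name, _make_op(_op_name))


class PrintingHandler(DefaultHandlerSketch):
    def _default(self, name, args, kwargs):
        return f"{name}({', '.join(map(str, args))})"
```

In this sketch, `PrintingHandler().add(1, 2)` resolves through a normal class attribute, and a misspelled op name raises `AttributeError` instead of being silently accepted. Avoiding a closure per lookup is one plausible contributor to the compile-time numbers, though the log above does not attribute the win to a specific cause.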
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254
We were codegening intermediary dtype asserts in some places but not all. This expands the assertions and fixes a newly failing assertion in
`TORCHINDUCTOR_COMPILE_THREADS=1 TORCH_LOGS="output_code" PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_logcumsumexp_cuda_float16` for scan.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146067
Approved by: https://github.com/shunting314, https://github.com/jansel
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.
Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.
Interestingly this is a small compile time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%
WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%
WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```
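One way to enforce such an invariant in Python is to check the op set when a backend class is defined. This is only a sketch under assumed names (`OpsHandlerBase`, `REQUIRED_OPS`, `PythonBackend`, `IncompleteBackend`); the PR's actual mechanism may be different:

```py
class OpsHandlerBase:
    """Sketch: reject any backend subclass that does not define every op."""

    REQUIRED_OPS = ("add", "mul", "sqrt")  # illustrative op list

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        missing = [
            op for op in cls.REQUIRED_OPS
            if not callable(getattr(cls, op, None))
        ]
        if missing:
            raise TypeError(f"{cls.__name__} is missing ops: {missing}")


class PythonBackend(OpsHandlerBase):
    @staticmethod
    def add(a, b):
        return f"({a} + {b})"

    @staticmethod
    def mul(a, b):
        return f"({a} * {b})"

    @staticmethod
    def sqrt(a):
        return f"math.sqrt({a})"


# A backend that forgets ops fails at class-definition time, not at first use.
error = None
try:
    class IncompleteBackend(OpsHandlerBase):
        @staticmethod
        def add(a, b):
            return f"({a} + {b})"
except TypeError as exc:
    error = str(exc)
```

Since each op is defined directly on the backend class, calls such as `PythonBackend.add("x", "y")` need no fallback lookup.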
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226