# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.
This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.
Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.
This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta
Co-authored-by: eellison <elias.ellison@gmail.com>
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.
# Feature
Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)
# Test plan
The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.
This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.
Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.
This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta
Co-authored-by: eellison <elias.ellison@gmail.com>
- Set the dtype of "acc" appropriately so that epilogue fusion will have args with dtype
- Update dtype propagation to use `type_to_dtype` instead of instantiating tensor
- Throw if we have a string arg where we should have a proper CSEVariable, unless we're doing the Modification Subgraph thing which is nyi. everything else is appropriately typed (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @drisspg ).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141991
Approved by: https://github.com/drisspg
ghstack dependencies: #139945, #140057, #141495, #141882
If (ynumel / YBLOCK) > get_max_ygrids(), the z dimension will be used if znumel is None. However, if (ynumel / YBLOCK) % get_max_ygrids() != 0, there will be program launches with inputs that require masking, and so this needs to be considered when determining if the y dimension has a constant mask.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139751
Approved by: https://github.com/eellison
Co-authored-by: George White <georgew@graphcore.ai>
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
Tested by the existing CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
Summary:
Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction.
Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.
Reviewed By: blaine-rister
Differential Revision: D65518033
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141693
Approved by: https://github.com/blaine-rister
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr - right now we are sort of inconsistent in dtype and dont always respect the dtype specified. would be nice to fix but not doing in this pr.
- bug fix in view dtype where we were always upcasting back to fp32 when input was in bf16/fp16. we should only be doing that if the output is also in bf16/fp16.
- for masked, avoid calling dtype propagation and just use output dtype.
Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32.
Follow ups:
- We could consider requiring less explicit upcast_compute_types calls and do it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it this pr.
- Be more consistent on our index expr dtype printing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
A couple changes.
- Tries to reuse dtype propagation rules that were already registered in inductor. These were present both with `pointwise_overrides_data` and the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules. Saves those registrations and reuses them later.
- Factors out `get_promoted_dtype` which uses functools.lru_cache to take in non - CSEVariable args because those will not work with the functools cache.
Tests get added later in the stack when everything is implemented.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
Here's a markdown summary for the PR:
# Add workspace buffer support for Triton templates
## Summary
Adds support for templates to allocate and use temporary workspace buffers
## Key Changes
- Add `WorkspaceArg` support in Triton template system
- Automatic workspace allocation/deallocation around kernel execution
- Zero-initialization support for workspace buffers
- Seamless integration with existing tensor management
## Example Usage
```python
def generate(self, ...):
workspace_arg = WorkspaceArg(
count=1024*1024, # 1MB workspace
zero_fill=True # Zero-initialized
)
return TritonTemplateCaller(..., workspace_arg=workspace_arg)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138050
Approved by: https://github.com/Chillee, https://github.com/eellison
Summary:
- This diff introduces `dtype` attribute to `TritonCSEVariable` and a dtype propagation helper function to infer dtype from input to output for each op.
- There will be a follow-up diff that uses this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen.
Test Plan: CI
Differential Revision: D61815079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778
Approved by: https://github.com/eellison, https://github.com/blaine-rister