Fixes#131337
- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
workspace.zero_()
.....
triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
del buf2, arg0_1, arg1_1, workspace
```
- add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.
The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `.
```cpp
static constexpr int64_t int_array_0[] = {1280L, };
static constexpr int64_t int_array_1[] = {1L, };
AtenTensorHandle workspace_handle;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle));
RAIIAtenTensorHandle workspace(workspace_handle);
workspace.zero_();
```
- Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid`
- Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.
The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.
- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs
```cpp
at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
workspace.zero_();
```
Test Plan:
```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench will allocate 100 ms budget to run the kernel.
Skipping this benchmarking can save quite some compilation time if the code path is hit multiple times. Let's say, if the code path is hit 100 times when the graph is large, we would save >10s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
Fix https://github.com/pytorch/pytorch/issues/134768 .
When we benchmark the latency for a fused node set, we do benchmarking twice:
1. benchmark the latency of the kernel including cloning mutated args
2. benchmark the latency of cloning mutated args without running the kernel
We subtract result 2 from result 1 to get the latency of the kernel itself.
But when the tensors are not on the cuda device 0, we get equal number for result 1 and result 2 no matter how much work the kernel does. The root cause is, in `triton.testing.do_bench` the `torch.cuda.synchronize` call sync the current cuda device (which is device 0 if it's not overriden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happens to be other kernels on the device 0).
The fix is to set the correct current device in our benchmarking code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531
Approved by: https://github.com/jansel
SplitScan makes use of a workspace arg that needs to be zeroed before it is used - then, it is used to communicate between thread blocks during the triton kernel implementation. It is mutated during during the execution of the kernel, so it should be marked as such.
Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get re-set between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed.
When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected.
804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648
Approved by: https://github.com/peterbell10, https://github.com/jansel
If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor.
Fixes #ISSUE_NUMBER
@yanboliang @jansel @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859
Approved by: https://github.com/jansel
When I did profiling using the "TORCHINDUCTOR_PROFILE" option, some kernel shows less bandwidth than expected. So, added the option to exclude the CPU overheads from the profiling time:
```
# With the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_WITH_DO_BENCH_USING_PROFILING=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.038ms 0.067 GB 1777.11GB/s triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmp03wdg8e4/m6/cm6vdqp62ofwsone3u3fmb42vs3fti5omseo3qn4ddh2bhalsvbn.py)
0.04ms 0.07 GB 1777.11GB/s
# Without the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.040ms 0.067 GB 1663.09GB/s triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmpwr6rraao/s4/cs4npkh77myatwpcmsizyduyfm6ne6o4pg4n3eodejdvvg2j3xzd.py)
0.04ms 0.07 GB 1663.09GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133523
Approved by: https://github.com/nmacchioni
Summary:
Follow up small diff to fix a couple issues:
- add condition for cuda/gpu case to only print kernel name list in the second pass i.e. when we do the cpp wrapper codegen
- other minor fixes around `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option
Test Plan:
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```
Differential Revision: D60954888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016
Approved by: https://github.com/ColinPeppler
Reland by reverting commit 844103197d. #131675 failed a few internal tests because it imported a diff version which wasn't rebased on the proper dependent diffs. Reland from OSS only to avoid the out-of-sync issue.
Original description from #131675
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is part 2 pull request which 1) adds automatic horizontal fusion in the end of the inductor operator fusion process, 2) adds type annotation for trition_combo_kernel.py
ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py
This is part 2 pull request which deals with the 2nd case above:
The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sort the schedule nodes to find all the nodes with no data dependency and create a frond end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what is the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note these algorithms we provide are very basic, and the users can register their customized combo kernel generation algorithms for both steps.
Performance wise, combining small kernels is about always to see performance gain. however, combining very large kernels may not see any perf gain, sometimes even regression possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.
Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.
Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133291
Approved by: https://github.com/wdvr
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is part 2 pull request which 1) adds automatic horizontal fusion in the end of the inductor operator fusion process, 2) adds type annotation for trition_combo_kernel.py
ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py
This is part 2 pull request which deals with the 2nd case above:
- The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sort the schedule nodes to find all the nodes with no data dependency and create a frond end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what is the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note these algorithms we provide are very basic, and the users can register their customized combo kernel generation algorithms for both steps.
- Performance wise, combining small kernels is about always to see performance gain. however, combining very large kernels may not see any perf gain, sometimes even regression possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.
Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.
Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
Differential Revision: D60067757
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131675
Approved by: https://github.com/mlazos
Summary:
**Context:**
Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866)
The way we were using this function was a “manual” process. We inject this function into the generated output.cpp file, and recompile and reload the file. This diff automates the printing value process.
**Changes:**
1. Added a simple initial debug printer helper to print out tensor values
2. Added a filter option to selectively dump tensor values.
**Usage:**
Sample cmd :
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```
Sample outputs :
```
[ before_launch - triton_poi_fused_0 - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ after_launch - triton_poi_fused_0 - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ before_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0
[ before_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ after_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0
[ after_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s
OK
```
The user is able to filter kernel names to print out values by specifying env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` and see choices of kernel names in a log message like below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']
```
In the follow-up diff, will add `torch.save()` to dump/save the intermediate tensors into individual `.pt` files that can be further `torch.load()`.
Test Plan:
Run Unit Tests in OSS: (similar cmd as mentioned above in the usage part)
`AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`
Differential Revision: D60538496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323
Approved by: https://github.com/ColinPeppler
move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)""
The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests.
The root cause: the logic for generating mask in Triton kernel needed update after a recent refactoring on triton.py. This diff includes the fix of the root cause.
See D54134695 or #124969 for more details.
Test Plan:
Originally failed tests
f585704630
f585733786
Diff patched:
f586664028
f586663820
Differential Revision: D60458597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail.
See, repro here: P1453035092.
Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail.
See, repro here: P1453035092.
Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```
We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
in theory there could also be some other casts with fused reduction -> reduction, although i havent seen this in practice as much. could be done as follow up. note: only works with cuda backend right now.
This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
Closes#129507
This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and imply the
mask from that which saves an additional reduction in the sort
kernel's inner loop.
In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
a multi-kernel to be created but this isn't supported. Fix is to
never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
KeyError during codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
Fixes#126338
## Issue Summary
When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data hasn't completed collection via the wait operation. The tensor used by `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits for a new tensor, the previous one is never freed. The `view.dtype` should only check the metadata instead of creating a new tensor. The current lowering is against its semantics and causes memory leaks.
See more great discussions in the #126338
This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.
## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it fallback properly to the correct `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the same semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.
## Example
```python
@torch.compile
def fn(x, y):
x = x.view(torch.float16)
y = y.view(torch.float16) + 1
return x @ y
x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
tl.store(out_ptr0 + (x0), tmp1, xmask)
triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
tmp2 = 1.0
tmp3 = tmp1 + tmp2
tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
del arg0_1
buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
del arg1_1
buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled to two kernels `triton_poi_fused_view_0` nad `triton_poi_fused_add_view_1`. Both of them has a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.
The main issue is that the first `view` operation didn't do anything to the actual data. But it generates a triton kernel with a new output tensor. Another small issue is that this triton kernel can't be compiled because `bitcast=True` only support type converstion with same bidwidth.
The following are output code generated after this PR.
```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
tmp2 = 1.0
tmp3 = tmp1 + tmp2
tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
del arg1_1
buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with the `aten.view.dtype` and it is directly passed as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.
The following two code snippets is for the upcasts and downcasts. For dtype in `torch.float16, torch.bfloat16`, each load will be upcasted to float32, then downcast to its original dtype to ensure use values with the right precision.
7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)
Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison