Commit Graph

1235 Commits

Oguz Ulgen
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
Xu Han
a4013e8b72 [inductor] cpp codegen alignas for all OSs. (#132387)
Changes:
1. Make cpp codegen alignas work on all OSs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132387
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-01 14:30:09 +00:00
eellison
f32ab3b9e3 Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non-deterministic. There is an internal failure we recently ran into which did not fail consistently.

See, repro here: P1453035092.

Now, with these changes, it does fail consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.
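
For illustration, a minimal sketch (not the actual OrderedSet class used in the codebase) of how an insertion-ordered set can be built on top of dict, whose guaranteed insertion order is what makes iteration deterministic across runs and hash seeds:

```python
# Minimal sketch only; the real OrderedSet has a much fuller set/MutableSet API.
class OrderedSetSketch:
    def __init__(self, items=()):
        self._data = dict.fromkeys(items)  # dict preserves insertion order (3.7+)

    def add(self, item):
        self._data[item] = None

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)  # iteration follows insertion order

buffers = OrderedSetSketch(["buf2", "buf0", "buf1"])
buffers.add("buf3")
assert list(buffers) == ["buf2", "buf0", "buf1", "buf3"]  # stable, unlike set()
```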

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00
PyTorch MergeBot
10344d76bd Revert "[AOTI] Fix bfloat16 in CPU (#132150)"
This reverts commit a488113062.

Reverted https://github.com/pytorch/pytorch/pull/132150 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_unspec_inputs_cuda_dynamic_shapes_cuda_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10189155341/job/28189531216) [HUD commit link](a488113062). Test was not run on PR due to being skipped for being slow ([comment](https://github.com/pytorch/pytorch/pull/132150#issuecomment-2261895048))
2024-08-01 03:35:39 +00:00
Shangdi Yu
a488113062 [AOTI] Fix bfloat16 in CPU (#132150)
Fixes #122986

- add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file

- Supress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare]
  436 |   if (tensor.numel() != numel) {

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-07-31 23:28:24 +00:00
Peter Bell
260c991e20 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-31 21:32:20 +00:00
Aaron Orenstein
6214b5388b typing ir.py - part 1 (#131845)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131845
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-07-31 17:37:14 +00:00
PyTorch MergeBot
784a6ec5a3 Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)"
This reverts commit 13d744464f.

Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](13d744464f) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))
2024-07-31 16:49:21 +00:00
eellison
13d744464f Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non-deterministic. There is an internal failure we recently ran into which did not fail consistently.

See, repro here: P1453035092.

Now, with these changes, it does fail consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-07-31 16:22:11 +00:00
Xuehai Pan
e7eeee473c [BE][Easy][14/19] enforce style for empty lines in import segments in torch/_[a-c]*/ and torch/_[e-h]*/ and torch/_[j-z]*/ (#129765)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129765
Approved by: https://github.com/ezyang
2024-07-31 10:42:50 +00:00
Xu Han
aa1488fe02 [inductor] turn on enable_kernel_profile on Windows. (#132025)
Enable `TORCHINDUCTOR_CPP_ENABLE_KERNEL_PROFILE` on Windows inductor.

Tested locally, passing:
![image](https://github.com/user-attachments/assets/a82351af-cc56-4ba1-a8f4-08f1c38713d1)
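
A hedged usage sketch (the profiling workflow below is illustrative and not from this PR; the environment variable is the one named above and must be set before compilation):

```python
import os

# Set before Inductor compiles so per-kernel profiling regions are emitted in
# the generated C++ code.
os.environ["TORCHINDUCTOR_CPP_ENABLE_KERNEL_PROFILE"] = "1"

import torch

@torch.compile
def f(x):
    return torch.nn.functional.relu(x) + 1

x = torch.randn(1024)
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    f(x)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```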

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132025
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 03:02:09 +00:00
leslie-fang-intel
f8e4060484 [Inductor][CPP] Enhance cppcsevar data type deduce (#130827)
**Summary**
Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](096dc444ce/torch/_inductor/codegen/common.py (L1844))), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction:

- We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`.

- To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables.
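
As an illustration of the promotion-based alignment (sketched with torch.promote_types; the template's own promotion logic may differ in details):

```python
import torch

vec_dtype, scalar_dtype = torch.bfloat16, torch.float32
# Old behavior: always cast the scalar to the vector's dtype -> bfloat16, which
# can disagree with what data_type_propagation deduced.
# Promotion-based behavior: both sides move to the promoted common type instead.
common = torch.promote_types(vec_dtype, scalar_dtype)
print(common)  # torch.float32
```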

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:51:31 +00:00
Yuzhen Huang
5298acb5c7 Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132065)
Summary:
Original commit changeset: 1d8cfdcef69d

Original Phabricator Diff: D54134695

back out: D54134695

Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc

Reviewed By: zw2326

Differential Revision: D60397377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065
Approved by: https://github.com/zw2326, https://github.com/qchip
2024-07-29 22:48:29 +00:00
eellison
8b507a922a Mode to emulate amp numerics (#131595)
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```

We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
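
A plain-eager sketch of the numerics gap this mode emulates (function names here are illustrative, not Inductor internals):

```python
import torch

def eager_style(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    t = (a.float() * b.float()).bfloat16()      # eager downcasts after the first op
    return (t.float() + 1.0).bfloat16()         # ...and upcasts again for the next op

def fused_style(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return (a.float() * b.float() + 1.0).bfloat16()  # fused: intermediate stays fp32

a = torch.randn(1024, dtype=torch.bfloat16)
b = torch.randn(1024, dtype=torch.bfloat16)
print((eager_style(a, b) != fused_style(a, b)).sum())  # typically a handful of mismatches
```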

In theory there could also be some other casts with fused reduction -> reduction, although I haven't seen this in practice as much. That could be done as a follow-up. Note: this only works with the CUDA backend right now.

This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
2024-07-29 22:42:23 +00:00
Yang Chen
05a8540041 [cpp-wrapper] create null pointer for zero-size array (#132023)
Zero-size arrays are not supported by the C or C++ standards, so we create a null pointer for them instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132023
Approved by: https://github.com/desertfire
2024-07-29 21:40:33 +00:00
PyTorch MergeBot
957a89f56c Revert "[inductor] Fix unsoundness with negative-valued indexing expressions (#131761)"
This reverts commit 03760be271.

Reverted https://github.com/pytorch/pytorch/pull/131761 on behalf of https://github.com/atalman due to Broke CI: inductor/test_cpu_cpp_wrapper.py::DynamicShapesCppWrapperCpuTests::test_linear_binary_dynamic_shapes_cpp_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10145214748/job/28051168920) [HUD commit link](03760be271) ([comment](https://github.com/pytorch/pytorch/pull/131761#issuecomment-2256287736))
2024-07-29 15:52:08 +00:00
Wu, Chunyuan
30e7fc0fe1 Cpp wrapper: set args to CppWrapperKernelArgs in cpp template kernel (#129557)
Fix the compilation error:
```cpp
/tmp/tmpywg34bca/tg/ctg7wbli6pvydsjr2xsxamdbamkquhlincuky3dzopa3ilrxqdwt.cpp:401:24: error: cannot convert ‘at::Tensor’ to ‘const bfloat16*’ {aka ‘const c10::BFloat16*’}
  401 |     cpp_fused_div_mm_0(arg2_1, constant2, _frozen_param1, buf1);
      |                        ^~~~~~
      |                        |
      |                        at::Tensor
```

The generated code after the fix will be:
```cpp
cpp_fused_div_mm_0((bfloat16*)(arg2_1.data_ptr()), (bfloat16*)(constant2.data_ptr()), (bfloat16*)(_frozen_param1.data_ptr()), (bfloat16*)(buf1.data_ptr()));
```

Multiple changes are required for ABI compatible mode. Separate it into a follow-up PR in this ghstack: https://github.com/pytorch/pytorch/pull/131841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129557
Approved by: https://github.com/leslie-fang-intel
2024-07-29 04:01:17 +00:00
Peter Bell
03760be271 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-29 03:14:13 +00:00
PyTorch MergeBot
945bf78894 Revert "[BE] typing for decorators - fx/_compatibility (#131568)"
This reverts commit 193f62fde9.

Reverted https://github.com/pytorch/pytorch/pull/131568 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
Shangdi Yu
02b922900b [aoti] Fix float16 and bfloat16 for generated GPU code (#131437)
Fixes #131333

Summary:
- Add header to define `float16` and `bfloat16` as `at::Half` and `at::BFloat16`.
- change `float16` and `bfloat16` to `float` before passing to kernel.

code generated before:
```cpp
.....
    half var_1;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1));
....
```

code generated now:
```cpp
typedef at::Half half;
typedef at::BFloat16 bfloat16;
.....
    half var_1_tmp;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1_tmp));
    float var_1 = float(var_1_tmp);
....
```

Test plan: `TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_unspec_inputs_cuda`
Work in progress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131437
Approved by: https://github.com/desertfire
2024-07-26 23:36:11 +00:00
Peter Bell
16cd1aaa1d [inductor] Improve sort kernel perf (#131719)
Closes #129507

This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and imply the
   mask from that which saves an additional reduction in the sort
   kernel's inner loop.

In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.
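
A plain-torch sketch of the masking idea in change 2 (the real kernel is generated Triton code; the names below are illustrative):

```python
import torch

RBLOCK, rnumel = 512, 300                       # padded block size vs. actual length
vals = torch.randn(RBLOCK)
rmask = torch.arange(RBLOCK) < rnumel           # mask implied from rnumel alone
padded = torch.where(rmask, vals, torch.tensor(float("inf")))  # out-of-range sorts last
sorted_vals, sorted_idx = torch.sort(padded)
sorted_idx = sorted_idx.to(torch.int16)         # change 1: int16 indices, dims are small
```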

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
2024-07-26 21:56:47 +00:00
Peter Bell
608057afe2 [inductor] Fix duplicated range tree codegen in split scan (#131669)
Looks like in the halide codegen refactor, the range tree codegen was
split out from initialize_range_tree into its own function, but
triton_split_scan.py wasn't updated to reflect this change.

The result was that the codegen got invoked twice, which is benign but makes
the kernel harder to read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131669
Approved by: https://github.com/Chillee
2024-07-26 13:11:26 +00:00
Colin Peppler
2ff98bc57f [inductor][autotune_at_compile_time] fix some codegen-ing for standalone autotuning file (#131726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131726
Approved by: https://github.com/desertfire
ghstack dependencies: #131253
2024-07-26 00:58:04 +00:00
Peter Bell
9ae288f4be [inductor] Simplify multi-kernel codegen by unifying kernel args (#127724)
Persistent kernels are sometimes able to remove intermediate buffers that would
otherwise be needed for the non-persistent reduction kernel. This makes
multi kernel's codegen more complicated as it needs to drop these extra
arguments at runtime after selecting the correct kernel to run.

Instead, this PR updates the persistent kernel's `must_keep_buffers` so these
aren't dropped during codegen so both kernels have the same signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044
2024-07-26 00:12:43 +00:00
Colin Peppler
f885a70fab [inductor][autotune_at_compile_time] support Triton kernel with sympy fn str arg (#131253)
## What is sympy fn str arg?
It's a string such as `sqrt` that also happens to be the name of a real sympy function (e.g. `sympy.sqrt`).

## Crash

```
torch/_inductor/sizevars.py", line 468, in symbolic_hint
    expr = self.simplify(expr)        # where expr is 'sqrt'
torch/_inductor/sizevars.py", line 66, in simplify
    return sympy.expand(expr).xreplace(self.replacements)
sympy/core/function.py", line 2816, in expand
    return sympify(e).expand(deep=deep, modulus=modulus, **hints)
AttributeError: 'function' object has no attribute 'expand'
```
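
A minimal repro sketch of this failure mode: the bare string "sqrt" sympifies to the sympy.sqrt function itself rather than to an expression, so expression-only methods such as .expand() raise.

```python
import sympy

e = sympy.sympify("sqrt")
print(e)  # the sympy.sqrt function object, not a sympy expression

try:
    sympy.expand("sqrt")
except AttributeError as err:
    print(err)  # 'function' object has no attribute 'expand'
```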

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131253
Approved by: https://github.com/desertfire
2024-07-25 23:31:20 +00:00
Aaron Orenstein
193f62fde9 [BE] typing for decorators - fx/_compatibility (#131568)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131568
Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519
2024-07-25 22:24:19 +00:00
Jiong Gong
316c0d3e6b [inductor][cpp][gemm] support k slicing for static shapes (#130821)
This PR provides initial support for k-slicing (i.e. parallel reduction along the k-dim) in the CPP GEMM template. Only static shapes are supported for now. When k-slicing is enabled, extra temporary buffers are allocated to hold the intermediate results, and there is an extra barrier after the initial GEMM compute by each thread: each thread first stores its GEMM result to temporary accumulation buffers (pointed to by `local_buf_ptrs`, an array of pointers to the accumulation buffers), followed by a reduction along the k-slices, the epilogue computes, and the store to the final output `Y`. Within each k-slicing thread group, the reduction along k-slices and the epilogue computes are parallelized along the M-dim. The algorithm is designed to reduce synchronization overhead as much as possible.
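
A schematic of the algorithm in plain torch, single process (the real template does this with threads, `local_buf_ptrs`, and a barrier; names below are illustrative):

```python
import torch

def gemm_k_sliced(A: torch.Tensor, B: torch.Tensor, k_slices: int) -> torch.Tensor:
    K = A.shape[1]
    chunks = torch.chunk(torch.arange(K), k_slices)
    # Step 1: each k-slice "thread" accumulates into its own temporary buffer.
    local_bufs = [A[:, idx] @ B[idx, :] for idx in chunks]
    # (barrier here in the threaded version)
    # Step 2: reduce along the k-slices; in the template this reduction and the
    # epilogue are themselves parallelized along M within the k-slicing group.
    return torch.stack(local_bufs).sum(dim=0)

A = torch.randn(64, 4096, dtype=torch.float64)
B = torch.randn(4096, 64, dtype=torch.float64)
torch.testing.assert_close(gemm_k_sliced(A, B, k_slices=4), A @ B)
```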

The k-slicing is enabled when blocking on M and N is unable to occupy all threads. Since k-slicing doesn't always bring benefit, an extra configuration is added to enable it (disabled by default). We need to identify a good heuristic in the future to enable k-slicing by default.

Performance numbers with 64x4096x64, 64x10000x64, and 64x20000x64 as examples on a 60-core SPR. As you can see, the perf of k-slicing is only better than non-k-slicing when K is large enough.

Without k-slicing
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
  cpp_packed_gemm_0 0.0108 ms 100.0%
  _linear_pointwise 0.0431 ms 25.1%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
  cpp_packed_gemm_0 0.0272 ms 100.0%
  _linear_pointwise 0.0892 ms 30.5%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
  cpp_packed_gemm_0 0.0781 ms 100.0%
  _linear_pointwise 0.1693 ms 46.1%

With k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
  cpp_packed_gemm_0 0.0260 ms 100.0%
  _linear_pointwise 0.0444 ms 58.5%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
  cpp_packed_gemm_0 0.0275 ms 100.0%
  _linear_pointwise 0.0893 ms 30.8%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
  cpp_packed_gemm_0 0.0284 ms 100.0%
  _linear_pointwise 0.1686 ms 16.8%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130821
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #131024
2024-07-25 13:36:38 +00:00
Peter Bell
2784b3f1b7 [inductor] Fix split-scan interaction with multi-kernel (#131044)
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
   a multi-kernel to be created but this isn't supported. Fix is to
   never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
   KeyError during codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
2024-07-25 11:36:36 +00:00
Yunqiu Guo
059f9fb30b [BE][inductor] Type annotate codecache.py and config.py (#131427)
As title.

Checked and referred to the raw JSON file for runtime types, and tried to cover all the missing annotations listed in the .json this time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131427
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-07-25 05:54:38 +00:00
angelayi
b90aa18569 [aoti] Add initial custom op support (#127034)
Re-land of https://github.com/pytorch/pytorch/pull/125242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127034
Approved by: https://github.com/malfet
2024-07-24 20:29:55 +00:00
eellison
5772c13f56 Dont wrap negative indexing in scatter reduce (#131503)
Fix for https://github.com/pytorch/pytorch/issues/131321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131503
Approved by: https://github.com/shunting314
2024-07-24 04:01:32 +00:00
Jiong Gong
76f7b3e560 [inductor][cpp][gemm] improve thread blocking heuristics (#131024)
This PR improves the thread blocking heuristics to favor full occupancy as much as possible. Also, the "m x n" block size is made as square as possible for better data reuse.
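
An illustrative sketch (not the actual heuristic in cpp_gemm_template.py) of the two goals: use every thread and keep each thread's m x n tile close to square:

```python
import math

def choose_thread_blocking(M: int, N: int, num_threads: int) -> tuple[int, int]:
    best = None
    for t_m in range(1, num_threads + 1):
        if num_threads % t_m:
            continue                      # only exact factorizations -> full occupancy
        t_n = num_threads // t_m
        block_m = math.ceil(M / t_m)
        block_n = math.ceil(N / t_n)
        squareness = abs(math.log(block_m / block_n))  # 0 means perfectly square
        if best is None or squareness < best[0]:
            best = (squareness, t_m, t_n)
    return best[1], best[2]

t_m, t_n = choose_thread_blocking(M=20000, N=64, num_threads=60)
print(t_m, t_n)  # partitions M and N so that t_m * t_n == 60
```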

Take the shape M=20000, N=64, K=128 as an example: the original heuristics couldn't use up all the threads when the number of threads is large, say 60:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
  _linear_pointwise 0.1010 ms 100.0%
  cpp_packed_gemm_0 0.8303 ms 12.2%
V0722 02:26:39.220660 302553 torch/_inductor/codegen/cpp_gemm_template.py:503] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:26:39.221042 302553 torch/_inductor/codegen/cpp_gemm_template.py:507] [0/0] Cache blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221118 302553 torch/_inductor/codegen/cpp_gemm_template.py:509] [0/0] Thread blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221252 302553 torch/_inductor/codegen/cpp_gemm_template.py:526] [0/0] Number of threads: 60, occupancy: (10, 2, 1)

After this PR:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
  _linear_pointwise 0.1143 ms 100.0%
  cpp_packed_gemm_0 0.1228 ms 93.1%
V0722 02:29:49.261794 304201 torch/_inductor/codegen/cpp_gemm_template.py:309] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:29:49.262860 304201 torch/_inductor/codegen/cpp_gemm_template.py:313] [0/0] Cache blocking: GemmBlocking(block_m=64, block_n=1, block_k=8)
V0722 02:29:49.262951 304201 torch/_inductor/codegen/cpp_gemm_template.py:315] [0/0] Thread blocking: GemmBlocking(block_m=69, block_n=79, block_k=8)
V0722 02:29:49.263075 304201 torch/_inductor/codegen/cpp_gemm_template.py:332] [0/0] Number of threads: 60, occupancy: (15, 4, 1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131024
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w
2024-07-24 00:36:29 +00:00
Aaron Orenstein
e3ca4e79e1 Fix mypy errors introduced by #131400 (#131522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131522
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-07-23 21:25:21 +00:00
Feng Shi
404d640c39 [1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
Consolidation with Foreach kernel:
1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode
2) The backend kernel is consolidated into ComboKernel.

(Note: this is part 1 which only deals with the 1st case above.)

Details:

1. ComboKernel can be viewed as an extension of the Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels); 2) it supports multiple kernel types, like pointwise and reduce, and may extend to matmul as well (it doesn't support mixed 1d and 2d kernels yet, but it can be extended for such cases); 3) the blocks are interleaved among the sub-kernels (this can be extended to other arrangements); 4) it is designed to be general enough to combine kernels without dependencies and doesn't rely on certain patterns; 5) it doesn't support dynamic sizes yet but can be easily extended for that.

2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel; the front-end kernel generation logic remains the same. 2) An extra optimization phase is added to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py.

3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) Then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their customized combo kernel generation algorithms for both steps.

4. Performance-wise, combining small kernels almost always brings a performance gain. However, combining very large kernels may not see any perf gain, and sometimes even a regression, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Example:
- element-wise kernels
Original PyTorch function:
```
 def test_activations(a, b, c):
     a1 = torch.nn.functional.relu(a)
     b1 = torch.nn.functional.sigmoid(b)
     c1 = torch.nn.functional.tanh(c)
     return a1, b1, c1
```
Generated combo kernel:
```
@triton_heuristics.pointwise(
    size_hints=[512], tile_hint=TileHint.DEFAULT,
    filename=__file__,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]},
    inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []}
)
@triton.jit
def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr):
    pid = tl.program_id(0)
    if pid % 3 == 0:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = triton_helpers.maximum(0, tmp0)
        tl.store(out_ptr0 + (x0), tmp1, xmask)
    elif pid % 3 == 1:
        pid_offset = pid // 3
        xnumel = 400
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x1 = xindex
        tmp2 = tl.load(in_ptr1 + (x1), xmask)
        tmp3 = tl.sigmoid(tmp2)
        tl.store(out_ptr1 + (x1), tmp3, xmask)
    elif pid % 3 == 2:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x2 = xindex
        tmp4 = tl.load(in_ptr2 + (x2), xmask)
        tmp5 = libdevice.tanh(tmp4)
        tl.store(out_ptr2 + (x2), tmp5, xmask)
    else:
        pass
```
- reduction kernels
Original Pytorch function:
```
def test_reduce(a, b, c):
     a1 = torch.sum(a, dim=0)
     b1 = torch.max(b, dim=0)
     c1 = torch.min(c, dim=0)
     return a1, b1, c1
```
Generated combo kernel:
```
 @triton_heuristics.persistent_reduction(
     size_hints=[32, 32],
     reduction_hint=ReductionHint.DEFAULT,
     filename=__file__,
     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*i64', 5: '*fp32', 6: '*i64', 7: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]},
     inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []}
 )
 @triton.jit
 def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr):
     pid = tl.program_id(0)
     if pid % 3 == 0:
         pid_offset = pid // 3
         xnumel = 20
         rnumel = 20
         RBLOCK_0: tl.constexpr = 32
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_0)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r1 = rindex
         x0 = xindex
         tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0)
         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0])
         tmp3 = tl.where(rmask & xmask, tmp1, float("-inf"))
         tmp4 = triton_helpers.max2(tmp3, 1)[:, None]
         tmp6 = tl.broadcast_to(rindex, tmp3.shape)
         _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1)
         tmp5 = tmp5_tmp[:, None]
         tl.store(out_ptr0 + (x0), tmp4, xmask)
         tl.store(out_ptr1 + (x0), tmp5, xmask)
     elif pid % 3 == 1:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_1: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_1)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r3 = rindex
         x2 = xindex
         tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0)
         tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1])
         tmp10 = tl.where(rmask & xmask, tmp8, float("inf"))
         tmp11 = triton_helpers.min2(tmp10, 1)[:, None]
         tmp13 = tl.broadcast_to(rindex, tmp10.shape)
         _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1)
         tmp12 = tmp12_tmp[:, None]
         tl.store(out_ptr2 + (x2), tmp11, xmask)
         tl.store(out_ptr3 + (x2), tmp12, xmask)
     elif pid % 3 == 2:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_2: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_2)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r5 = rindex
         x4 = xindex
         tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0)
         tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2])
         tmp17 = tl.where(rmask & xmask, tmp15, 0)
         tmp18 = tl.sum(tmp17, 1)[:, None]
         tl.store(out_ptr4 + (x4), tmp18, xmask)
     else:
         pass
```

Note: ComboKernels use masks to allow combining kernels that work with tensors of different sizes.

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test/inductor:foreach
```
```
buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
```

Differential Revision: D54134695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969
Approved by: https://github.com/mlazos
2024-07-23 17:34:28 +00:00
Yueming Hao
979429ca89 [inductor]Add DtypeView to avoid memory leak and unnecessary kernel generations (#128883)
Fixes #126338
## Issue Summary

When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data hasn't yet been collected via the wait operation. The tensor used by the `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits on a new tensor, the previous one is never freed. `view.dtype` should only change the metadata instead of creating a new tensor; the current lowering goes against its semantics and causes memory leaks.

See more great discussions in the #126338

This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.

## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it falls back properly to the actual `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches that of the input's original data type, we lower it to the newly added `DtypeView` in the IR, which acts like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.

## Example

```python
@torch.compile
def fn(x, y):
    x = x.view(torch.float16)
    y = y.view(torch.float16) + 1
    return x @ y

x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tl.store(out_ptr0 + (x0), tmp1, xmask)

triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
...
        triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg0_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
        triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
        extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled to two kernels, `triton_poi_fused_view_0` and `triton_poi_fused_add_view_1`. Both of them have a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.

The main issue is that the first `view` operation doesn't do anything to the actual data, yet it generates a Triton kernel with a new output tensor. Another small issue is that this Triton kernel can't be compiled because `bitcast=True` only supports type conversion between types of the same bit width.

The following is the output code generated after this PR.

```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
        triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
        extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with `aten.view.dtype`, and its result is passed directly as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed as well.

The following two code snippets are for the upcasts and downcasts. For dtypes in `torch.float16, torch.bfloat16`, each load will be upcast to float32, then downcast to its original dtype to ensure values are used with the right precision.

7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)
7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)

Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
2024-07-23 17:31:39 +00:00
eellison
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
Henry Tsang
0246b28510 [aoti] refactor aoti_torch__scaled_mm and skip aoti fp8 test for some cases (#130868)
Continuing https://github.com/pytorch/pytorch/pull/128683 and https://github.com/pytorch/pytorch/pull/130582.

The API of _scaled_mm has changed. For example, there is only one return value now, so we change the AOTI API as well.

Also, tested the fp8 tests offline. The test_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface would fail with `error: use of undeclared identifier 'float8_e4m3fn'` and `error: use of undeclared identifier 'half'`, so skipping them for now.

The reason this wasn't known earlier is probably because the CI doesn't use H100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130868
Approved by: https://github.com/drisspg, https://github.com/chenyang78, https://github.com/desertfire
2024-07-22 15:24:20 +00:00
xinan.lin
8da19fec60 [Inductor] Support store SPIR-V binary file output from Intel Triton. (#130849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130849
Approved by: https://github.com/peterbell10, https://github.com/EikanWang
2024-07-22 05:59:03 +00:00
Xuehai Pan
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
Wu, Chunyuan
a8319698b3 [inductor] [cpp] improve cache blocking with CPU info (#129348)
## Description
For the single-thread case, this PR improves the cache blocking in the CPP GEMM template with the CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated based on the conditions below (a sketch follows the list):
- size_of_B < L1
- size_of_A < 0.5 * L2
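
An illustrative calculation under those conditions (the block shapes assumed below, `Kc * Nr` for B's block and `Mc * Kc` for A's block, are an assumption for the sketch; the real logic lives in the CPP GEMM template):

```python
def cache_blocking(Nr: int, dtype_size: int, L1: int, L2: int) -> tuple[int, int]:
    Kc = max(1, L1 // (Nr * dtype_size))             # keep B's Kc x Nr block in L1
    Mc = max(1, int(0.5 * L2) // (Kc * dtype_size))  # keep A's Mc x Kc block in half of L2
    return Mc, Kc

# Illustrative figures: bf16 (2 bytes), Nr = 32, 48 KB L1, 2 MB L2.
print(cache_blocking(Nr=32, dtype_size=2, L1=48 * 1024, L2=2 * 1024 * 1024))
```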

For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations.

## Performance
No regressions. Models with > 3% performance speedup are listed below:

### BF16 single thread (measured on CPU with AMX support)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | detectron2_fasterrcnn_r_101_dc5| 4%

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | detectron2_fasterrcnn_r_101_dc5| 4%

### FP32 single thread (measured on Ice Lake)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | basic_gnn_edgecnn| 10%

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | basic_gnn_edgecnn| 10%

### Next step
The E2E level improvement is limited due to the below reasons:

- For several HF models, we can observe a performance improvement at the kernel level for the gemm template kernel, but since the performance is either still worse than the ATen kernel (and thus won't be selected during autotune) or improved from worse than ATen to similar to ATen, we don't see an E2E-level performance change.

- There are models where the gemm template kernel could get a > 10% performance improvement with this PR, but since the kernel time is only about 3% of the E2E time, we don't observe a significant E2E-level improvement.

We will continue to find possible optimizations in the gemm template kernel in follow-up PRs.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
2024-07-20 06:53:31 +00:00
Jiong Gong
0b44e1a74c [inductor][cpp][gemm] optimize arbitrary N in packed gemm template (#130690)
Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32, etc., while falling back to the reference micro-gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` to the next multiple of `register_block_n` (8, 16, 32, etc.) for the packed weight, so the micro-gemm can work as-is on the padded `n`. When the weight is padded, we use a local accumulation buffer to receive the result from the micro-gemm and then unpad (slice) it before storing back to the output buffer.
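
A plain-torch sketch of the padding scheme (register_block_n and the buffer handling are simplified; the real template packs the weight ahead of time):

```python
import torch

def padded_gemm(A: torch.Tensor, B: torch.Tensor, register_block_n: int = 16) -> torch.Tensor:
    K, N = B.shape
    n_padded = -(-N // register_block_n) * register_block_n   # round N up
    B_packed = torch.zeros(K, n_padded, dtype=B.dtype)
    B_packed[:, :N] = B                            # pad the packed weight once
    local_acc = A @ B_packed                       # micro-gemm runs on the padded n
    return local_acc[:, :N]                        # unpad (slice) before storing out

A = torch.randn(512, 768, dtype=torch.float64)
B = torch.randn(768, 3073, dtype=torch.float64)
torch.testing.assert_close(padded_gemm(A, B), A @ B)
```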

Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.

Before
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  _linear_pointwise 2.3563 ms 100.0%
  cpp_packed_gemm_0 710.5902 ms 0.3%

After
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  cpp_packed_gemm_0 1.8909 ms 100.0%
  _linear_pointwise 2.1016 ms 90.0%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
2024-07-20 06:30:15 +00:00
Peter Bell
27c2a0d63b [inductor] Separate Buffer and Operation into two concepts (#130831)
Resubmit of #128893

Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.
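
A conceptual sketch of the split (simplified names, not the scheduler's real classes): a Buffer is just storage, an Operation is the computation, and one Operation may now produce several Buffers.

```python
from dataclasses import dataclass, field

@dataclass
class Buffer:
    name: str
    device: str
    size_bytes: int

@dataclass
class Operation:
    name: str
    reads: list = field(default_factory=list)
    outputs: list = field(default_factory=list)   # multiple outputs are now possible

inp = Buffer("arg0_1", "cuda", 4096)
values = Buffer("buf0", "cuda", 4096)
indices = Buffer("buf1", "cuda", 1024)
sort_op = Operation("sort_with_indices", reads=[inp], outputs=[values, indices])
print([b.name for b in sort_op.outputs])  # ['buf0', 'buf1']
```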

Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831
Approved by: https://github.com/lezcano
2024-07-20 02:05:07 +00:00
Xuehai Pan
f0075c179b Pin sympy >= 1.13.0 (#130895)
------

The opposite of #130836. Pin `sympy >= 1.13.0` for Python >= 3.9 and `sympy == 1.12.1` for Python 3.8.

- #130836

See the PR description of #130836 for more details.

`sympy` 1.13.0 introduces some breaking changes which break our tests. More specifically:

- Ref [Backwards compatibility breaks and deprecations](https://github.com/sympy/sympy/wiki/release-notes-for-1.13.0#backwards-compatibility-breaks-and-deprecations)

> BREAKING CHANGE: Float and Integer/Rational no longer compare equal with a == b. From now on Float(2.0) != Integer(2). Previously expressions involving Float would compare unequal e.g. x*2.0 != x*2 but an individual Float would compare equal to an Integer. In SymPy 1.7 a Float will always compare unequal to an Integer even if they have the same "value". Use sympy.numbers.int_valued(number) to test if a number is a concrete number with no decimal part. ([#25614](https://github.com/sympy/sympy/pull/25614) by [@smichr](https://github.com/smichr))
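
A small illustration of the quoted break (the result depends on the installed sympy version):

```python
import sympy

x = sympy.Symbol("x")
# Expressions involving Float already compared unequal on older sympy too:
print((x * 2.0) == (x * 2))                  # False on both old and new sympy
# ...but a bare Float used to compare equal to an Integer; 1.13.0 changes that:
print(sympy.Float(2.0) == sympy.Integer(2))  # True on sympy <= 1.12.x, False on >= 1.13.0
```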

`sympy >= 1.13.0` is required to enable Python 3.13 support. This should be part of #130689.

- #130689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130895
Approved by: https://github.com/ezyang
2024-07-20 00:59:24 +00:00
peaceorwell
6657b14a64 [inductor] Fix the method for checking the variable type of entry.numel (#131026)
The data type of numel in the IterationRangesEntry class is sympy.Expr. To determine whether it's an integer, we need to check against sympy.Integer.
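
For illustration (an integer-valued sympy expression is not a Python int):

```python
import sympy

numel = sympy.Integer(128)
print(isinstance(numel, int))            # False: sympy.Integer does not subclass int
print(isinstance(numel, sympy.Integer))  # True

numel = sympy.Symbol("s0")               # dynamic-shape case: not a constant at all
print(isinstance(numel, sympy.Integer))  # False
```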

Co-authored-by: peterbell10 <peterbell10@live.co.uk>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131026
Approved by: https://github.com/peterbell10
2024-07-19 22:51:11 +00:00
Jiong Gong
39493aa934 [inductor][cpp][gemm] move bias add to epilogue (#130675)
Speed up the bias-add compute by moving it to the epilogue. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
  cpp_packed_gemm_0 1.9200 ms 100.0%
  _linear_pointwise 1.9345 ms 99.3%

After
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
  cpp_packed_gemm_0 1.8321 ms 100.0%
  _linear_pointwise 1.9246 ms 95.2%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130675
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-07-19 01:16:34 +00:00
eellison
e14d1d10ef Unwrap Identity in prepare indexing (#130967)
We wrap indexing calculation in the concat kernel in `Identity` so that we do not expand int32 intermediates to int64. This was causing an issue where the index simplified to an integer and would not hit an intended [path](752c817898/torch/_inductor/codegen/triton.py (L1554)) which would do wrapping with tl.full.

I couldn't generate a minimal repro to add as test but I have a repro you can check here: P1483831261 There is already a test that we dont expand the int32 intermediates to int64.

Differential Revision: [D59871850](https://our.internmc.facebook.com/intern/diff/D59871850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130967
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-07-18 00:43:53 +00:00
Bin Bao
752c817898 [AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#130796)
Summary: Unify the argument codegen logic between the Python wrapper and the cpp wrapper.

Differential Revision: [D59809273](https://our.internmc.facebook.com/intern/diff/D59809273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130796
Approved by: https://github.com/oulgen
2024-07-17 18:37:23 +00:00
Isuru Fernando
b7d2abd766 Fix vectorized ops.masked (#130130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130130
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-07-17 14:55:11 +00:00
Colin Peppler
f272e0ab4a [inductor] support unbacked symint divisors in vars_and_sizes (#130595)
Scenario:
```
>>> nodes
IterationRangesEntry(
    x2,
    divisor=192*u0 + 192576,
    length=s1,
    (xindex//(192*u0 + 192576)),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
IterationRangesEntry(
    x1,
    divisor=192,
    length=u0 + 1003,
    ModularIndexing(xindex, 192, u0 + 1003),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
IterationRangesEntry(
    x0,
    divisor=1,
    length=192,
    ModularIndexing(xindex, 1, 192),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
```

Think about whether using the fallback is safe here. I think it's safe because the divisor of one IterationRangesEntry should be the product of the lengths of the preceding IterationRangesEntries? Unless one of the lengths divides by an unbacked symint?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130595
Approved by: https://github.com/aakhundov, https://github.com/ezyang
2024-07-16 16:21:38 +00:00
Jiong Gong
705da70f2c [inductor][cpp] align dtype convert cache between vec and scalar kernels (#130677)
The conversion cache used for fixing https://github.com/pytorch/pytorch/issues/115260 depended on "store", which might be removed and ignored. This would lead to inconsistent code generated between the vec and scalar kernels, since we generate the scalar kernel first, followed by the vector kernel, and the store buffer might be removed by the scalar kernel, which impacts the vector kernel codegen. This PR moves the caching from "store" to the "to_dtype" calls, which won't be impacted by the removed buffers.

`pytest -k test_consistent_remove_buffers test/inductor/test_cpu_repro.py`

before
```c++
extern "C"  void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = at::vec::convert<float>(tmp3);
            auto tmp5 = tmp1 + tmp4;
            auto tmp6 = at::vec::convert<bfloat16>(tmp5);
            tmp6.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```

after
```c++
extern "C"  void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = tmp1 + tmp2;
            auto tmp5 = at::vec::convert<bfloat16>(tmp4);
            tmp5.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130677
Approved by: https://github.com/leslie-fang-intel
2024-07-16 13:25:05 +00:00