Commit Graph

238 Commits

Author SHA1 Message Date
Shunting Zhang
fbafff3668 [reland][inductor] benchmark fusion (#112450)
reland https://github.com/pytorch/pytorch/pull/108193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112450
Approved by: https://github.com/jansel
2023-10-31 18:17:06 +00:00
PyTorch MergeBot
64fd027f2e Revert "[inductor] benchmark fusion (#108193)"
This reverts commit 73cc5d1cdd.

Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1782157638))
2023-10-27 01:40:06 +00:00
Shunting Zhang
73cc5d1cdd [inductor] benchmark fusion (#108193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193
Approved by: https://github.com/jansel
2023-10-26 22:18:37 +00:00
PyTorch MergeBot
485cc0faae Revert "[inductor] benchmark fusion (#108193)"
This reverts commit ec0cdcdf6a.

Reverted https://github.com/pytorch/pytorch/pull/108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](https://github.com/pytorch/pytorch/pull/108193#issuecomment-1781473282))
2023-10-26 16:41:20 +00:00
Shunting Zhang
ec0cdcdf6a [inductor] benchmark fusion (#108193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108193
Approved by: https://github.com/jansel
2023-10-26 04:14:22 +00:00
Guilherme Leobas
f97c2dabd9 Move negative index checking to common.py - Fix issue 97365 (#108690)
Fixes https://github.com/pytorch/pytorch/issues/97365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108690
Approved by: https://github.com/lezcano
2023-10-24 17:27:54 +00:00
PyTorch MergeBot
e62c887bab Revert "[inductor][BE] split triton_meta and inductor_meta (#111397)"
This reverts commit 070b94dc08.

Reverted https://github.com/pytorch/pytorch/pull/111397 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111397#issuecomment-1776282039))
2023-10-24 00:52:24 +00:00
David Berard
070b94dc08 [inductor][BE] split triton_meta and inductor_meta (#111397)
triton_meta is intended to be passed directly to triton. Previously we were also putting other metadata into triton_meta; we should split that other metadata out into a separate dict to avoid possible conflicts in the future.

This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton.
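A hedged sketch of the split described above (the key names below are illustrative placeholders, not Inductor's actual metadata fields):

```python
# Illustrative metadata dict: "signature" and "device" stand in for fields
# Triton actually consumes; the rest stand in for Inductor-only metadata.
metadata = {
    "signature": "*fp32, *fp32, i32",
    "device": 0,
    "kernel_name": "triton_poi_fused_add_0",
    "mutated_arg_names": [],
}

TRITON_KEYS = {"signature", "device"}

# Split into what is handed to Triton vs. what Inductor keeps for itself.
triton_meta = {k: v for k, v in metadata.items() if k in TRITON_KEYS}
inductor_meta = {k: v for k, v in metadata.items() if k not in TRITON_KEYS}

print(sorted(triton_meta))    # ['device', 'signature']
print(sorted(inductor_meta))  # ['kernel_name', 'mutated_arg_names']
```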

Tests - wait for CI

Differential Revision: [D50442547](https://our.internmc.facebook.com/intern/diff/D50442547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111397
Approved by: https://github.com/shunting314, https://github.com/eellison
2023-10-23 21:38:21 +00:00
Jon Chuang
9c7f464eef [inductor]: Better debugging of can_fuse decisions with TORCH_LOGS=fusion (#110415)
Fixes https://github.com/pytorch/pytorch/issues/110393

Example logs (for adagrad on main). In this case, it clearly identifies device mismatch as a potential red flag, which is indeed the obstacle to adagrad's successful fusion. (see: https://github.com/pytorch/pytorch/pull/110339)
```
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== attempting fusion (1/10): 18 nodes =====
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,084] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (foreach:3): candidate consumer has no dep in any foreach producer
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 13 possible fusions:
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf8'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf10'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf12'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf14'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf9'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf4_buf5_buf6_buf7), SchedulerNode(name='buf11'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf13'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (ForeachKernelSchedulerNode(nodes=buf0_buf1_buf2_buf3), SchedulerNode(name='buf15'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf25'), SchedulerNode(name='buf33'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf43'), SchedulerNode(name='buf51'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf34'), SchedulerNode(name='buf42'))
[2023-10-03 21:50:24,085] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] (SchedulerNode(name='buf16'), SchedulerNode(name='buf24'))
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] completed fusion round (1/10): fused 18 nodes into 5 nodes
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG]
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== attempting fusion (2/10): 5 nodes =====
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] cannot fuse (7): device mismatch (node1: cuda:0, node2: cpu)
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] 0 possible fusions:
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] completed fusion round (2/10): fused 5 nodes into 5 nodes
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG]
[2023-10-03 21:50:24,087] [0/0] torch._inductor.scheduler.__schedule: [DEBUG] ===== fusion complete (2 iterations) =====

```
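Logs like the above are produced by setting the `TORCH_LOGS` environment variable before running the compiled workload (the workload script name below is illustrative):

```shell
# Enable Inductor's fusion-decision logging for the next run.
export TORCH_LOGS=fusion
# python your_training_script.py   # script name is illustrative
echo "TORCH_LOGS=$TORCH_LOGS"
```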

CC @jansel @ngimel @mlazos @shunting314 @peterbell10  as code owners

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110415
Approved by: https://github.com/mlazos
2023-10-13 00:36:45 +00:00
Jack Taylor
96f616a054 Revert tl.int1 casting change for ROCm to avoid hangs (#110531)
Seeing hangs on ROCm seemingly after this PR https://github.com/pytorch/pytorch/pull/110388
https://ossci-raw-job-status.s3.amazonaws.com/log/17381916785
`inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp2_cuda_bool Command took >30min, returning 124`

Conditionalising out of this while we investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110531
Approved by: https://github.com/peterbell10
2023-10-06 08:53:45 +00:00
Kazuaki Ishizaki
434a996c42 Fix typo under torch/_inductor directory (#110530)
This PR fixes typos in comments and messages in files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530
Approved by: https://github.com/kit1980
2023-10-05 02:17:20 +00:00
Peter Bell
dc794ec32c [dynamo] Trace through builtin abs (#110398)
In Python, `abs(x)` does nothing but delegate to `x.__abs__()`, so we should do
the same in dynamo. This also adds `SymNode.__abs__` so we can trace through
indexing expressions involving `abs`.
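The delegation can be seen with any class defining `__abs__`; a minimal illustration:

```python
class Meters:
    """Tiny example type; the builtin abs() delegates to __abs__."""
    def __init__(self, value):
        self.value = value

    def __abs__(self):
        return Meters(abs(self.value))

m = abs(Meters(-3))
print(m.value)  # 3
```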

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110398
Approved by: https://github.com/jansel, https://github.com/lezcano
2023-10-03 19:25:37 +00:00
Levy Zhao
7f0a659ccc Script to compare measured (trace) runtimes with estimated runtimes (#108037) (#109076)
Summary:

X-link: https://github.com/pytorch/benchmark/pull/1856

Reviewed By: xmfan, xuzhao9

Differential Revision: D48523883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109076
Approved by: https://github.com/xw285cornell
2023-10-03 17:05:35 +00:00
Peter Bell
01b2f25ebd [inductor] Cast loads from boolean tensors to tl.int1 (#110388)
Triton currently loads from a pointer to `tl.int1` as `tl.int8`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110388
Approved by: https://github.com/lezcano, https://github.com/Skylion007
2023-10-02 22:52:08 +00:00
chilli
13681382d5 Add heuristic for when evict_first should be set (and some other minor things) (#108841)
Example of when the `evict_first` heuristic helps.
```
@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
from torch._inductor.utils import do_bench
print(do_bench(lambda: f(*inps)))
```

This generates code like this: http://ix.io/4HFs

```
Original: 3.8 ms
This PR: 3.54 ms
Always evict_first: 5.4 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-10-01 17:06:12 +00:00
Jon Chuang
6aae636f69 chore(inductor): Simplify will_fusion_create_cycle and cleanup to node.ancestors (#109976)
`recursive_predecessors` == `ancestors`, so rename it.

Improve comments

Simplify `will_fusion_create_cycle` - make it easier to read and add detailed comments.

Diagram to illustrate the shortcut clarification:
![Inductor Deep Dive](https://github.com/pytorch/pytorch/assets/9093549/7a30e088-8a33-4a9c-a8a7-81199cd086e2)
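As a toy model of the check this function performs (not Inductor's implementation): fusing two nodes is illegal when a third node lies strictly between them in the dependency order, since the fused node would then depend on itself.

```python
def will_fusion_create_cycle(ancestors, node1, node2):
    """Toy cycle check. `ancestors[x]` is the set of all transitive
    predecessors of x; fusing node1/node2 creates a cycle iff some third
    node w sits on a path node1 -> w -> node2 (or the reverse)."""
    def path_through_third(a, b):
        return any(a in ancestors[w] for w in ancestors[b] - {a})
    return path_through_third(node1, node2) or path_through_third(node2, node1)

# Chain a -> b -> c: fusing a and c would trap b inside the fused node's
# dependency cycle; fusing adjacent nodes is fine.
ancestors = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
print(will_fusion_create_cycle(ancestors, "a", "c"))  # True
print(will_fusion_create_cycle(ancestors, "a", "b"))  # False
```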

CC: @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109976
Approved by: https://github.com/jansel
2023-09-27 20:48:53 +00:00
Peter Bell
92d86cd1ad [inductor] Fix triton compiler error in multilayer any (#109325)
Fixes #109196

When we have a split reduction and the tensor is not an even multiple of the split size,
we use `ops.masked` to pad to an even multiple. In the case here we generated:
```python
tmp5 = tl.where(mask, tmp4, 0)
```

which implicitly promotes our boolean value to `int32`. The fix is to give the default
value the same dtype as `result`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109325
Approved by: https://github.com/lezcano
2023-09-26 12:29:29 +00:00
Ying Zhang
bbdce93571 Basic fp8 support in Inductor (#109168)
Add basic fp8 support in Inductor, including:
* Fix fp8 Triton codegen issues;
* Add min_elements_per_thread requirement for fp8 related dtype conversions. More details on the Triton implementation can be found at 10f59d8ce0/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp (L10).

Note that the current implementation only works for Pointwise. Will create follow-up PRs for Reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109168
Approved by: https://github.com/drisspg
2023-09-23 04:41:41 +00:00
Edward Z. Yang
3268b039ec Handle unbacked symints in Triton size hints (#109609)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109609
Approved by: https://github.com/yf225
2023-09-22 03:16:53 +00:00
PyTorch MergeBot
169ae7540d Revert "Handle unbacked symints in Triton size hints (#109609)"
This reverts commit 654731a52b.

Reverted https://github.com/pytorch/pytorch/pull/109609 on behalf of https://github.com/ezyang due to this seems to regress HF perf ([comment](https://github.com/pytorch/pytorch/pull/109609#issuecomment-1729688883))
2023-09-21 14:25:42 +00:00
Edward Z. Yang
654731a52b Handle unbacked symints in Triton size hints (#109609)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109609
Approved by: https://github.com/yf225
ghstack dependencies: #109603
2023-09-20 18:03:54 +00:00
Sam Larsen
85d26f7868 [inductor] Enable mypy checking for torch/_inductor/codegen/triton.py (#109146)
Summary: enable mypy checking for torch/_inductor/codegen/triton.py and make the minimum number of fixes / ignores to get the linter to pass

Test Plan: `lintrunner -a`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109146
Approved by: https://github.com/peterbell10
2023-09-19 23:01:03 +00:00
PyTorch MergeBot
800c665618 Revert "[inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)"
This reverts commit 5976a08eea.

Reverted https://github.com/pytorch/pytorch/pull/106581 on behalf of https://github.com/peterbell10 due to This combined with #108803 uncovered a triton bug openai/triton#2298 ([comment](https://github.com/pytorch/pytorch/pull/106581#issuecomment-1719811113))
2023-09-14 16:58:52 +00:00
Yang Chen
9cd4548f01 AOTInductor dynamic shape (#109012)
Summary: This PR adds dynamic-shape support for AOTInductor

* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the inference run
method will assign the current real dimensional value to each dynamic
dimension before executing any kernel.

* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
in the C++ land by re-using the grid functions from the Python world.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or mis-computations caused by invalid indices
for fetching or storing data in device memory.
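As a sketch of the runtime grid computation described above (this mirrors how Triton-style launch grids are typically derived from a block size; the names are illustrative, not the AOTInductor API):

```python
import math

def launch_grid(numel, xblock):
    # Number of blocks needed to cover `numel` elements. With dynamic shapes,
    # `numel` comes from the real dimension value at inference time, so the
    # grid must be recomputed on every call rather than baked in.
    return (math.ceil(numel / xblock),)

print(launch_grid(1000, 128))  # (8,)
print(launch_grid(4096, 128))  # (32,)
```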

Differential Revision: D49100472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
2023-09-14 08:00:30 +00:00
Ying Zhang
097fd43f8c [Inductor CUTLASS backend] Step 4: CUDA (template) kernels (#107931)
This is the step 4 to add cutlass as an alternative inductor backend.
Full tests can be found from the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107931
Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng
ghstack dependencies: #107802, #107847, #107901
2023-09-12 17:44:38 +00:00
Peter Bell
5976a08eea [inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.

Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
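In plain Python terms, the scan being lowered here is an inclusive prefix sum (this models only the semantics, not Inductor's kernel machinery):

```python
from itertools import accumulate

def cumsum(xs):
    # Inclusive scan: out[i] = xs[0] + ... + xs[i]
    return list(accumulate(xs))

print(cumsum([1, 2, 3, 4]))  # [1, 3, 6, 10]
```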

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
2023-09-11 18:44:10 +00:00
David Berard
ed7f9cac91 [inductor] Add CPU-side profiler event names for templates and foreach kernels (#108449)
This passes in the descriptive kernel name as part of the triton_meta dict that gets passed to the CachingAutotuner, for foreach kernels and templates.

Before:
<img width="684" alt="Screenshot 2023-09-01 at 11 56 02 AM" src="https://github.com/pytorch/pytorch/assets/5067123/c14e13fc-0d9e-425a-a08b-613ef42aa264">

After:
<img width="562" alt="Screenshot 2023-09-01 at 2 13 00 PM" src="https://github.com/pytorch/pytorch/assets/5067123/551bb9a9-865b-401e-b6e0-8ebbe5431565">

This PR also refactors the "magic strings" (KERNEL_NAME and DESCRIPTIVE_KRNL_NAME) into an enum in utils.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108449
Approved by: https://github.com/jansel
2023-09-09 02:11:13 +00:00
PyTorch MergeBot
8ba23e48fa Revert "[inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)"
This reverts commit 53a27021c5.

Reverted https://github.com/pytorch/pytorch/pull/106581 on behalf of https://github.com/atalman due to Sorry for reverting your change, but it broke rocm CI ([comment](https://github.com/pytorch/pytorch/pull/106581#issuecomment-1710776610))
2023-09-07 21:13:42 +00:00
Peter Bell
53a27021c5 [inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano
2023-09-07 17:40:45 +00:00
Shunting Zhang
7cb4bf675b [inductor] no-side-effect codegen (#107617)
Inductor kernel codegen previously had the following side effects:
- in `Kernel.__exit__`, we add locally used buffers to graph.removed_buffers
- during codegen, we do memory allocation/free.

These side effects make it hard to do multiple versions of codegen for the same kernel. This PR refactors the code so that kernel codegen does not change graph-level state: after codegenning a kernel, the graph-level state is unchanged, so we can go on to codegen another version of the kernel if we want.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617
Approved by: https://github.com/jansel
2023-08-31 00:25:17 +00:00
Shunting Zhang
556bfe7cb5 [inductor] let codegen not rely on node order (#107320)
We'd like to benchmark fusion (either for autotuning or for gathering data to find patterns that can guide optimizations). There is a deadlock that prevents us from doing this: to benchmark fusion, we need to do codegen before all the fusions are done. However, codegen currently relies on xSchedulerNode.last_usage information to decide which buffers are not needed at all and thus don't even need to be allocated/written (Scheduler.removed_buffers tracks this). xSchedulerNode.last_usage can only be computed once the order of all the nodes has been decided, but each fusion pass (`fuse_nodes_once`) can also change node order, so we know the final node order only after all fusions have completed. That blocks us from doing codegen during fusion (before all fusions are done).

Here I just show the above with a chain of dependencies to make it easier to understand (a -> b means a depends on b, or b has to happen before a):
```
  benchmark one fusion decision -> codegen -> xSchedulerNode.last_usage -> node order -> all fusions have completed
```

Actually, we only need to decide whether a buffer has only local usages (if yes, it's a candidate for removal). This can be decided once we know all the users of each buffer, so we can avoid relying on xSchedulerNode.last_usage in this case.
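The order-free criterion in the last paragraph can be sketched as follows (a toy model; the names are illustrative):

```python
def buffer_removable(buffer_users, kernel_nodes):
    # A buffer is a removal candidate iff every one of its users lives inside
    # the kernel that produced it -- a fact about the user *set*, which is
    # known without fixing any final node order.
    return buffer_users <= kernel_nodes

print(buffer_removable({"node1", "node2"}, {"node1", "node2", "node3"}))  # True
print(buffer_removable({"node1", "node4"}, {"node1", "node2", "node3"}))  # False
```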

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107320
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-30 02:34:20 +00:00
Elias Ellison
d040d5b9ee Fix multi output layout error in indexing dtype calculation (#108085)
Differential Revision: [D48757829](https://our.internmc.facebook.com/intern/diff/D48757829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108085
Approved by: https://github.com/yanboliang, https://github.com/davidberard98, https://github.com/jansel, https://github.com/peterbell10
2023-08-29 05:43:44 +00:00
Michael Lazos
d4a99631dd Handle 2D blocking with foreach (#107840)
Previously, blocking in foreach ops was only 1D. This PR allows foreach to handle kernels with 2D blocking as well.

Code when at least one dim matches:
[example code + output](https://gist.github.com/mlazos/9f100b21cfe2540f0a24303a8349c196)

Code when neither the X nor the Y dim matches:
[example code + output](https://gist.github.com/mlazos/14e2a455f635896dface09be601595dd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107840
Approved by: https://github.com/jansel
2023-08-26 11:02:46 +00:00
Shunting Zhang
95cacb7fa9 [reland][inductor] make thread order consistent with loop order (#107902)
This PR relands https://github.com/pytorch/pytorch/pull/106827, which got reverted because it caused a compilation error for some ads models.

Yanbo provided a repro in one of the 14k models (`pytest ./generated/test_KaiyangZhou_deep_person_reid.py -k test_044`). This is also the model I used to confirm the fix and come up with a unit test. In this model, we call `triton_heuristics.triton_config` with size_hints [2048, 2]. Previously this would result in a triton config with XBLOCK=2048 and YBLOCK=2. But since we changed the mapping between size_hints and the XYZ dimensions, we now generate a triton config with XBLOCK=2 and YBLOCK=2048. This fails compilation since we set the max YBLOCK to be 1024.

My fix is to make sure we never generate a triton config that exceeds the maximum block size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107902
Approved by: https://github.com/jansel
2023-08-26 02:56:20 +00:00
PyTorch MergeBot
d35d7de60e Revert "Handle 2D blocking with foreach (#107840)"
This reverts commit f87ffe473d.

Reverted https://github.com/pytorch/pytorch/pull/107840 on behalf of https://github.com/huydhn due to Sorry for reverting this, but test_2d_blocking is failing in trunk, probably a landrace as PR was green ([comment](https://github.com/pytorch/pytorch/pull/107840#issuecomment-1694009217))
2023-08-25 22:49:15 +00:00
Michael Lazos
f87ffe473d Handle 2D blocking with foreach (#107840)
Previously, blocking in foreach ops was only 1D. This PR allows foreach to handle kernels with 2D blocking as well.

Code when at least one dim matches:
[example code + output](https://gist.github.com/mlazos/9f100b21cfe2540f0a24303a8349c196)

Code when neither the X nor the Y dim matches:
[example code + output](https://gist.github.com/mlazos/14e2a455f635896dface09be601595dd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107840
Approved by: https://github.com/jansel
2023-08-25 20:32:36 +00:00
Jackie (Jiaqi) Xu
398f4ae451 Back out "[inductor] make thread order consistent with loop order (#106827)" (#107796)
Summary:
D48295371 causes a batch fusion failure, which will block mc proposals on all mc models.
e.g. cmf f470938179

Test Plan: Without revert, f469732293. With revert diff f472266199.

Differential Revision: D48610062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107796
Approved by: https://github.com/yanboliang
2023-08-23 18:02:54 +00:00
lezcano
2b6249e209 Wrap indirect indexing on CUDA (#105055)
Lifting this to CPU should be rather easy. @jgong5
Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well.

This fix works with dynamic shapes as well.

@voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-23 11:59:20 +00:00
PyTorch MergeBot
b282787409 Revert "Wrap indirect indexing on CUDA (#105055)"
This reverts commit 85c673e6b2.

Reverted https://github.com/pytorch/pytorch/pull/105055 on behalf of https://github.com/peterbell10 due to Causes failure in inductor_torchbench ([comment](https://github.com/pytorch/pytorch/pull/105055#issuecomment-1688871947))
2023-08-22 20:24:41 +00:00
David Berard
ba5eeed4ac [inductor] Add CPU-side profiler event for triton kernels w/ python wrapper (#106351)
This allows you to view the original kernel names (e.g. to reference the triton kernel implementation in the python wrapper code / TORCH_COMPILE_DEBUG logs). `torch._inductor.config.unique_kernel_names=True` does this too, but leaving unique_kernel_names=False will increase triton caching.

Another benefit to this approach is that we can attach additional information to this profiler event in the future. For example, we could attach input shapes/strides (i.e. record_shapes=True for profiler), or possibly paths to the files where the code was dumped.

<img width="435" alt="Screenshot 2023-07-31 at 5 34 25 PM" src="https://github.com/pytorch/pytorch/assets/5067123/839b752f-3907-4f29-9038-9d1822222b45">

^ in the trace above, the pink "triton_poi_fused_add_cos_sin_0" kernel is the new trace event which is added by this PR.

**Performance impact**: [dashboard run](https://hud.pytorch.org/benchmark/compilers?startTime=Thu%2C%2010%20Aug%202023%2000%3A52%3A06%20GMT&stopTime=Thu%2C%2017%20Aug%202023%2000%3A52%3A06%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/davidberard98/216/orig&lCommit=90c4212a7993c3660e7ea53bcd9d21160be31d1a&rBranch=main&rCommit=35cca799ff42182a1b7f1ee4d0225ee879b7c924). There are some regressions, including a 1.72x -> 1.71x on huggingface and 1.30x -> 1.29x on dynamic; however, locally I can't reproduce the results on any of the individual models (differences look like they are within noise). I think the perf impact is likely < 1% overall.

Differential Revision: [D47941809](https://our.internmc.facebook.com/intern/diff/D47941809)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106351
Approved by: https://github.com/eellison, https://github.com/albanD
ghstack dependencies: #107195
2023-08-22 18:48:30 +00:00
lezcano
85c673e6b2 Wrap indirect indexing on CUDA (#105055)
Lifting this to CPU should be rather easy. @jgong5
Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well.

This fix works with dynamic shapes as well.

@voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-22 01:06:35 +00:00
Peter Bell
59c5424654 [inductor] Improve handling of index_expr with floating point dtypes (#105021)
I found that the upsample bicubic lowering was generating this line

```python
ops.index_expr(0.244094488188976*x0, torch.float32)
```

which is not good because triton's `ops.index_expr` expects integer expressions and dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105021
Approved by: https://github.com/lezcano
2023-08-21 03:09:53 +00:00
Peter Bell
18b1c2907d [inductor] Add ir.WelfordReduction with multiple outputs (#104725)
This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously.

Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
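The accumulator state and the pairwise combine described above can be sketched in plain Python using the standard parallel-variance formula (a model of the semantics, not Inductor's generated code):

```python
def welford_combine(a, b):
    # Each state is (mean, m2, weight), where m2 is the running sum of
    # squared deviations from the mean.
    mean_a, m2_a, n_a = a
    mean_b, m2_b, n_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (mean, m2, n)

def welford_reduce(xs):
    # One-pass accumulation; a multilayer reduction combines the partial
    # states of several such passes with welford_combine.
    state = (0.0, 0.0, 0)
    for x in xs:
        state = welford_combine(state, (x, 0.0, 1))
    return state

mean, m2, n = welford_reduce([1.0, 2.0, 3.0, 4.0])
print(mean, m2 / n)  # 2.5 1.25  (mean and biased variance)
```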

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
2023-08-18 08:18:01 +00:00
Wang, Eikan
9921b48558 Extend Inductor to support the third-party backend (#106874)
## Summary

This is a re-land PR for https://github.com/pytorch/pytorch/pull/100706 to address the compilation latency performance regression.

## Root Cause

Regarding the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming one-shot operation. It makes importing the `codegen.cpp` package slower because the package's `LoopLevel` is decorated with `@dataclasses.dataclass`, and the decorator invokes `codecache.pick_vec_isa()` to initialize the `simd_nelements` field of `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)

In terms of the Triton backend, it does not need this check at all, but we'd prefer to keep the code uniform. Therefore, the new design simultaneously registers `CppScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton. This brings additional overhead to the Triton backend.

```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling

        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)

    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling

        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```

## Solution

To resolve the compilation-latency regression for the Triton backend, we changed `LoopLevel` slightly ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` initialization into `__post_init__`, which brings the compilation performance back.
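The pattern behind the fix can be sketched in isolation. This is a simplified stand-in, not the actual `LoopLevel`: `expensive_probe` plays the role of `codecache.pick_vec_isa()`, and the field list is reduced to the relevant one:

```python
import dataclasses

calls = {"n": 0}

def expensive_probe():
    # Stand-in for codecache.pick_vec_isa(): slow, one-shot work.
    calls["n"] += 1
    return 16

# Problematic pattern: the default value is evaluated while the class body
# executes, i.e. at *import time* of the defining module.
@dataclasses.dataclass
class LoopLevelBefore:
    simd_nelements: int = expensive_probe()

# Fixed pattern: defer the probe to __post_init__, so importing the module
# stays cheap and the cost is only paid when an instance is actually built.
@dataclasses.dataclass
class LoopLevelAfter:
    simd_nelements: int = None

    def __post_init__(self):
        if self.simd_nelements is None:
            self.simd_nelements = expensive_probe()
```

Defining `LoopLevelBefore` already pays for one probe call at import time; `LoopLevelAfter` defers it to construction, and an explicitly passed value skips it entirely.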

## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:

- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`

- W/ PR #100706, the compilation latency is about **57~58 seconds**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```

- W/O PR #100706, the compilation latency is about **46~47 seconds**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```

This PR fixed the compilation performance regression.

- W/ this PR #106874, the compilation latency is about **47~48 seconds**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
2023-08-16 04:11:36 +00:00
lezcano
6d899571d6 Simplify sign lowering in triton (#107051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107051
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038, #107039
2023-08-14 21:01:50 +00:00
Shunting Zhang
6696a75ea8 [inductor] make thread order consistent with loop order (#106827)
I found that for a tiled kernel over a tensor with shape [a, b], we map 'a' to XBLOCK and 'b' to YBLOCK. However, 'a' should actually be the outer loop while 'b' corresponds to the inner loop; that is the order picked by our loop-ordering algorithm. Mapping 'a' to XBLOCK effectively assigns 'a' to the inner loop instead.

For a simple 'A + B.t()' kernel, making the loop order consistent brings a 1.027x speedup (1.938 ms -> 1.887 ms). Here are the dumps of the kernels:

- before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d
- after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551

I tried this on DistillGPT2 and found perf to be neutral, but that is because DistillGPT2 has only a single tiled pointwise kernel in its backward graph. Will check the dashboard.
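The effect of the inner-loop choice on memory access can be illustrated with plain Python. This is an assumed simplification: the real kernels index via XBLOCK/YBLOCK program ids rather than nested loops, but the visiting order is the point:

```python
def visit_offsets(a, b, inner_is_b):
    """Linear offsets touched for a row-major [a, b] tensor, where element
    (i, j) lives at offset i * b + j, with either dimension as the inner loop.
    """
    out = []
    if inner_is_b:                 # 'a' outer, 'b' inner: stride-1 accesses
        for i in range(a):
            for j in range(b):
                out.append(i * b + j)
    else:                          # 'b' outer, 'a' inner: stride-b accesses
        for j in range(b):
            for i in range(a):
                out.append(i * b + j)
    return out
```

With 'b' as the inner loop the offsets are contiguous; with 'a' inner, consecutive accesses jump by `b`, which maps poorly onto coalesced loads for a row-major operand.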

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827
Approved by: https://github.com/jansel
2023-08-11 17:05:21 +00:00
Peter Bell
fa65df3745 [inductor] Type triton size arguments in the kernel index_dtype (#106870)
`JITFunction._key_of` uses the value of the argument to distinguish between i32 and i64, but this fails if the value is used in indexing calculations whose result exceeds `INT_MAX`.

Instead, we should use `index_dtype`, which means all indexing calculations are performed in the same dtype.
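The hazard can be reproduced with plain integers. This is an illustrative model of i32 wraparound, not Triton code:

```python
INT32_MAX = 2**31 - 1

def as_i32(v):
    """Wrap an integer to signed 32-bit, mimicking i32 overflow semantics."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v > INT32_MAX else v

n, stride = 50_000, 50_000            # each value fits in i32 on its own...
assert n <= INT32_MAX and stride <= INT32_MAX
offset_i32 = as_i32(n * stride)       # ...but the index product wraps negative
offset_i64 = n * stride               # i64 (a plain Python int here) stays correct
```

Keying the specialization on an argument's value alone would pick i32 for both `n` and `stride` here even though the derived offset needs i64; typing size arguments with `index_dtype` avoids that mismatch.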

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106870
Approved by: https://github.com/lezcano
ghstack dependencies: #106626
2023-08-10 21:07:25 +00:00
Yanbo Liang
1819fe1324 Revert "Extend Inductor to support the third-party backend (#100706)" (#106652)
This reverts commit 05bd24bb35.

It caused a compilation-time regression on torchbench, huggingface, and dynamic models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652
Approved by: https://github.com/davidberard98, https://github.com/voznesenskym
2023-08-05 06:41:08 +00:00
Wang, Eikan
05bd24bb35 Extend Inductor to support the third-party backend (#100706)
This PR intends to extend Inductor to support third-party backends that focus only on code generation, just as the C++/OpenMP and Triton backends do.

Currently, the code generated by Inductor contains two major parts: the kernel, and the Python wrapper that glues the kernel calls together. Therefore, a third-party backend needs to customize both parts to generate its specific code.

- Python wrapper code generation

  Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code that glues the kernel calls together. Therefore, it is straightforward for a third-party backend to generate backend-specific Python wrapper code: it just needs to inherit from the `WrapperCodeGen` class and override the relevant member functions.

- Kernel code generation

  Kernel code generation is driven by different `Scheduling` classes. Hence, a third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` serve the C++/OpenMP and Triton backends, respectively, but there is no common `Scheduling` base class. Based on how scheduling is invoked, this PR abstracts a common `Scheduling` class containing the following member functions.

  -   [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
  - [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
  - [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
  - [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
  - [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for Triton. If the third-party backend behaves as a subclass of `TritonScheduling`, it can override or reuse it._
  - [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
  - [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function exists for Triton debugging purposes, but it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._

  The third-party backend needs to inherit from the `Scheduling` class and implement these functions.

Regarding some other code-generation classes like `CppKernel` and `TritonKernel`, they are used by, or are part of, the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define an interface for them and leaves the flexibility to the third-party backend, which can implement these classes from scratch or reuse them by inheriting and overriding.
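The registration scheme this design implies can be sketched with a toy registry. This is a hypothetical simplification; the real Inductor APIs, class names, and signatures may differ:

```python
_backends = {}

def register_backend_for_device(device, scheduling_cls, wrapper_cls):
    """Associate a device string with its code-generation classes."""
    _backends[device] = (scheduling_cls, wrapper_cls)

def get_scheduling_for_device(device):
    entry = _backends.get(device)
    return entry[0] if entry is not None else None

class WrapperCodeGen:
    """Base Python-wrapper generator a backend may subclass."""

class Scheduling:
    """Common scheduling interface (subset of the functions listed above)."""
    def group_fn(self, sizes):
        raise NotImplementedError
    def flush(self):
        raise NotImplementedError
    def can_fuse_vertical(self, node1, node2):
        raise NotImplementedError

# A third-party backend plugs in by subclassing and registering:
class MyScheduling(Scheduling):
    def group_fn(self, sizes):
        return tuple(sizes)
    def flush(self):
        pass
    def can_fuse_vertical(self, node1, node2):
        return False

if get_scheduling_for_device("my_device") is None:
    register_backend_for_device("my_device", MyScheduling, WrapperCodeGen)
```

The guard mirrors the `init_backend_registration` pattern shown in the earlier revert message: a backend registers only if no scheduling is already bound to its device.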

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
2023-08-02 05:13:51 +00:00
Michael Lazos
5cbd3fc412 [Inductor] Fuse non-foreach ops with foreach ops without iterating over all subnodes (#106008)
Previously, when fusing a single node into a foreach op, the scheduler would iterate over every subnode and check whether it could be fused. This PR adds a mapping so that the node to fuse with can be found more quickly by checking dependencies.
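A sketch of the idea, using illustrative data structures rather than the scheduler's actual ones: index the foreach group's subnodes by the buffers they write, so a candidate node's read set resolves its fusion partner by dictionary lookup instead of a scan over all subnodes.

```python
class Node:
    def __init__(self, name, reads=(), writes=()):
        self.name = name
        self.reads = set(reads)
        self.writes = set(writes)

class ForeachGroup:
    def __init__(self, subnodes):
        self.subnodes = subnodes
        # One-time index: output buffer -> producing subnode.
        self._writer = {buf: n for n in subnodes for buf in n.writes}

    def fusion_target(self, node):
        """Find the subnode producing one of `node`'s inputs.

        O(len(node.reads)) dictionary lookups instead of an
        O(len(self.subnodes)) scan per fusion candidate.
        """
        for buf in node.reads:
            sub = self._writer.get(buf)
            if sub is not None:
                return sub
        return None
```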

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106008
Approved by: https://github.com/jansel
2023-07-27 21:40:24 +00:00