Commit Graph

77 Commits

chunyuan
cc54da4877 Inductor cpp wrapper: fix FallbackKernel support (#100788)
Fixes cpp wrapper support for kernels that are not exposed in `torch.ops.aten`. The current PR limits the support scope to `repeat_interleave.Tensor`; follow-up PRs will cover more OPs.

The PR maps the python schema of the kernel to the cpp schema and uses `c10::Dispatcher::singleton().findSchemaOrThrow` to find the corresponding cpp OP.
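
For illustration, here is a minimal sketch (not the PR's code) of what that lookup corresponds to, inspected from the Python side; `repeat_interleave.Tensor` is the op the PR targets:

```python
# Hedged sketch: read the op's schema from Python to see the qualified name and
# overload that a C++ `findSchemaOrThrow(name, overload)` lookup would use.
import torch

op = torch.ops.aten.repeat_interleave.Tensor   # an OpOverload object
schema = op._schema                            # torch._C.FunctionSchema

print(schema.name)           # "aten::repeat_interleave" (namespace-qualified name)
print(schema.overload_name)  # "Tensor"
print(schema)                # full signature, including argument/return types
```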

The current support is limited and will raise `AssertionError` for unsupported cases.
The limitations include:
- only kernels that are not aliases are supported
- the args and returns of the kernel must not have `alias_info`
- output args must be `Tensor`s
- input args must be `Tensor`, `Optional[int]`, `Optional[float]` or `Optional[bool]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100788
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-05-15 00:45:44 +00:00
Aaron Gokaslan
dfe484a3b3 [BE]: Bugfix functorch and some generic typing improvements (#101337)
Fixes some typing bugs found with newer versions of mypy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101337
Approved by: https://github.com/ezyang
2023-05-14 14:20:56 +00:00
Bin Bao
03433080e6 [inductor] Support FallbackKernel in cpp wrapper codegen (#100553)
Summary: This works well for ops without kwargs. For ops with kwargs, we
need to register ordered_kwargs_for_cpp_kernel for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100553
Approved by: https://github.com/jansel
2023-05-07 14:33:53 +00:00
Edward Z. Yang
4101de342b Type torch._inductor.codegen.wrapper (#100657)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100657
Approved by: https://github.com/voznesenskym
2023-05-05 16:19:23 +00:00
Edward Z. Yang
f093ee1722 Prevent Triton from getting eagerly imported when importing torch._inductor (#100374)
This makes 'import torch._inductor.utils' go from 3.5s to 2.1s

See also https://github.com/openai/triton/issues/1599
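
As a hedged illustration (assumed and simplified, not the PR's actual change), the usual way to avoid such an import cost is to defer the import until first use:

```python
# Lazy-import sketch: probe for the package cheaply and defer the expensive
# `import triton` until a Triton feature is actually needed.
import functools
import importlib.util

@functools.lru_cache(None)
def has_triton() -> bool:
    return importlib.util.find_spec("triton") is not None

def get_triton():
    if not has_triton():
        raise RuntimeError("triton is not installed")
    import triton  # deferred: not paid at `import torch._inductor.utils` time
    return triton
```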

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100374
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:12 +00:00
Natalia Gimelshein
ff29722364 [inductor] Prevent reusing aliased buffers if aliases still have uses (#100332)
Fixes #100314
In dependencies, we should track not only the immediately used buffer but also the aliased buffers that point to it; otherwise we can reuse and overwrite the buffer while there are still pending uses.
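
A generic sketch of that rule (illustrative only, with hypothetical names; not inductor's buffer-reuse code):

```python
# A buffer may only be reused once it and every buffer aliasing it have no
# remaining (pending) uses.
class Buf:
    def __init__(self, name: str):
        self.name = name
        self.aliases = []   # other buffers that alias this one's storage
        self.users = 0      # number of pending uses

def can_reuse(buf: Buf) -> bool:
    return buf.users == 0 and all(a.users == 0 for a in buf.aliases)

a = Buf("buf0")
view = Buf("buf0_view")
a.aliases.append(view)
view.users = 1              # the alias still has a pending use
assert not can_reuse(a)     # so buf0 must not be reused/overwritten yet
```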

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100332
Approved by: https://github.com/jansel
2023-05-02 04:05:16 +00:00
Edward Z. Yang
2d8deffc1e Refactor repro/minifier into CLI; add analyze (#100226)
This is a two part PR; I can split it if you really want me to.

The first part is a refactor of the after aot repro/minifier scripts to come with a command line interface. I maintain exact BC with the previous interface (so, e.g., you still get a repro.py and a run_minifier.py that do the same thing as before), but each of these scripts also takes command line arguments now which you can use to customize what actually happens. Check `run_repro` for full documentation on the arguments.

The second part of this is an implementation of `analyze` subcommand on the new CLI for any repro.

<img width="1277" alt="image" src="https://user-images.githubusercontent.com/13564/235045677-8545aab7-5e83-4813-bbec-47783dc60122.png">

This facility is oriented towards accuracy debugging. It does several things:

1. It will run your model twice and check for nondeterminism in inductor/float64, *even* on intermediate inputs (our benchmarking nondeterminism test only checks for nondeterminism on the final output). This makes it easy to localize which operator is nondeterministic.
2. It will run your compiled model side-by-side with eager and float64 variants, and then report when things diverge too far, based on the RMSE delta from float64 (a minimal sketch of this kind of check appears below).

Importantly, it does all this without requiring every intermediate to be held in memory (which will cause an OOM on large repros, such as the one I tested this on.)
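
A minimal sketch of such an accuracy check, assuming a simple RMSE-based rule (the tolerance and helper names here are illustrative, not the actual analyze implementation):

```python
import torch

def rmse(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.sqrt(torch.mean((a.double() - b.double()) ** 2)).item()

def diverges(compiled: torch.Tensor, eager: torch.Tensor, fp64: torch.Tensor,
             tol: float = 4.0) -> bool:
    # Flag the compiled intermediate only if it is much further from the
    # float64 reference than eager already is (tolerance value is illustrative).
    return rmse(compiled, fp64) > tol * max(rmse(eager, fp64), 1e-12)
```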

Some other minor improvements:

* MinifierTestBase now has an easy-to-comment-out spot that you can use to retain the temporary directory; good for debugging
* We print "running minifier" and "running repro" in MinifierTestBase to make it easier to orient where logs are coming from
* same takes a `log_error` optional argument which you can use to reroute the error logs when things mismatch
* counters["inductor"]["intermediate_hooks"] tracks the number of intermediate hooks we've codegen'ed; good for populating the tqdm interface
* torch.fx.interpreter gets an official `boxed_run` interface which uses the boxed arguments calling convention and doesn't retain inputs unnecessarily long
* torch.utils._content_store gets compute_tensor_metadata/read_tensor_metadata helper functions for computing tensor information without serializing it

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100226
Approved by: https://github.com/bertmaher, https://github.com/bdhirsh, https://github.com/anijain2305
2023-05-01 11:12:38 +00:00
Edward Z. Yang
beb7f79517 Fix intermediate hooks on inplace buffers, enable it in testing (#100322)
Fixes https://github.com/pytorch/pytorch/issues/100312

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100322
Approved by: https://github.com/ngimel
2023-04-30 13:34:44 +00:00
Edward Z. Yang
54c0edf6da Track exact origin_node on best effort basis (#100110)
Currently, we track 'origins' on IR nodes so that we have some idea about what FX IR nodes contributed to any given fused kernel. However, the origins are dumped into an undifferentiated set, so if you have, e.g., multiple outputs, you cannot easily tell which output corresponds to which FX node.

This PR introduces a more precise notion of tracking, "origin_node", which says that the contents of this Buffer/Loop node correspond EXACTLY to the output of a particular FX node; e.g., if you serialized each intermediate while running the generated inductor code, you could compare them with the corresponding intermediates from the original FX graph.

Tracking origin_node in all cases requires quite a bit of effort, so this PR introduces the tracking on a strictly best-effort basis. The logic in torch/_inductor/graph.py sets up the associations, but only when it is "obvious" which IR node should get the assignment, and there is work in torch/_inductor/ir.py for propagating this information around as necessary. Like origins, origin_node is not a true dataclass field (as this would break all existing positional-arg call sites); instead, it is added post facto via `__post_init__`. At the moment, it is only valid for Buffer/Loop to have an origin_node, but we could imagine relaxing this in the future.
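
A simplified, hypothetical illustration of that post-facto pattern (not inductor's actual IR classes):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class BufferLike:
    name: str
    layout: Any

    def __post_init__(self):
        # Not a true dataclass field: set after construction so existing
        # positional-argument call sites keep working; filled in later on a
        # best-effort basis.
        self.origin_node: Optional[Any] = None

buf = BufferLike("buf0", layout=None)    # old positional call sites still work
buf.origin_node = "fx_node_placeholder"  # assigned only when it is unambiguous
```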

The payoff is in torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/triton.py where we currently just print the FX node name and the tensor (but a more useful integration will be coming later.)

I also introduce a debugging tool `debug_ir_traceback` which tracks tracebacks of where IRNodes were allocated, to help you understand why a node doesn't have an `origin_node`.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100110
Approved by: https://github.com/voznesenskym
2023-04-28 04:15:27 +00:00
Bin Bao
afa9d10ed6 [inductor] Support mixed device in cpp wrapper (#99950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99950
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-26 16:26:56 +00:00
Bin Bao
efded3f3e9 [inductor] Add cpp_wrapper support for FallbackKernel (#99887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99887
Approved by: https://github.com/ngimel
2023-04-26 01:03:53 +00:00
Bin Bao
e43918b93a [inductor] Fix AOTInductor (#99203)
Summary: Fix the broken AOTInductor flow and add a smoketest on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99203
Approved by: https://github.com/jansel
2023-04-25 14:42:12 +00:00
Bin Bao
e5501a967e [inductor] Support IndexPutFallback in cpp_wrapper (#98972)
Summary:
1) Make the fallback index_put generate the right cpp code in cpp_wrapper
2) Add a --cpp-wrapper option to common.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98972
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-13 15:41:03 +00:00
Bin Bao
0c0e5c574e [inductor] Consolidate constant_args and cpp_constant_args (#98742)
Summary: Refactor code to simplify the logic. Support convolution as an
extern call in CudaWrapperCodeGen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98742
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-12 11:59:08 +00:00
Bin Bao
ff9e34fb35 [inductor] Consolidate kernel and cpp_kernel for wrapper codegen (#98741)
Summary: refactor to simplify the wrapper codegen logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98741
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/ngimel
2023-04-12 11:59:08 +00:00
Peter Bell
48397cddd7 [inductor] Fix benchmark_compiled_module codegen with CppWrapperCodeGen (#98608)
The python function `benchmark_compiled_module` ends up using the C++ expression printer to print the sizes for `rand_strided`, so you get a set, e.g. `{2, 17}`, instead of a tuple `(2, 17)`. Here is a complete example from master:

```python
def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.uint8)
    return print_performance(lambda: call([arg0_1, arg1_1]), times=times, repeat=repeat)
```
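
Presumably, after the fix the Python expression printer is used here, so the generated benchmark would read (inferred from the description above, not copied from the PR):

```python
arg0_1 = rand_strided((2, 17), (17, 1), device='cpu', dtype=torch.float32)
arg1_1 = rand_strided((2, 17), (17, 1), device='cpu', dtype=torch.uint8)
```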

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98608
Approved by: https://github.com/ngimel
2023-04-08 00:55:51 +00:00
Bin Bao
152d65ae1d [reland][inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98534)
Summary: This is a reland of #98264.

When _inductor.config.cpp_wrapper is specified, we run a
two-pass wrapper codegen to generate wrapper code in cpp which calls
cuLaunchKernel to launch pre-compiled cuda kernels, and then call
load_inline to load that generated wrapper back into the python world.
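
A minimal sketch of the load_inline step described above (assumed and heavily simplified; a real generated wrapper would launch pre-compiled kernels via cuLaunchKernel instead of this toy body):

```python
import torch
from torch.utils.cpp_extension import load_inline

cpp_src = r"""
at::Tensor call(at::Tensor x) {
    // Placeholder body; the generated wrapper would set up and launch CUDA kernels here.
    return x * 2;
}
"""

# Compiles the C++ source on the fly and exposes `call` back to Python.
mod = load_inline(name="wrapper_demo", cpp_sources=cpp_src, functions=["call"])
print(mod.call(torch.ones(3)))
```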

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98534
Approved by: https://github.com/huydhn
2023-04-07 02:04:03 +00:00
PyTorch MergeBot
f228b3977b Revert "[inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98264)"
This reverts commit 77f32eb6cc.

Reverted https://github.com/pytorch/pytorch/pull/98264 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is failing in trunk due to a name error fake_mode_from_tensors is not defined 67d1a77086. This is probably a landrace
2023-04-06 19:00:09 +00:00
Bin Bao
77f32eb6cc [inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98264)
Summary: when _inductor.config.cpp_wrapper is specified, we run a
two-pass wrapper codegen to generate wrapper code in cpp which calls
cuLaunchKernel to launch pre-compiled cuda kernels, and then call
load_inline to load that generated wrapper back into the python world.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98264
Approved by: https://github.com/ngimel
2023-04-06 15:59:55 +00:00
Bin Bao
348dcf51e5 [inductor] Combine CppWrapperCodeGen and CppAotWrapperCodeGen (#98088)
Summary: Make CppAotWrapperCodeGen generate kernels and wrapper in one
file, which unifies the codegen for AOT and non-AOT mode. There will be
more refactoring for the AOT part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-06 15:59:55 +00:00
chunyuan
2987bc0758 Inductor cpp wrapper: support dynamic shapes (#97965)
1. Fixed dynamic shapes support in cpp_wrapper
   - fixed the cpp codegen of `size()` and `stride()`
   - fixed the cpp codegen of `ShapeAsConstantBuffer`
   - changed to use `cexpr` instead of `pexpr` in the cpp codegen of the `sizevar` (see the analogy sketch after this list)

2. Enabled dynamic shapes tests for cpp_wrapper
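
As a loose analogy for the `pexpr`/`cexpr` distinction (these are inductor's own printers; the sketch below just uses sympy's stock printers to show why a symbolic size must be printed differently for the Python and cpp wrappers):

```python
import sympy
from sympy import cxxcode, pycode

s0 = sympy.Symbol("s0", integer=True)
expr = s0 ** 2 + 1

print(pycode(expr))   # Python-flavored expression, e.g. "s0**2 + 1"
print(cxxcode(expr))  # C++-flavored expression, e.g. "std::pow(s0, 2) + 1"
```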

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97965
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-05 07:02:30 +00:00
Bin Bao
96f548a1ac [inductor] Add an AOT mode for the Triton backend (#98214)
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.

This PR implements a two-pass wrapper codegen for the Triton
backend to achieve ahead-of-time compilation. In the first pass, the
regular python wrapper code will be generated, and then the generated
code will be executed to perform Triton compilation and autotuning.
After that, the second pass wrapper codegen will generate C++ wrapper
with proper CUDA API to load and launch Triton-generated CUDA kernels.

Like the AOT mode for the cpp backend, the next step would be to provide
a more complete API for AOT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
2023-04-03 22:19:18 +00:00
chunyuan
0c1f524b92 Inductor cpp wrapper: support MKLPackedLinear (#90755)
Invoke `torch.ops.mkl._mkl_linear` from c++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90755
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-04-03 04:07:38 +00:00
Shunting Zhang
13461e9767 [inductor] more cuda metrics in wrapper (#97723)
The following metrics should be helpful:
- percent of time GPU is busy
- percent of time various categories of kernels (e.g. pointwise/reduction triton kernels) take
- percent of time each individual kernel takes compared to total wall time of the benchmark

This PR adds them.

Example result from the hf_Bert inference graph:

```
  == triton_pointwise category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_poi_fused_gelu_6_0d1d                  0.48154  12.0     5.52%
triton_poi_fused_clone_1_0d1d2                0.29011  24.0     3.33%
triton_poi_fused_clone_2_0d1d2                0.17417  12.0     2.00%
triton_poi_fused_clone_4_0d1d2                0.10797  12.0     1.24%
Total                                         1.05379           12.08%

  == triton_persistent_reduction category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_per_fused__softmax__to_                0.97188  12.0     11.14%
triton_per_fused_add_native_la                0.37401  24.0     4.29%
triton_per_fused_gelu_native_l                0.02     1.0      0.23%
triton_per_fused_add_embedding                0.01718  1.0      0.20%
Total                                         1.38307           15.86%

  == unknown category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
ampere_fp16_s16816gemm_fp16_12                2.24514  24.0     25.74%
ampere_fp16_s16816gemm_fp16_25                1.39796  49.0     16.03%
void cutlass::Kernel<cutlass_8                1.36093  1.0      15.61%
ampere_fp16_s16816gemm_fp16_64                0.74591  12.0     8.55%
ampere_fp16_s16816gemm_fp16_12                0.61989  12.0     7.11%
Memset (Device)                               0.024    12.0     0.28%
void at::native::(anonymous na                0.01543  2.03     0.18%
void at::native::vectorized_el                0.00011  0.03     0.00%
Total                                         6.40937           73.49%

Percent of time when GPU is busy: 101.44%
```

Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled but measure GPU time with profiling enabled, which may distort the measurement a bit. I assume the effect is not too large, since the profiler mostly increases CPU time (rather than GPU time).
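
A hedged sketch of how such a "percent of time GPU is busy" number can be computed (illustrative only, not the PR's wrapper code):

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

def gpu_busy_percent(fn) -> float:
    # Wall time measured without profiling...
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    wall_ms = (time.perf_counter() - t0) * 1e3

    # ...GPU time measured with profiling, as described above.
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        fn()
        torch.cuda.synchronize()
    cuda_ms = sum(e.self_cuda_time_total for e in prof.key_averages()) / 1e3
    return 100.0 * cuda_ms / wall_ms
```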

## interesting usages
1. I pick a model where cudagraphs improve perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```

Its backward graph has a lower percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```

2. I profile a subset of torchbench models and build a table showing the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category kernels. Since I plan to explore using the coordinate descent tuner to improve reductions, models spending a high percent of time on reductions should be good candidates (e.g. resnet50, mobilenet_v2).

NOTE: the same model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile each graph of a model separately.

```
benchmark_name           pointwise_percent    reduction_percent    persistent_reduction_percent    unknown_category_percent    GPU_busy_percent    wall_time_ms
-----------------------  -------------------  -------------------  ------------------------------  --------------------------  ------------------  --------------
resnet18                 19.73%               7.86%                4.81%                           41.25%                      73.65%              2.549ms
resnet18                 18.59%               7.13%                3.35%                           67.35%                      96.41%              3.467ms
resnet50                 29.57%               22.13%               2.07%                           51.68%                      105.46%             6.834ms
resnet50                 26.42%               15.27%               0.94%                           59.68%                      102.31%             13.346ms
vgg16                    26.23%               0.00%                0.00%                           74.20%                      100.43%             18.212ms
vgg16                    15.63%               5.61%                0.10%                           79.42%                      100.75%             33.485ms
BERT_pytorch             28.62%               4.82%                14.88%                          33.32%                      81.64%              7.162ms
BERT_pytorch             14.43%               13.41%               18.19%                          49.24%                      95.27%              10.395ms
densenet121              11.89%               2.14%                3.86%                           16.36%                      34.25%              16.531ms
densenet121              10.37%               2.06%                4.09%                           31.46%                      47.98%              16.934ms
hf_Bert                  23.94%               0.00%                29.88%                          46.09%                      99.90%              7.766ms
hf_Bert                  11.65%               10.54%               20.26%                          61.66%                      104.11%             11.892ms
nvidia_deeprecommender   42.92%               0.00%                0.00%                           56.75%                      99.67%              3.476ms
nvidia_deeprecommender   31.36%               3.44%                0.46%                           65.20%                      100.45%             3.872ms
alexnet                  30.99%               0.00%                0.00%                           69.16%                      100.14%             3.169ms
alexnet                  24.41%               4.83%                0.17%                           71.09%                      100.50%             4.709ms
mobilenet_v2             29.21%               27.79%               2.49%                           44.00%                      103.49%             10.160ms
mobilenet_v2             17.50%               15.05%               1.06%                           69.68%                      103.29%             20.715ms
resnext50_32x4d          18.96%               9.28%                2.31%                           28.79%                      59.33%              5.899ms
resnext50_32x4d          18.48%               11.01%               1.86%                           53.80%                      85.14%              7.167ms
mnasnet1_0               19.07%               14.52%               3.01%                           35.43%                      72.03%              6.028ms
mnasnet1_0               14.17%               12.00%               1.87%                           67.56%                      95.60%              9.225ms
squeezenet1_1            38.56%               0.00%                1.77%                           56.21%                      96.53%              2.221ms
squeezenet1_1            21.26%               7.57%                1.05%                           67.30%                      97.18%              4.942ms
timm_vision_transformer  17.05%               0.00%                18.80%                          65.79%                      101.64%             9.608ms
timm_vision_transformer  9.31%                9.07%                10.32%                          73.25%                      101.96%             16.814ms
```

## how to use
`python {compiled_module_wrapper.py} -p`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
2023-04-01 08:04:14 +00:00
Jason Ansel
1432a893ef Fix issue with single input cat (#97822)
Fixes #97695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97822
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-03-30 02:51:43 +00:00
Shunting Zhang
e1f44ee3b3 [inductor] correctly setup constant in the wrapper (#97571)
V.graph.constants such as seed_cuda_0 are not handled properly in the wrapper. Recently we moved the code that initializes constants from global scope into a function. That makes assigning to seed_cuda_0 create a new local variable rather than set the global variable.

Add 'global var_name' lines to maintain the same behavior as before.
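
A minimal illustration of the Python scoping issue being fixed (not inductor code):

```python
seed_cuda_0 = None

def init_constants_buggy():
    seed_cuda_0 = 42       # creates a new local; the module-level name stays None

def init_constants_fixed():
    global seed_cuda_0     # the fix: make the assignment target the global name
    seed_cuda_0 = 42

init_constants_buggy()
assert seed_cuda_0 is None
init_constants_fixed()
assert seed_cuda_0 == 42
```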

Test:

Run the forward graph for nvidia_deeprecommender's training run. It previously failed and now passes with the fix.

Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571
Approved by: https://github.com/ngimel
2023-03-28 03:10:53 +00:00
Shunting Zhang
652592efa9 [inductor] use torch.profiler in the triton wrapper (#97405)
I think it's helpful to use torch.profiler to profile the triton wrapper.

E.g., I tried it for nvidia_deeprecommender's inference graph.

Even with max-autotune, we see that the majority of the time the GPU is running 2 mm/addmm ops. That's why max-autotune does not help for this model: tuning does not affect the external mm ops.

<img width="711" alt="Screenshot 2023-03-22 at 5 49 28 PM" src="https://user-images.githubusercontent.com/52589240/227072474-2f0d7205-4a10-4929-b1b7-551214788c61.png">

As a next step, I'll check why the triton mm kernels are not picked.

EDIT: the above screenshot was captured without max-autotune due to a typo. Below is the trace with max-autotune enabled:
<img width="712" alt="Screenshot 2023-03-22 at 6 43 26 PM" src="https://user-images.githubusercontent.com/52589240/227077624-fdccf928-be08-4211-871b-a9e3d7b76fbe.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97405
Approved by: https://github.com/ngimel
2023-03-27 21:54:25 +00:00
Jason Ansel
5f57b36318 Rename torch._inductor.triton_ops.autotune to torch._inductor.triton_heuristics (#95558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95558
Approved by: https://github.com/Chillee
2023-03-23 17:41:19 +00:00
Wang, Eikan
517a432d6e [Inductor] Enable CppWrapper to support BF16 (#97089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97089
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-22 05:54:09 +00:00
chunyuan
adcd1b3077 inductor: support profiler_mark_wrapper_call in cpp wrapper (#97119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97119
Approved by: https://github.com/alexsio27444, https://github.com/jgong5, https://github.com/desertfire
2023-03-21 01:40:09 +00:00
Bin Bao
931a4913b1 [inductor] Refactor memory management code in wrapper codegen (#96768)
Summary: use inheritance to simplify CppWrapperCodeGen and to prepare for AOT codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96768
Approved by: https://github.com/jansel
2023-03-16 18:36:35 +00:00
Zachary DeVito
3162f71787 [memory debugging] Extract frame information from inductor (#95753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95753
Approved by: https://github.com/Chillee
2023-03-16 04:12:54 +00:00
Bin Bao
b60d6e246e [inductor] Consolidate codegen functions in sizevars.py into wrapper.py (#96654)
Summary: Refactor the code so that wrapper codegen doesn't mix Python and C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96654
Approved by: https://github.com/jansel
2023-03-14 22:55:12 +00:00
Bin Bao
f03db8d6cb [reland2][inductor] Add an AOT compilation mode for Inductor CPP backend (#96520)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822.
Solved the long compilation issue for inductor cpp tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96520
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-14 16:10:54 +00:00
Horace He
2a08a62777 Add extra metadata (as comments) to Inductor generated code (#96581)
New output
<img width="942" alt="image" src="https://user-images.githubusercontent.com/6355099/224794006-a993a2a8-d6ff-49da-8891-7b2373030a3d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96581
Approved by: https://github.com/ngimel, https://github.com/shunting314, https://github.com/voznesenskym
2023-03-14 03:59:59 +00:00
Shunting Zhang
cc699c56dc reland #96248 [inductor] show performance for each autotune config for a kernel (#96458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96458
Approved by: https://github.com/ngimel
2023-03-10 01:40:04 +00:00
PyTorch MergeBot
fe05266fda Revert "[reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)"
This reverts commit deaf9e5e65.

Reverted https://github.com/pytorch/pytorch/pull/95985 on behalf of https://github.com/huydhn due to Sorry for reverting this. It increased the test time significantly for ASAN (and may be other test shards). ASAN tests on PR passed but it was barely not timing out. I have updated my initial findings in https://github.com/pytorch/pytorch/issues/96378
2023-03-09 01:45:24 +00:00
Bin Bao
deaf9e5e65 [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95985
Approved by: https://github.com/jansel
2023-03-08 20:02:32 +00:00
Shunting Zhang
962b3f78bd [inductor] run all kernel benchmarks individually in a compiled module (#95845)
This is a follow-up to PR #95506 to run all the triton kernels in a compiled module individually, as suggested by Horace.

Here are the steps:
1. Run the model as usual with a benchmark script and with TORCHINDUCTOR_BENCHMARK_KERNEL enabled. e.g.
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --disable-cudagraphs --training
```
2. From the output we will see 3 lines like
```
Compiled module path: /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py
```
That's because we have one graph module for fwd/bwd/optimizer respectively. Each graph module will have one such output corresponding to its compiled module.

3. We can run the compiled module directly. Without any extra arguments, we maintain the previous behavior of running the call function, which does what the original graph module does but in a more efficient way. If we add the '-k' argument, we will run a benchmark for each individual kernel in the file.

```
python /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py -k
```

Example output:
<img width="430" alt="Screenshot 2023-03-01 at 4 51 06 PM" src="https://user-images.githubusercontent.com/52589240/222302996-814a85be-472b-463c-9e85-39d2c9d20e1a.png">

Note: I use the first 10 characters of the hash to identify each kernel since
1. the hash is easier to get in the code :)
2. a name like `triton__3` only makes sense within a compiled module, but a hash makes sense even without specifying the compiled module (assuming we have enough bytes of the hash)

If we find a triton kernel with a hash like c226iuf2wi performing poorly, we can look it up in the original compiled module file. This works because each compiled triton kernel is annotated with a comment containing its full hash.
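
Illustrative only (hypothetical helper, matching the spirit of the 10-character id described above; not inductor's actual code_hash implementation):

```python
import base64
import hashlib

def kernel_short_id(kernel_src: str, n: int = 10) -> str:
    digest = hashlib.sha256(kernel_src.encode()).digest()
    # base32, lowercased, gives ids similar in shape to the example above
    return base64.b32encode(digest).decode().lower()[:n]

print(kernel_short_id("@triton.jit\ndef kernel(...): ..."))  # e.g. a 'c226iuf2wi'-style id
```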

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95845
Approved by: https://github.com/Chillee
2023-03-06 21:30:33 +00:00
PyTorch MergeBot
879400e4e8 Revert "[inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)"
This reverts commit 73b66098b2.

Reverted https://github.com/pytorch/pytorch/pull/94822 on behalf of https://github.com/clee2000 due to broke inductor_tmm_cpu_accuracy, 73b66098b2 (11745396725)
2023-03-03 17:33:27 +00:00
Bin Bao
73b66098b2 [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
2023-03-03 14:18:09 +00:00
Will Constable
92a2107375 Support Inductor collectives with wait or collective outside graph (#95893)
Inductor implementations of collectives/wait must match
eager impls in _functional_collectives in terms of interacting
with _register_tensor_work API.  If they do, then splitting
a collective-wait pair so one half is in a compiled graph should
work fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95893
Approved by: https://github.com/kumpera
2023-03-03 09:00:48 +00:00
Jason Ansel
00ebbba623 Remove torch._inductor.config.triton.convolution (#95842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95842
Approved by: https://github.com/ngimel
2023-03-02 17:44:41 +00:00
Shunting Zhang
5d29b68bbc [inductor] generate triton kernel benchmark (#95506)
A PR to generate benchmark code for individual triton kernels. We can explore improving autotuning with the saved compiled kernel directly. This can potentially speed up our iteration and separate the concern from the upstream components that generate the compiled module.

Since I'm still ramping up on inductor, I'll reflect what I learned here so people can correct me if I'm wrong. In inductor, the WrapperCodeGen class is used to generate the compiled module for CUDA (or triton). Here is an example compiled module for a toy model like `def f(x): return sin(x) + cos(x)`: https://gist.github.com/shunting314/c6ed9f571919e3b414166f1696dcc61b . A compiled module contains the following parts:
- various triton kernels
- a wrapper (a method named `call`; the name is hardcoded) that calls the triton kernels and potentially ATen kernels to efficiently do the same work as the original FX graph being compiled by inductor
- some utility code that generates random inputs and runs the wrapper

The triton kernels in the compiled module are annotated with decorators like `pointwise`, which are used for autotuning.

This PR adds a config; enabling it triggers printing the path of the compiled module. It can be controlled from an environment variable as well.

The path to each compiled triton kernel is added as a comment in the compiled module, e.g.
```
# kernel path: /tmp/torchinductor_shunting/gn/cgn6x3mqoltu7q77gjnu2elwfupinsvcovqwibc6fhsoiy34tvga.py
triton__0 = async_compile.triton('''
import triton
import triton.language as tl
...
""")
````

Example command:
```
TORCHINDUCTOR_OUTPUT_COMPILED_MODULE_PATH=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training --dashboard --only AlbertForMaskedLM --disable-cudagraphs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95506
Approved by: https://github.com/Chillee
2023-03-01 18:29:07 +00:00
Edward Z. Yang
58648822b6 Handle int/float arguments for cpp codegen in inductor (#95533)
This is a little questionable because we don't actually know what the dtype of the sympy expression is, and it's not clear we can rely on the assumptions.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95533
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-28 03:57:35 +00:00
Horace He
01c861af14 Added utilities to instrument kernel bandwidth numbers (#95355)
Looks like

![image](https://user-images.githubusercontent.com/6355099/221048077-33aeff50-0951-42c9-89e9-22049db4f94d.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95355
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-24 17:51:11 +00:00
Nicolas Macchioni
dd7e2b7c0e [pt2][inductor] update choice caller hashes (#94853)
Summary:
Update the hashing method for the `ChoiceCaller` class.

`TritonTemplateCaller` objects will now be hashed to:
`{name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}`

for example:
`triton_mm-(64, 32, 32)-4-8-cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o`

`ExternKernelCaller` objects will now be hashed to:
`{name}-{kwargs.keys()[0]}={kwargs.vals()[0]}-...-{code_hash}`

for example:
`addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf`
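
A small sketch of composing those hash strings in the formats described above (illustrative; the field values are hypothetical):

```python
def triton_choice_hash(name, block_m, block_n, block_k, num_stages, num_warps, code_hash):
    # "{name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}"
    return f"{name}-({block_m}, {block_n}, {block_k})-{num_stages}-{num_warps}-{code_hash}"

def extern_choice_hash(name, kwargs, code_hash):
    # "{name}-{kwargs.keys()[0]}={kwargs.vals()[0]}-...-{code_hash}"
    kw = "-".join(f"{k}={v}" for k, v in kwargs.items())
    return f"{name}-{kw}-{code_hash}"

print(triton_choice_hash("triton_mm", 64, 32, 32, 4, 8, "cptlntwz..."))
print(extern_choice_hash("addmm", {"alpha": 1, "beta": 1}, "c4xxd3io..."))
```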

Test Plan: sandcastle

Differential Revision: D43285470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94853
Approved by: https://github.com/jansel, https://github.com/bertmaher
2023-02-16 00:11:26 +00:00
Natalia Gimelshein
a5daea69fb teach inductor to handle floor (#94341)
Per title; this happens when there's upsampling with a non-integer scale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-10 11:21:57 +00:00
PyTorch MergeBot
6007874bbb Revert "teach inductor to handle floor (#94341)"
This reverts commit e7df9aaec8.

Reverted https://github.com/pytorch/pytorch/pull/94341 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the CudaTest failure looks related.  It fails on both PR and trunk e7df9aaec8
2023-02-09 19:31:08 +00:00
Natalia Gimelshein
e7df9aaec8 teach inductor to handle floor (#94341)
Per title; this happens when there's upsampling with a non-integer scale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-09 17:09:35 +00:00