Commit Graph

60 Commits

Author SHA1 Message Date
PyTorch MergeBot
f228b3977b Revert "[inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98264)"
This reverts commit 77f32eb6cc.

Reverted https://github.com/pytorch/pytorch/pull/98264 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is failing in trunk due to a name error: fake_mode_from_tensors is not defined (67d1a77086). This is probably a landrace
2023-04-06 19:00:09 +00:00
Bin Bao
77f32eb6cc [inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98264)
Summary: When `_inductor.config.cpp_wrapper` is specified, we run a two-pass wrapper codegen to generate wrapper code in C++, which calls cuLaunchKernel to launch pre-compiled CUDA kernels, and then we call load_inline to load that generated wrapper back into the Python world.
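
A minimal sketch of the load_inline step (the trivial C++ function below is a stand-in, not the generated code; the real wrapper would dlopen the compiled cubins and launch them via cuLaunchKernel):

```python
# Sketch: load a generated C++ wrapper back into Python with load_inline.
from torch.utils.cpp_extension import load_inline

cpp_src = """
#include <torch/extension.h>
at::Tensor call_0(at::Tensor x) {
  // placeholder for the generated wrapper body
  return x * 2;
}
"""

wrapper = load_inline(name="cpp_wrapper_demo", cpp_sources=cpp_src, functions=["call_0"])
# wrapper.call_0(tensor) is now callable from Python, like the generated entry point.
```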

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98264
Approved by: https://github.com/ngimel
2023-04-06 15:59:55 +00:00
Bin Bao
348dcf51e5 [inductor] Combine CppWrapperCodeGen and CppAotWrapperCodeGen (#98088)
Summary: Make CppAotWrapperCodeGen generate kernels and wrapper in one
file, which unifies the codegen for AOT and non-AOT mode. There will be
more refactoring for the AOT part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-06 15:59:55 +00:00
chunyuan
2987bc0758 Inductor cpp wrapper: support dynamic shapes (#97965)
1. Fixed dynamic shapes support in cpp_wrapper
   - fixed the cpp codegen of `size()` and `stride()`
   - fixed the cpp codegen of `ShapeAsConstantBuffer`
   - changed to use `cexpr` instead of `pexpr` in the cpp codegen of the `sizevar` (see the sketch after this list)

2. Enabled dynamic shapes tests for cpp_wrapper
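
A rough sketch of why a separate C++ expression printer is needed for size expressions; this uses sympy's generic printers only to illustrate the idea and is not inductor's actual `pexpr`/`cexpr`:

```python
# Illustration only: the same symbolic size expression has to be printed
# differently for the Python wrapper vs the C++ wrapper.
import sympy

s0 = sympy.Symbol("s0", integer=True, positive=True)
size_expr = sympy.floor((s0 - 1) / 4) + 1   # a typical dynamic-shape size

print(sympy.pycode(size_expr))   # Python-flavored form for the py wrapper
print(sympy.cxxcode(size_expr))  # C++-flavored form for the cpp wrapper
```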

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97965
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-05 07:02:30 +00:00
Bin Bao
96f548a1ac [inductor] Add an AOT mode for the Triton backend (#98214)
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.

This PR implements a two-pass wrapper codegen for the Triton
backend to achieve ahead-of-time compilation. In the first pass, the
regular python wrapper code will be generated, and then the generated
code will be executed to perform Triton compilation and autotuning.
After that, the second-pass wrapper codegen will generate a C++ wrapper with the proper CUDA API calls to load and launch the Triton-generated CUDA kernels.
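
A minimal sketch of that two-pass flow, with hypothetical helper names (this is not inductor's actual API):

```python
# Pass 1: generate and execute Python wrapper code so Triton compiles and
# autotunes the kernels. Pass 2: emit a C++ wrapper that launches the
# already-compiled kernels. Helper names below are made up for illustration.
def first_pass_codegen(graph):
    return "def call(args):\n    return args\n"      # pretend Python wrapper

def second_pass_codegen(graph, compiled_kernels):
    return "// C++ wrapper using the CUDA driver API to launch saved kernels\n"

def aot_compile(graph, example_args):
    py_src = first_pass_codegen(graph)
    ns = {}
    exec(py_src, ns)               # executing the wrapper triggers Triton
    ns["call"](example_args)       # compilation and autotuning on real inputs
    compiled_kernels = {}          # would be gathered from the Triton cache
    return second_pass_codegen(graph, compiled_kernels)

print(aot_compile(graph=None, example_args=[]))
```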

Like the AOT mode for the cpp backend, the next step would be to provide
a more complete API for AOT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
2023-04-03 22:19:18 +00:00
chunyuan
0c1f524b92 Inductor cpp wrapper: support MKLPackedLinear (#90755)
Invoke `torch.ops.mkl._mkl_linear` from C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90755
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-04-03 04:07:38 +00:00
Shunting Zhang
13461e9767 [inductor] more cuda metrics in wrapper (#97723)
The following metrics should be helpful:
- percent of time the GPU is busy
- percent of time each category of kernels (e.g. pointwise/reduction triton kernels) takes
- percent of time each individual kernel takes compared to the total wall time of the benchmark

This PR adds those.

Example result from the hf_Bert inference graph:

```
  == triton_pointwise category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_poi_fused_gelu_6_0d1d                  0.48154  12.0     5.52%
triton_poi_fused_clone_1_0d1d2                0.29011  24.0     3.33%
triton_poi_fused_clone_2_0d1d2                0.17417  12.0     2.00%
triton_poi_fused_clone_4_0d1d2                0.10797  12.0     1.24%
Total                                         1.05379           12.08%

  == triton_persistent_reduction category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_per_fused__softmax__to_                0.97188  12.0     11.14%
triton_per_fused_add_native_la                0.37401  24.0     4.29%
triton_per_fused_gelu_native_l                0.02     1.0      0.23%
triton_per_fused_add_embedding                0.01718  1.0      0.20%
Total                                         1.38307           15.86%

  == unknown category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
ampere_fp16_s16816gemm_fp16_12                2.24514  24.0     25.74%
ampere_fp16_s16816gemm_fp16_25                1.39796  49.0     16.03%
void cutlass::Kernel<cutlass_8                1.36093  1.0      15.61%
ampere_fp16_s16816gemm_fp16_64                0.74591  12.0     8.55%
ampere_fp16_s16816gemm_fp16_12                0.61989  12.0     7.11%
Memset (Device)                               0.024    12.0     0.28%
void at::native::(anonymous na                0.01543  2.03     0.18%
void at::native::vectorized_el                0.00011  0.03     0.00%
Total                                         6.40937           73.49%

Percent of time when GPU is busy: 101.44%
```

Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure the wall time with profiling disabled but measure the GPU time with profiling enabled, which may distort the measurement a bit. But I assume the effect is not too large, assuming the profiler mostly increases CPU time (rather than GPU time).

## interesting usages
1. I pick a model where cudagraphs improves perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```

Its backward graph has a lower percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```

2. I profile a subset of torchbench models and plot a table showing the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category. Since I plan to explore using the coordinate descent tuner to improve reductions, models with a high percent of time spent on reductions should be good candidates (e.g. resnet50, mobilenet_v2).

NOTE: the same model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile different graphs for a model separately.

```
benchmark_name           pointwise_percent    reduction_percent    persistent_reduction_percent    unknown_category_percent    GPU_busy_percent    wall_time_ms
-----------------------  -------------------  -------------------  ------------------------------  --------------------------  ------------------  --------------
resnet18                 19.73%               7.86%                4.81%                           41.25%                      73.65%              2.549ms
resnet18                 18.59%               7.13%                3.35%                           67.35%                      96.41%              3.467ms
resnet50                 29.57%               22.13%               2.07%                           51.68%                      105.46%             6.834ms
resnet50                 26.42%               15.27%               0.94%                           59.68%                      102.31%             13.346ms
vgg16                    26.23%               0.00%                0.00%                           74.20%                      100.43%             18.212ms
vgg16                    15.63%               5.61%                0.10%                           79.42%                      100.75%             33.485ms
BERT_pytorch             28.62%               4.82%                14.88%                          33.32%                      81.64%              7.162ms
BERT_pytorch             14.43%               13.41%               18.19%                          49.24%                      95.27%              10.395ms
densenet121              11.89%               2.14%                3.86%                           16.36%                      34.25%              16.531ms
densenet121              10.37%               2.06%                4.09%                           31.46%                      47.98%              16.934ms
hf_Bert                  23.94%               0.00%                29.88%                          46.09%                      99.90%              7.766ms
hf_Bert                  11.65%               10.54%               20.26%                          61.66%                      104.11%             11.892ms
nvidia_deeprecommender   42.92%               0.00%                0.00%                           56.75%                      99.67%              3.476ms
nvidia_deeprecommender   31.36%               3.44%                0.46%                           65.20%                      100.45%             3.872ms
alexnet                  30.99%               0.00%                0.00%                           69.16%                      100.14%             3.169ms
alexnet                  24.41%               4.83%                0.17%                           71.09%                      100.50%             4.709ms
mobilenet_v2             29.21%               27.79%               2.49%                           44.00%                      103.49%             10.160ms
mobilenet_v2             17.50%               15.05%               1.06%                           69.68%                      103.29%             20.715ms
resnext50_32x4d          18.96%               9.28%                2.31%                           28.79%                      59.33%              5.899ms
resnext50_32x4d          18.48%               11.01%               1.86%                           53.80%                      85.14%              7.167ms
mnasnet1_0               19.07%               14.52%               3.01%                           35.43%                      72.03%              6.028ms
mnasnet1_0               14.17%               12.00%               1.87%                           67.56%                      95.60%              9.225ms
squeezenet1_1            38.56%               0.00%                1.77%                           56.21%                      96.53%              2.221ms
squeezenet1_1            21.26%               7.57%                1.05%                           67.30%                      97.18%              4.942ms
timm_vision_transformer  17.05%               0.00%                18.80%                          65.79%                      101.64%             9.608ms
timm_vision_transformer  9.31%                9.07%                10.32%                          73.25%                      101.96%             16.814ms
```

## how to use
`python {compiled_module_wrapper.py} -p`
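
For reference, a sketch (not this PR's exact code) of how such per-kernel percentages and the GPU-busy figure can be derived from torch.profiler:

```python
# Sketch: measure wall time without profiling, then attribute per-kernel
# self CUDA time with torch.profiler and report it as a percent of wall time.
import time
import torch
from torch.profiler import profile, ProfilerActivity

def kernel_time_report(fn, args, iters=50):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    wall_ms = (time.perf_counter() - start) * 1e3 / iters

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()

    busy_ms = 0.0
    for evt in prof.key_averages():
        cuda_ms = evt.self_cuda_time_total / 1e3 / iters   # us -> ms per iter
        if cuda_ms > 0:
            busy_ms += cuda_ms
            print(f"{evt.key[:30]:<30} {cuda_ms:8.5f} ms  {100 * cuda_ms / wall_ms:6.2f}%")
    print(f"Percent of time when GPU is busy: {100 * busy_ms / wall_ms:.2f}%")
```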

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
2023-04-01 08:04:14 +00:00
Jason Ansel
1432a893ef Fix issue with single input cat (#97822)
Fixes #97695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97822
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-03-30 02:51:43 +00:00
Shunting Zhang
e1f44ee3b3 [inductor] correctly setup constant in the wrapper (#97571)
V.graph.constants like seed_cuda_0 are not handled properly in the wrapper. Recently we moved the code that initializes constants from the global scope into a function. That makes an assignment to seed_cuda_0 create a new local variable rather than set up the global variable.

Add 'global var_name' lines to maintain the same behavior as before.
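
A minimal sketch of the bug and the fix (names are illustrative):

```python
# Assigning inside a function creates a new local unless the name is declared
# global first -- that is what the added 'global var_name' lines restore.
seed_cuda_0 = None          # module-level constant slot set up by the wrapper

def init_constants_broken(value):
    seed_cuda_0 = value     # creates a local; the module-level name stays None

def init_constants_fixed(value):
    global seed_cuda_0      # the extra line this PR generates
    seed_cuda_0 = value     # now actually sets up the module-level constant

init_constants_fixed(123)
assert seed_cuda_0 == 123
```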

Test:

Run the forward graph for nvidia_deeprecommender's training run. It previously failed and now passes with the fix.

Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571
Approved by: https://github.com/ngimel
2023-03-28 03:10:53 +00:00
Shunting Zhang
652592efa9 [inductor] use torch.profiler in the triton wrapper (#97405)
I think it's helpful to use torch.profiler to profile the triton wrapper.
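
A sketch of what that looks like (assumed shape, not this PR's exact code): wrap the compiled module's `call` with torch.profiler and export a trace for inspection.

```python
# Sketch: profile one invocation of a compiled module's call() and dump both a
# chrome trace and a table sorted by self CUDA time.
from torch.profiler import profile, ProfilerActivity

def profile_wrapper(call, args, trace_path="/tmp/wrapper_trace.json"):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        call(args)
    prof.export_chrome_trace(trace_path)   # view in chrome://tracing or Perfetto
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```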

E.g., I tried it for nvidia_deeprecommender's inference graph.

Even with max-autotune, we see that the majority of the time the GPU is running 2 mm/addmm ops. That's why max-autotune does not help for this model: tuning does not affect the external mm ops.

<img width="711" alt="Screenshot 2023-03-22 at 5 49 28 PM" src="https://user-images.githubusercontent.com/52589240/227072474-2f0d7205-4a10-4929-b1b7-551214788c61.png">

As a next step, I'll check why the triton mm kernels are not picked.

EDIT: the above screenshot was captured without max-autotune due to a typo. Below is the trace with max-autotune enabled:
<img width="712" alt="Screenshot 2023-03-22 at 6 43 26 PM" src="https://user-images.githubusercontent.com/52589240/227077624-fdccf928-be08-4211-871b-a9e3d7b76fbe.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97405
Approved by: https://github.com/ngimel
2023-03-27 21:54:25 +00:00
Jason Ansel
5f57b36318 Rename torch._inductor.triton_ops.autotune to torch._inductor.triton_heuristics (#95558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95558
Approved by: https://github.com/Chillee
2023-03-23 17:41:19 +00:00
Wang, Eikan
517a432d6e [Inductor] Enable CppWrapper to support BF16 (#97089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97089
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-22 05:54:09 +00:00
chunyuan
adcd1b3077 inductor: support profiler_mark_wrapper_call in cpp wrapper (#97119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97119
Approved by: https://github.com/alexsio27444, https://github.com/jgong5, https://github.com/desertfire
2023-03-21 01:40:09 +00:00
Bin Bao
931a4913b1 [inductor] Refactor memory management code in wrapper codegen (#96768)
Summary: use inheritance to simplify CppWrapperCodeGen and to prepare for AOT codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96768
Approved by: https://github.com/jansel
2023-03-16 18:36:35 +00:00
Zachary DeVito
3162f71787 [memory debugging] Extract frame information from inductor (#95753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95753
Approved by: https://github.com/Chillee
2023-03-16 04:12:54 +00:00
Bin Bao
b60d6e246e [inductor] Consolidate codegen functions in sizevars.py into wrapper.py (#96654)
Summary: Refactor the code so that wrapper codegen doesn't mix Python and C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96654
Approved by: https://github.com/jansel
2023-03-14 22:55:12 +00:00
Bin Bao
f03db8d6cb [reland2][inductor] Add an AOT compilation mode for Inductor CPP backend (#96520)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822.
Solved the long compilation issue for inductor cpp tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96520
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-14 16:10:54 +00:00
Horace He
2a08a62777 Add extra metadata (as comments) to Inductor generated code (#96581)
New output
<img width="942" alt="image" src="https://user-images.githubusercontent.com/6355099/224794006-a993a2a8-d6ff-49da-8891-7b2373030a3d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96581
Approved by: https://github.com/ngimel, https://github.com/shunting314, https://github.com/voznesenskym
2023-03-14 03:59:59 +00:00
Shunting Zhang
cc699c56dc reland #96248 [inductor] show performance for each autotune config for a kernel (#96458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96458
Approved by: https://github.com/ngimel
2023-03-10 01:40:04 +00:00
PyTorch MergeBot
fe05266fda Revert "[reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)"
This reverts commit deaf9e5e65.

Reverted https://github.com/pytorch/pytorch/pull/95985 on behalf of https://github.com/huydhn due to Sorry for reverting this. It increased the test time significantly for ASAN (and maybe other test shards). ASAN tests on the PR passed, but they were barely not timing out. I have updated my initial findings in https://github.com/pytorch/pytorch/issues/96378
2023-03-09 01:45:24 +00:00
Bin Bao
deaf9e5e65 [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95985
Approved by: https://github.com/jansel
2023-03-08 20:02:32 +00:00
Shunting Zhang
962b3f78bd [inductor] run all kernel benchmarks individually in a compiled module (#95845)
This is a follow-up to PR #95506 to run all the triton kernels in a compiled module individually, as suggested by Horace.

Here are the steps:
1. Run the model as usual with a benchmark script and with TORCHINDUCTOR_BENCHMARK_KERNEL enabled. e.g.
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --disable-cudagraphs --training
```
2. From the output we will see 3 lines like
```
Compiled module path: /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py
```
That's because we have one graph module for fwd/bwd/optimizer respectively. Each graph module will have one such output line corresponding to its compiled module.

3. We can run the compiled module directly. Without any extra arguments, we just maintain the previous behavior of running the call function -- which does what the original graph module does, but in a more efficient way. If we add the '-k' argument, we will run a benchmark for each individual kernel in the file.

```
python /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py -k
```

Example output:
<img width="430" alt="Screenshot 2023-03-01 at 4 51 06 PM" src="https://user-images.githubusercontent.com/52589240/222302996-814a85be-472b-463c-9e85-39d2c9d20e1a.png">

Note: I use the first 10 characters of the hash to identify each kernel since
1. the hash is easier to get in the code :)
2. a name like `triton__3` only makes sense within a compiled module, but a hash makes sense even without specifying the compiled module (assuming we have enough bytes of the hash)

If we find a triton kernel with a hash like c226iuf2wi that has poor performance, we can look it up in the original compiled module file. This works since we comment each compiled triton kernel with its full hash.
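
A sketch of the naming idea (the hash function here is an assumption; inductor's actual code hash may differ):

```python
# Identify each kernel by the first 10 characters of a hash of its source,
# so the id stays meaningful outside of any particular compiled module.
import hashlib

def kernel_id(triton_src: str) -> str:
    return hashlib.sha256(triton_src.encode()).hexdigest()[:10]

print(kernel_id("@pointwise\ndef triton__3(in_ptr0, out_ptr0, xnumel): ..."))
```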

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95845
Approved by: https://github.com/Chillee
2023-03-06 21:30:33 +00:00
PyTorch MergeBot
879400e4e8 Revert "[inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)"
This reverts commit 73b66098b2.

Reverted https://github.com/pytorch/pytorch/pull/94822 on behalf of https://github.com/clee2000 due to broke inductor_tmm_cpu_accuracy, 73b66098b2 (11745396725)
2023-03-03 17:33:27 +00:00
Bin Bao
73b66098b2 [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.
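
Conceptually (a sketch only; the path is a placeholder and the entry function's call signature is not shown in this summary), the produced .so can be opened and its entry symbol resolved like so:

```python
# Sketch: resolve the aot_inductor_entry symbol from the AOT-compiled library.
import ctypes
import os

so_path = "/tmp/torchinductor_xxx/aot_model.so"   # placeholder path
if os.path.exists(so_path):
    lib = ctypes.CDLL(so_path)
    entry = lib.aot_inductor_entry                 # entry function named above
    print("resolved:", entry)
else:
    print("build the AOT .so first, then point so_path at it")
```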

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
2023-03-03 14:18:09 +00:00
Will Constable
92a2107375 Support Inductor collectives with wait or collective outside graph (#95893)
Inductor implementations of collectives/wait must match the eager impls in _functional_collectives in terms of interacting with the _register_tensor_work API. If they do, then splitting a collective-wait pair so that one half is in a compiled graph should work fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95893
Approved by: https://github.com/kumpera
2023-03-03 09:00:48 +00:00
Jason Ansel
00ebbba623 Remove torch._inductor.config.triton.convolution (#95842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95842
Approved by: https://github.com/ngimel
2023-03-02 17:44:41 +00:00
Shunting Zhang
5d29b68bbc [inductor] generate triton kernel benchmark (#95506)
A PR to generate benchmark code for individual triton kernels. We can explore improving autotuning with the saved compiled kernel directly. This can potentially speed up our iteration and separate the concern from the upstream components that generate the compiled module.

Since I'm still ramping up on inductor, I'll reflect what I learned here so people can correct me if I'm wrong. In inductor, the WrapperCodeGen class is used to generate the compiled module for CUDA (or triton). Here is an example compiled module for a toy model like `def f(x): return sin(x) + cos(x)`: https://gist.github.com/shunting314/c6ed9f571919e3b414166f1696dcc61b . A compiled module contains the following parts:
- various triton kernels
- a wrapper (a method named `call`; the name is hardcoded) that calls the triton kernels and potentially ATen kernels to efficiently do the same work as the original FX graph being compiled by inductor
- some utility code that generates random inputs and runs the wrapper

The triton kernels in the compiled module are annotated with decorators like `pointwise`, which are used for autotuning.

This PR adds a config; enabling it will trigger the path of the compiled module to be printed. It can be controlled from an environment variable as well.

The path to each compiled triton kernel is added as a comment in the compiled module. E.g.
```
# kernel path: /tmp/torchinductor_shunting/gn/cgn6x3mqoltu7q77gjnu2elwfupinsvcovqwibc6fhsoiy34tvga.py
triton__0 = async_compile.triton('''
import triton
import triton.language as tl
...
""")
````

Example command:
```
TORCHINDUCTOR_OUTPUT_COMPILED_MODULE_PATH=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training --dashboard --only AlbertForMaskedLM --disable-cudagraphs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95506
Approved by: https://github.com/Chillee
2023-03-01 18:29:07 +00:00
Edward Z. Yang
58648822b6 Handle int/float arguments for cpp codegen in inductor (#95533)
This is a little questionable because we don't actually know what the dtype of the sympy expression is, and it's not clear we can rely on the assumptions.
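
A sketch of the kind of judgment call described above (hypothetical helper, not this PR's code): pick a C++ scalar type from what the sympy expression appears to be.

```python
# Pick "long" when sympy can prove the expression is an integer; otherwise
# fall back to "double", since the true dtype is not actually known.
import sympy

def cpp_scalar_type(expr: sympy.Expr) -> str:
    return "long" if expr.is_integer else "double"

s0 = sympy.Symbol("s0", integer=True)
print(cpp_scalar_type(s0 + 1))   # long
print(cpp_scalar_type(s0 / 2))   # double (is_integer is not provably True)
```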

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95533
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-28 03:57:35 +00:00
Horace He
01c861af14 Added utilities to instrument kernel bandwidth numbers (#95355)
Looks like

![image](https://user-images.githubusercontent.com/6355099/221048077-33aeff50-0951-42c9-89e9-22049db4f94d.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95355
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-24 17:51:11 +00:00
Nicolas Macchioni
dd7e2b7c0e [pt2][inductor] update choice caller hashes (#94853)
Summary:
update the hashing method for `ChoiceCaller` class.

`TritonTemplateCaller` objects will now be hashed to:
`{name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}`

for example:
`triton_mm-(64, 32, 32)-4-8-cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o`

`ExternKernelCaller` objects will now be hashed to:
`{name}-{kwargs.keys()[0]}={kwargs.vals()[0]}-...-{code_hash}`

for example:
`addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf`
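
A sketch of how such hash strings are put together (construction assumed from the formats above, not copied from the diff):

```python
# Build the two hash formats described in this summary.
def triton_template_hash(name, block_m, block_n, block_k, num_stages, num_warps, code_hash):
    return f"{name}-({block_m}, {block_n}, {block_k})-{num_stages}-{num_warps}-{code_hash}"

def extern_kernel_hash(name, kwargs, code_hash):
    kv = "-".join(f"{k}={v}" for k, v in kwargs.items())
    return f"{name}-{kv}-{code_hash}"

print(triton_template_hash("triton_mm", 64, 32, 32, 4, 8, "cptlntwzcl2..."))
print(extern_kernel_hash("addmm", {"alpha": 1, "beta": 1}, "c4xxd3iocu4..."))
```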

Test Plan: sandcastle

Differential Revision: D43285470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94853
Approved by: https://github.com/jansel, https://github.com/bertmaher
2023-02-16 00:11:26 +00:00
Natalia Gimelshein
a5daea69fb teach inductor to handle floor (#94341)
Per title; this happens when there's upsampling with a non-integer scale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-10 11:21:57 +00:00
PyTorch MergeBot
6007874bbb Revert "teach inductor to handle floor (#94341)"
This reverts commit e7df9aaec8.

Reverted https://github.com/pytorch/pytorch/pull/94341 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the CudaTest failure looks related. It fails on both the PR and trunk (e7df9aaec8)
2023-02-09 19:31:08 +00:00
Natalia Gimelshein
e7df9aaec8 teach inductor to handle floor (#94341)
Per title; this happens when there's upsampling with a non-integer scale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-09 17:09:35 +00:00
Eli Uriegas
567e6152da Revert "[inductor] fix crash issue when input is a view tensor (#90150)" (#94329)
Had to provide a merge conflict resolution due to conflicts with https://github.com/pytorch/pytorch/pull/94118

This was causing issues with internal tests that look similar to:
```
in clone_preserve_strides
    x.size(), x.stride(), x.storage_offset()
AttributeError: 'KeyedJaggedTensor' object has no attribute 'size'
```

See https://fburl.com/testinfra/nc0du2sp for more information

This reverts commit #90150

@jansel can you help @blzheng with re-landing this as a co-development diff?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94329
Approved by: https://github.com/jansel
2023-02-07 20:45:58 +00:00
Natalia Gimelshein
ca8450849b compute dynamic tensor shapes for indexing on the host (#93872)
Hoists the computation of some shapes used in triton kernel indexing to the host, so the resulting triton code is
```
x1 = (xindex // pks0) % 64
```
instead of
```
x1 = (xindex // (1 + (((((-1) + ks0) // 4))*((((-1) + ks0) // 4))) + (2*((((-1) + ks0) // 4))))) % 64
```
with the `pks0` arg computed on the host
```
ps0 = (1 + ((((-1) + s2) // 4)))*(1 + ((((-1) + s2) // 4)))
```
It doesn't work yet for indexing expressions that appear directly in the `load` statement, e.g.
```
tmp0 = tl.load(in_ptr0 + (r1 + x0 + (x0*(((((-1) + ks0) // 32))*((((-1) + ks0) // 32)))) + (2*x0*((((-1) + ks0) // 32)))), rmask & xmask, eviction_policy='evict_last').to(tl.float32)
```
Unfortunately, `unet`, which is one of the examples failing with floor, does the latter:
```
tmp1 = ((-1)*(1/(((-1) + (floor(2.0*(ks0//16))))))) + ((1/(((-1) + (floor(2.0*(ks0//16))))))*(ks0 // 16))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93872
Approved by: https://github.com/jansel
2023-02-03 09:58:39 +00:00
blzheng
a71395dd88 [inductor] fix crash issue when input is a view tensor (#90150)
Fix the crash failure mentioned in https://github.com/pytorch/pytorch/issues/93460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90150
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-03 04:54:14 +00:00
PyTorch MergeBot
5d259425fc Revert "[inductor] fix crash issue when input is a view tensor (#90150)"
This reverts commit b11ec270ba.

Reverted https://github.com/pytorch/pytorch/pull/90150 on behalf of https://github.com/clee2000 due to failing test_inplace_unsqueeze3 (__main__.CPUReproTests) https://github.com/pytorch/pytorch/actions/runs/4074618739/jobs/7020199369 b11ec270ba, marking as a landrace because all jobs are green on the PR
2023-02-02 17:06:34 +00:00
Will Constable
a14e3190e3 Mark buffers that reuse other buffers (#93329)
Provides a way at codegen time to emit code conditioned on
having a fresh allocation vs reusing an input.

- For collective ops, if reusing an input, a copy can be skipped
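
A rough sketch of the codegen decision (hypothetical helper, not this PR's API):

```python
# Emit different wrapper code depending on whether the collective's input
# buffer reuses an existing buffer (copy can be skipped) or is freshly
# allocated (copy still needed).
def codegen_collective_input(buf_name: str, reuses_input: bool) -> str:
    if reuses_input:
        return f"collective_in = {buf_name}  # reuses an input, skip the copy"
    return f"collective_in = {buf_name}.clone()  # fresh allocation, copy needed"

print(codegen_collective_input("buf3", reuses_input=True))
print(codegen_collective_input("arg0_1", reuses_input=False))
```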

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93329
Approved by: https://github.com/jansel
2023-02-02 14:22:26 +00:00
blzheng
b11ec270ba [inductor] fix crash issue when input is a view tensor (#90150)
Fix the crash failure mentioned in https://github.com/pytorch/pytorch/issues/93460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90150
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-02 12:49:26 +00:00
Edward Z. Yang
ca9ebf9e2b Delete dynamo_import and inductor_import (#93851)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93851
Approved by: https://github.com/albanD, https://github.com/jansel
2023-02-02 01:51:29 +00:00
Wu, Chunyuan
42633cf5f9 Inductor cpp wrapper: cache the loading of the kernel (#89742)
### Pitch
Cache the loaded kernel to reduce the overhead.

#### Code before:
```cpp
std::vector<at::Tensor> call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    ...
    auto kernel_cpp_0_lib = dlopen("/tmp/torchinductor_xxx/yr/cyr3uymlc6pgvnimx3fnynaa4t7ldafeqzhe5zpizmvorisx4hb2.so", RTLD_NOW);
    assert(kernel_cpp_0_lib != nullptr);
    void (*kernel_cpp_0)(const float*,const float*,float*,float*);
    *(void **) (&kernel_cpp_0) = dlsym(kernel_cpp_0_lib, "kernel");
    kernel_cpp_0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    ...
}
```

#### Code after:
```cpp
template <typename KernelFunc>
KernelFunc load_cpp_kernel(const char* so_filename) {
    KernelFunc kernel_cpp;
    auto kernel_cpp_lib = dlopen(so_filename, RTLD_NOW);
    assert(kernel_cpp_lib != nullptr);
    *(void **) (&kernel_cpp) = dlsym(kernel_cpp_lib, "kernel");
    return kernel_cpp;
}

std::vector<at::Tensor> call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    ...
    static auto kernel_cpp_0 = load_cpp_kernel<void (*)(const float*,const float*,float*,float*)>("/tmp/torchinductor_xxx/yr/cyr3uymlc6pgvnimx3fnynaa4t7ldafeqzhe5zpizmvorisx4hb2.so");
    kernel_cpp_0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89742
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-01 05:05:50 +00:00
Jason Ansel
9b173b87b2 Refactor away leftover import indirection (#92188)
These indirect ways of importing are a leftover from when we wanted to support both `import torchdynamo` and `import torch._dynamo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92188
Approved by: https://github.com/desertfire
2023-01-18 04:53:05 +00:00
Natalia Gimelshein
5625f521a4 generate set_device call to ensure context existence (#92055)
Hopefully Fixes https://github.com/pytorch/torchdynamo/issues/2026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92055
Approved by: https://github.com/wconstab
2023-01-12 17:23:49 +00:00
Peter Bell
eece6da162 [inductor] Reduce device context manager overhead (#91045)
This adds `torch.cuda._DeviceGuard`, which is a stripped-down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts an `int` as
the device, so we don't need to call `_get_device_index`, and it is implemented
with a new C++ helper, `torch._C._cuda_exchangeDevice`, that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94us with this simple benchmark:

```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```
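
A simplified sketch of such a guard (using the helper named above; this is not the actual `_DeviceGuard` implementation):

```python
# Enter/exit are each a single call into the C++ helper, which swaps the
# current device and returns the previous one.
import torch

class SimpleDeviceGuard:
    def __init__(self, index: int):
        self.index = index
        self.prev = -1

    def __enter__(self):
        self.prev = torch._C._cuda_exchangeDevice(self.index)

    def __exit__(self, *exc):
        if self.prev >= 0:
            torch._C._cuda_exchangeDevice(self.prev)

with SimpleDeviceGuard(0):
    pass
```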

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-01-12 16:51:59 +00:00
Jason Ansel
7c1c239db1 [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91575)
This reverts commit 94262efc7d to reland #91105 / #90738.

Fixes https://github.com/pytorch/torchdynamo/issues/2015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91575
Approved by: https://github.com/ngimel
2023-01-11 00:08:03 +00:00
PyTorch MergeBot
94262efc7d Revert "[inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)"
This reverts commit d6dd2e97da.

Reverted https://github.com/pytorch/pytorch/pull/91105 on behalf of https://github.com/atalman due to Broke internal builds
2022-12-21 00:02:38 +00:00
Jason Ansel
d6dd2e97da [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)
https://github.com/pytorch/pytorch/pull/90738 seems a bit borked. ghimport fails on it, and I unlinked it from the Phabricator diff, but it still won't land. This is an exact copy of that PR without using ghstack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91105
Approved by: https://github.com/ngimel
2022-12-20 02:38:23 +00:00
Natalia Gimelshein
a10b3ce876 generate device context managers in inductor code (#90934)
Fixes https://github.com/pytorch/torchdynamo/issues/1717, https://github.com/pytorch/torchdynamo/issues/1990

<s>TODO: add test with multiple devices, figure out extra context initialization</s>

Problems:
<s>It still initializes context on 0-th device that it shouldn't, I'll take a look where that happens and fix before landing</s>
It adds a Python device context manager that is absurdly slow and takes ~2.5 us (it should be nanoseconds). That's not a problem for real models, because it'll be called just once, but it is a bit of an inconvenience for microbenchmarking; we should make that context manager more performant (won't fix in this PR).
It can still have bugs for graphs that run on multiple devices, and buffers could be incorrectly shared between multiple devices by memory reuse; if that happens, it'll need to be solved separately.

Generated code:
```
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    with torch.cuda.device(1):
        buf0 = empty_strided((4, ), (1, ), device='cuda', dtype=torch.float32)
        stream1 = get_cuda_stream(1)
        triton_fused_div_0.run(arg0_1, arg1_1, buf0, 4, grid=grid(4), stream=stream1)
        del arg0_1
        del arg1_1
        return (buf0, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90934
Approved by: https://github.com/wconstab
2022-12-16 18:03:39 +00:00
chunyuan
2ba5c1d7c4 Inductor cpp wrapper: change inputs args from tuple to vector (#90754)
## Pitch
Change the input args type from `std::tuple` to `std::vector` to reduce compilation time.

## Description
`std::tie()` takes quite a long time during compilation when the number of input args grows.

For example, for a graph from the `PegasusForConditionalGeneration` model with 318 input args, compiling the `std::tie` for the args takes about 10s. After changing to `std::vector`, the compilation time of the arg assignment is reduced to less than 1s.

### Code before:
```cpp
at::Tensor call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    ...
    return buf0;
}
```

### Code after:
```cpp
at::Tensor call_0(std::vector<at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    arg0_1 = args[0];
    arg1_1 = args[1];
    ...
    return buf0;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90754
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-15 05:07:16 +00:00
chunyuan
fde5646f3d Inductor cpp wrapper: support bmm, mm, addmm extern call (#88667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88667
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-14 18:19:27 +00:00