Commit Graph

85 Commits

Author SHA1 Message Date
Animesh Jain
58d2c66a70 [activation checkpointing] Higher order functional rng op wrappers (#102934)
Introduces two higher order operators
* run_and_save_rng_state - Saves the current rng state and then runs the op.
* run_with_rng_state - Runs the op with the rng state supplied as an input

Ideally, we would like to use torch.compile for these operators. But currently the plan is to introduce these operators at the partitioner level, obviating the need to support them fully through the torch.compile stack. To ensure that we have good enough debugging with minifiers, we have ensured that they work with make_fx. In the future, we can move them to torch.compile.
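
The semantics are roughly "stash the RNG state alongside running the op" and "replay the op under a given RNG state", so a checkpointed region recomputes with identical randomness. A minimal Python sketch of that intent — the real operators are registered as higher-order ops inside PyTorch, so the signatures below are illustrative only:

```
import torch

def run_and_save_rng_state(op, *args, **kwargs):
    # Save the current RNG state, then run the op (CPU generator shown;
    # the real op also handles the CUDA generator).
    rng_state = torch.get_rng_state()
    return rng_state, op(*args, **kwargs)

def run_with_rng_state(rng_state, op, *args, **kwargs):
    # Run the op under the supplied RNG state, then restore the current one,
    # so recomputation reproduces the same randomness as the first run.
    saved = torch.get_rng_state()
    torch.set_rng_state(rng_state)
    try:
        return op(*args, **kwargs)
    finally:
        torch.set_rng_state(saved)
```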

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102934
Approved by: https://github.com/jansel, https://github.com/zou3519
2023-06-12 22:54:17 +00:00
Shunting Zhang
86c7652503 [inductor] layout optimization for conv (#99773)
A convolution kernel with channels-last inputs runs much faster than a kernel with contiguous inputs. This PR leverages that to optimize tensor layouts so we provide channels-last inputs to convolution. Some care needs to be taken not to convert tensor layouts back and forth between contiguous and channels last; those extra copies hurt performance quite a bit.

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shapes since there is a perf loss in that combination. Here is a GH issue to follow up: https://github.com/pytorch/pytorch/issues/102670
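
For reference, the layout the pass targets is what you would get in eager mode by converting both the weights and the activations to channels last (a sketch, independent of the inductor pass itself):

```
import torch

conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda().half()
x = torch.randn(32, 3, 224, 224, device="cuda", dtype=torch.half)

# NHWC ("channels last") lets the conv pick its faster kernels; the layout pass
# tries to arrange this automatically while avoiding repeated
# contiguous <-> channels_last copies.
conv = conv.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)
out = conv(x)
```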

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99773
Approved by: https://github.com/jansel
2023-06-02 21:08:18 +00:00
Michael Lazos
80f7264804 Foreach kernel codegen in inductor (#99975)
[design doc](https://docs.google.com/document/d/1JLr5yMAR8TuKW78ixKeqzfDHhcazwxKo_JXQnP_-wyY/edit?kh_source=GDOCS#heading=h.8x4z4mmet3im)

Add foreach kernel codegen for a single overload of foreach add in Inductor. Coverage will expand to more ops in subsequent PRs.

[example](https://gist.github.com/mlazos/9606fe64100ea2a5ec8265df1739fbe2)
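
For context, the eager op that the generated horizontally-fused kernel corresponds to is the `_foreach_add` overload taking two tensor lists (sketch):

```
import torch

xs = [torch.randn(1024, device="cuda") for _ in range(4)]
ys = [torch.randn(1024, device="cuda") for _ in range(4)]

# One fused kernel launch instead of len(xs) separate elementwise adds.
outs = torch._foreach_add(xs, ys)
```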

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99975
Approved by: https://github.com/jansel
2023-05-25 21:48:41 +00:00
Peter Bell
ce42010722 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-24 22:17:32 +00:00
PyTorch MergeBot
5147fe4969 Revert "[inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)"
This reverts commit b9721bd705.

Reverted https://github.com/pytorch/pytorch/pull/101812 on behalf of https://github.com/osalpekar due to Causing test_nn_cuda tests to crash during runtime. More details at [D46093942](https://www.internalfb.com/diff/D46093942) ([comment](https://github.com/pytorch/pytorch/pull/101812#issuecomment-1560238085))
2023-05-23 23:06:21 +00:00
Peter Bell
b9721bd705 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-22 20:39:18 +00:00
Jiong Gong
6f7ebcdcd8 [inductor] enable descriptive name for cpp kernels (#101330)
This PR enables descriptive names for cpp kernels, similar to the triton kernel names. A new configuration `config.cpp.descriptive_names` is added, similar to that of triton. The kernel name follows the format `cpp_<fused_name>_<id>`.
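
A sketch of toggling the new knob; the accepted values are assumed to mirror the existing triton setting:

```
import torch._inductor.config as inductor_config

# Emit C++ kernel names of the form cpp_<fused_name>_<id> instead of opaque ones.
# The value below is illustrative; it mirrors the triton descriptive_names option.
inductor_config.cpp.descriptive_names = "original_aten"
```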

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101330
Approved by: https://github.com/XiaobingSuper, https://github.com/jansel
2023-05-16 06:48:11 +00:00
chunyuan
1faef895ca Inductor cpp wrapper: support sympy.Expr as input (#101257)
Leverages the logic in https://github.com/pytorch/pytorch/pull/95533 to get the `dtype` of a `sympy.Expr` and supports it as a graph input in the cpp wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101257
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/jansel
2023-05-15 23:57:28 +00:00
Edward Z. Yang
2c786961b7 Towards making torch._inductor.ir typed (#100712)
This PR just contains some mild gyrations necessary to appease mypy.
However, it is not complete; there are a number of legitimate bugs
and mistyping that I need to work out before I can actually turn this
on.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100712
Approved by: https://github.com/ngimel
2023-05-12 00:07:33 +00:00
Jason Ansel
e3d783c013 [inductor] Cleanup strip_last_size logic (#100305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100305
Approved by: https://github.com/ngimel
2023-05-05 23:10:47 +00:00
Angela Yi
3c5ec6af14 Partition modules (#98628)
Added helper functions to match nodes in the graph that are decomposed from their source (leaf modules, or functional ops), as a result of dynamo tracing.

`get_source_partitions(graph: torch.fx.Graph, wanted_sources: List[Any]) -> Dict[Any, SourcePartition]`

Args:
* graph: The graph we want to partition
* wanted_sources: List of sources of nodes that were decomposed from this source. This can be a function (ex. torch.nn.functional.linear) or a leaf module type (ex. torch.nn.Linear)

Returns:
* Dictionary mapping sources (ex. torch.nn.modules.linear.Linear) to a list of SourcePartitions that correspond to the list of nodes that were flattened from a module of that type.

```
from dataclasses import dataclass, field
from typing import List, Type

from torch.fx import Node


@dataclass
class SourcePartition:
    # Nodes in a particular partition
    nodes: List[Node]
    # Module type
    module_type: Type
    # Nodes in the graph that are needed as inputs to the partition
    input_nodes: List[Node] = field(default_factory=list)
    # Nodes in the partition that are being used by nodes outside of the partition
    output_nodes: List[Node] = field(default_factory=list)
    # Parameters that are being used
    params: List[str] = field(default_factory=list)
```

Example:

Original:
```
x -> linear -> linear -> relu -> linear
```
Traced graph:
```
.graph():
    %arg0 : [#users=1] = placeholder[target=arg0]
    %_param_constant0 : [#users=1] = get_attr[target=_param_constant0]
    %t_default : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0,), kwargs = {})
    %_param_constant1 : [#users=1] = get_attr[target=_param_constant1]
    %addmm_default : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1, %arg0, %t_default), kwargs = {})
    %_param_constant0_1 : [#users=1] = get_attr[target=_param_constant0]
    %t_default_1 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0_1,), kwargs = {})
    %_param_constant1_1 : [#users=1] = get_attr[target=_param_constant1]
    %addmm_default_1 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1_1, %addmm_default, %t_default_1), kwargs = {})
    %relu_default : [#users=1] = call_function[target=torch.ops.aten.relu.default](args = (%addmm_default_1,), kwargs = {})
    %_param_constant2 : [#users=1] = get_attr[target=_param_constant2]
    %t_default_2 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant2,), kwargs = {})
    %_param_constant3 : [#users=1] = get_attr[target=_param_constant3]
    %addmm_default_2 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant3, %relu_default, %t_default_2), kwargs = {})
    return [addmm_default_2]
```
Result of `get_source_partitions`:
```
{<class 'torch.nn.modules.linear.Linear'>: [
    ModulePartition(nodes=[_param_constant0, t_default, _param_constant1, addmm_default], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[arg0], output_nodes=[addmm_default], params=["_param_constant0", "_param_constant1"]),
    ModulePartition(nodes=[_param_constant0_1, t_default_1, _param_constant1_1, addmm_default_1], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[addmm_default], output_nodes=[addmm_default_1], params=["_param_constant0_1", "_param_constant1_1"]),
    ModulePartition(nodes=[_param_constant2, t_default_2, _param_constant3, addmm_default_2], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[relu_default], output_nodes=[addmm_default_2], params=["_param_constant2", "_param_constant3"])],

 <class 'torch.nn.modules.activation.ReLU'>: [
    ModulePartition(nodes=[relu_default], module_type=<class 'torch.nn.modules.activation.ReLU'>, input_nodes=[addmm_default_1], output_nodes=[relu_default], params=[])]}
```

Also added helper function to check if two module partitions are connected:
`check_subgraphs_connected(subgraph1: SourcePartition, subgraph2: SourcePartition) -> bool`
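
A small usage sketch; the import path is an assumption (the helpers may live under a different module in this release):

```
import torch
import torch._dynamo
from torch.fx.passes.utils.source_matcher_utils import (
    get_source_partitions,
    check_subgraphs_connected,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = torch.nn.Linear(8, 8)
        self.lin2 = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.lin2(self.lin1(x))

gm, _ = torch._dynamo.export(M(), torch.randn(2, 8))
parts = get_source_partitions(gm.graph, [torch.nn.Linear])
lin_parts = parts[torch.nn.Linear]  # one SourcePartition per traced Linear
print(check_subgraphs_connected(lin_parts[0], lin_parts[1]))  # first linear feeds the second
```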

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98628
Approved by: https://github.com/cccclai
2023-05-03 23:31:56 +00:00
PyTorch MergeBot
34e90b8df1 Revert "[inductor] Cleanup strip_last_size logic (#100305)"
This reverts commit de7793d577.

Reverted https://github.com/pytorch/pytorch/pull/100305 on behalf of https://github.com/jansel due to causes IMA errors on huggingface ([comment](https://github.com/pytorch/pytorch/pull/100305#issuecomment-1532317310))
2023-05-03 00:42:48 +00:00
Jason Ansel
de7793d577 [inductor] Cleanup strip_last_size logic (#100305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100305
Approved by: https://github.com/ngimel
2023-05-02 23:46:26 +00:00
Edward Z. Yang
f093ee1722 Prevent Triton from getting eagerly imported when importing torch._inductor (#100374)
This makes 'import torch._inductor.utils' go from 3.5s to 2.1s

See also https://github.com/openai/triton/issues/1599
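
The change boils down to deferring the `import triton` until something actually needs it; a sketch of the pattern (not the exact helper):

```
import functools

@functools.lru_cache(None)
def has_triton() -> bool:
    # Importing triton is expensive, so only do it lazily and cache the answer.
    try:
        import triton  # noqa: F401
        return True
    except ImportError:
        return False
```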

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100374
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:12 +00:00
Yanbo Liang
08376cc546 [Inductor] Fix rand_like with kwargs device of str type (#99673)
Fixes #99632
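
The failing pattern is roughly the following (a sketch of the repro, not the exact test from the issue):

```
import torch

@torch.compile
def f(x):
    # Passing the device as a string keyword argument used to break the lowering.
    return torch.rand_like(x, device="cuda")

f(torch.randn(8, device="cuda"))
```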

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99673
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-21 20:33:14 +00:00
Shunting Zhang
418a9fb9d8 [reland][inductor] coordinate descent tuning upon max-autotune (#99594)
Reland https://github.com/pytorch/pytorch/pull/97203 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99594
Approved by: https://github.com/jansel
2023-04-20 19:55:52 +00:00
PyTorch MergeBot
4aedb8e116 Revert "[inductor] coordinate descent tuning upon max-autotune (#97203)"
This reverts commit 52ecc3274b.

Reverted https://github.com/pytorch/pytorch/pull/97203 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it breaks MacOS test in trunk
2023-04-19 02:33:02 +00:00
Shunting Zhang
52ecc3274b [inductor] coordinate descent tuning upon max-autotune (#97203)
Command to run max autotune baseline:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```

Command to do coordinate descent autotuning:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```

Explanation of the environment variables used in the commands:
```
- TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 : enable coordinate descent tuning
- TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 : disable persistent reductions. We need to do this so we can tune RBLOCK for reductions
- TORCHINDUCTOR_MAX_AUTOTUNE=1: enable max autotune
- TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc : use a separate cache dir for coordinate descent tuning. Optional.
```

Here are my experiment results for around 40 torchbench models: https://docs.google.com/spreadsheets/d/1G7i2whIf8Yu-HhN_WovNxwcE-iFDSAw4x3NK4uL4XhI/edit#gid=0

Some highlights
- We improve 2.2% further upon max-autotune on average (geomean)
- timm_resnest benefits the most from coordinate descent tuning. There is a 1.07x speedup
- We get a decent speedup on transformer models
  - BERT_pytorch:  1.056x
  - timm_vision_transformer: 1.04x
  - hf_Bert: 1.030x
- For resnet models, it looks like we gain less as the model gets larger. My guess is that larger models spend more time on mm/conv, so our tuning for pointwise/reduction helps less
  - resnet18: 1.021x
  - resnet50: 1.014x
  - resnet152: 1.005x

This kind of coordinate descent autotuning can give us an 'upper bound' on the gain we can get from tuning configs for pointwise/reduction kernels. On the other hand, by spot checking, we roughly double the compilation time compared to max-autotune. Next steps can be:
- disable persistent reductions in coordinate descent autotuning (they are still enabled in the baseline) so we can tune RBLOCK for reductions. We can also try to use autotuning to decide whether to use a persistent reduction or not.
- pick good configs without benchmarking (e.g. Natalia mentioned checking register spills)
- try the idea on matmul so we know what the potential is there.
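
For intuition, coordinate descent tuning walks one tuning knob at a time and keeps a change only if it benchmarks faster; a minimal sketch of that idea (not the inductor implementation):

```
def coordinate_descent_tune(benchmark, config, candidates):
    """benchmark(config) -> latency in ms; candidates maps each knob
    (e.g. 'XBLOCK', 'RBLOCK', 'num_warps') to alternative values to try."""
    best_time = benchmark(config)
    improved = True
    while improved:
        improved = False
        for knob, values in candidates.items():
            for value in values:
                trial = dict(config, **{knob: value})
                t = benchmark(trial)
                if t < best_time:  # keep the change only if it is faster
                    best_time, config = t, trial
                    improved = True
    return config, best_time
```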

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97203
Approved by: https://github.com/ngimel
2023-04-19 00:17:10 +00:00
Shunting Zhang
694ed70e01 [inductor][easy] create a wrap for triton do_bench function (#99216)
The triton PR https://github.com/openai/triton/pull/1513 changes the interface of the do_bench function. The quantile field's name is changed from 'percentiles' to 'quantiles' and its default value is changed from (0.5, 0.2, 0.8) to None. This breaks some inductor code, since a caller expecting a tuple may get a single item.

Add a wrapper to maintain the same behavior for inductor.
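
A minimal sketch of such a wrapper (the actual helper in `torch/_inductor/utils.py` may differ):

```
import inspect
import triton

def do_bench(fn, **kwargs):
    # Newer triton renamed 'percentiles' to 'quantiles' and defaults it to None;
    # always request the old (0.5, 0.2, 0.8) tuple so callers keep getting a tuple.
    if "quantiles" in inspect.signature(triton.testing.do_bench).parameters:
        kwargs.setdefault("quantiles", (0.5, 0.2, 0.8))
    else:
        kwargs.setdefault("percentiles", (0.5, 0.2, 0.8))
    return triton.testing.do_bench(fn, **kwargs)
```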

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99216
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-18 00:52:00 +00:00
Kazuaki Ishizaki
f011db345f Fix typos under torch/_inductor directory (#97592)
This PR fixes typos in comments and messages of `.py` files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97592
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-04-10 22:53:18 +00:00
Jason Ansel
8fee46693c Fused attention patterns (#97741)
Patterns based on https://github.com/pytorch/pytorch/pull/94729 mainly as a forcing function for implementing joint graph replacements.

Up until now, we had two places to do pattern matching:
1) Pre-grad has janky infra (the graph is not normalized or functional), but is
   desirable for many types of passes where you want your change to
   affect grad formulas.
2) Post-grad has good infra, but can't change grad formulas.

This PR adds a third place to do pattern matching: the joint
forward+backwards graph.  The idea is to take the patterns and lower
them to a joint graph and replace both the forwards+backwards before
we partition them.  This allows us to do something similar to pre-grad
transforms, but run after normalization and functionalization.

Note that we don't seem to have kernels for all of these patterns, some get decomposed in the dispatcher.
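
The mechanics are similar in spirit to FX subgraph rewriting; a toy sketch using the public `torch.fx.subgraph_rewriter` API rather than the inductor-internal pattern matcher (the pattern and replacement below are purely illustrative):

```
import torch
import torch.fx
from torch.fx import subgraph_rewriter

def pattern(q, k):
    # Subgraph to search for.
    return torch.softmax(q @ k.transpose(-2, -1), dim=-1)

def replacement(q, k):
    # Stand-in "fused" form spliced in where the pattern matched.
    return torch.ops.aten._softmax(q @ k.transpose(-2, -1), -1, False)

def toy(q, k):
    return torch.softmax(q @ k.transpose(-2, -1), dim=-1)

gm = torch.fx.symbolic_trace(toy)
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(gm.code)
```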

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97741
Approved by: https://github.com/Chillee
2023-04-10 00:35:22 +00:00
Yu Guo
edebe413d3 [inductor] fix scatter fallback and fallback in deterministic mode (#98339)
Fixes https://github.com/pytorch/pytorch/issues/93537

Adds `ir.ScatterFallback` to correctly handle the mutation in the scatter/scatter_reduce fallback, also handles the case where `src` is a scalar, and lastly falls back in deterministic mode.
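
The deterministic-mode path is gated on the usual global switch; a sketch of the kind of code it affects:

```
import torch

# With deterministic algorithms requested, the inductor lowering now falls back
# to the ATen scatter kernels so behavior matches eager mode.
torch.use_deterministic_algorithms(True)

@torch.compile
def f(dst, index, src):
    return dst.scatter(0, index, src)

dst = torch.zeros(4, 8)
index = torch.randint(0, 4, (4, 8))
src = torch.randn(4, 8)
print(f(dst, index, src))
```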

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98339
Approved by: https://github.com/jansel
2023-04-06 19:43:17 +00:00
Yanbo Liang
ccc27bc361 [Inductor] Fix convolution lowering if stride or padding or dilation is 1 element list (#98448)
Fixes error from 14k github models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98448
Approved by: https://github.com/ngimel
2023-04-06 10:40:06 +00:00
Shunting Zhang
13461e9767 [inductor] more cuda metrics in wrapper (#97723)
The following metrics should be helpful:
- percent of time GPU is busy
- percent of time various category of kernels (e.g. pointwise/reduction triton kernel) takes
- percent of time each individual kernel takes compared to total wall time of the benchmark

This PR adds them.

Example result from the hf_Bert inference graph:

```
  == triton_pointwise category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_poi_fused_gelu_6_0d1d                  0.48154  12.0     5.52%
triton_poi_fused_clone_1_0d1d2                0.29011  24.0     3.33%
triton_poi_fused_clone_2_0d1d2                0.17417  12.0     2.00%
triton_poi_fused_clone_4_0d1d2                0.10797  12.0     1.24%
Total                                         1.05379           12.08%

  == triton_persistent_reduction category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_per_fused__softmax__to_                0.97188  12.0     11.14%
triton_per_fused_add_native_la                0.37401  24.0     4.29%
triton_per_fused_gelu_native_l                0.02     1.0      0.23%
triton_per_fused_add_embedding                0.01718  1.0      0.20%
Total                                         1.38307           15.86%

  == unknown category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
ampere_fp16_s16816gemm_fp16_12                2.24514  24.0     25.74%
ampere_fp16_s16816gemm_fp16_25                1.39796  49.0     16.03%
void cutlass::Kernel<cutlass_8                1.36093  1.0      15.61%
ampere_fp16_s16816gemm_fp16_64                0.74591  12.0     8.55%
ampere_fp16_s16816gemm_fp16_12                0.61989  12.0     7.11%
Memset (Device)                               0.024    12.0     0.28%
void at::native::(anonymous na                0.01543  2.03     0.18%
void at::native::vectorized_el                0.00011  0.03     0.00%
Total                                         6.40937           73.49%

Percent of time when GPU is busy: 101.44%
```

Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled but measure GPU time with profiling enabled, which may distort the measurement a bit. But I assume the effect is not too large, since the profiler should mostly increase CPU time (rather than GPU time).

## interesting usages
1. I pick a model where cudagraphs improves perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```

Its backward graph has a smaller percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```

2. I profile a subset of torchbench models and plot a table showing the percentage of execution time for pointwise/reduction/persistent_reduction/unknown_category kernels. Since I plan to explore using the coordinate descent tuner to improve reductions, models with a high percentage of time spent on reductions should be good candidates (e.g. resnet50, mobilenet_v2).

NOTE: each model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile different graphs for a model separately.

```
benchmark_name           pointwise_percent    reduction_percent    persistent_reduction_percent    unknown_category_percent    GPU_busy_percent    wall_time_ms
-----------------------  -------------------  -------------------  ------------------------------  --------------------------  ------------------  --------------
resnet18                 19.73%               7.86%                4.81%                           41.25%                      73.65%              2.549ms
resnet18                 18.59%               7.13%                3.35%                           67.35%                      96.41%              3.467ms
resnet50                 29.57%               22.13%               2.07%                           51.68%                      105.46%             6.834ms
resnet50                 26.42%               15.27%               0.94%                           59.68%                      102.31%             13.346ms
vgg16                    26.23%               0.00%                0.00%                           74.20%                      100.43%             18.212ms
vgg16                    15.63%               5.61%                0.10%                           79.42%                      100.75%             33.485ms
BERT_pytorch             28.62%               4.82%                14.88%                          33.32%                      81.64%              7.162ms
BERT_pytorch             14.43%               13.41%               18.19%                          49.24%                      95.27%              10.395ms
densenet121              11.89%               2.14%                3.86%                           16.36%                      34.25%              16.531ms
densenet121              10.37%               2.06%                4.09%                           31.46%                      47.98%              16.934ms
hf_Bert                  23.94%               0.00%                29.88%                          46.09%                      99.90%              7.766ms
hf_Bert                  11.65%               10.54%               20.26%                          61.66%                      104.11%             11.892ms
nvidia_deeprecommender   42.92%               0.00%                0.00%                           56.75%                      99.67%              3.476ms
nvidia_deeprecommender   31.36%               3.44%                0.46%                           65.20%                      100.45%             3.872ms
alexnet                  30.99%               0.00%                0.00%                           69.16%                      100.14%             3.169ms
alexnet                  24.41%               4.83%                0.17%                           71.09%                      100.50%             4.709ms
mobilenet_v2             29.21%               27.79%               2.49%                           44.00%                      103.49%             10.160ms
mobilenet_v2             17.50%               15.05%               1.06%                           69.68%                      103.29%             20.715ms
resnext50_32x4d          18.96%               9.28%                2.31%                           28.79%                      59.33%              5.899ms
resnext50_32x4d          18.48%               11.01%               1.86%                           53.80%                      85.14%              7.167ms
mnasnet1_0               19.07%               14.52%               3.01%                           35.43%                      72.03%              6.028ms
mnasnet1_0               14.17%               12.00%               1.87%                           67.56%                      95.60%              9.225ms
squeezenet1_1            38.56%               0.00%                1.77%                           56.21%                      96.53%              2.221ms
squeezenet1_1            21.26%               7.57%                1.05%                           67.30%                      97.18%              4.942ms
timm_vision_transformer  17.05%               0.00%                18.80%                          65.79%                      101.64%             9.608ms
timm_vision_transformer  9.31%                9.07%                10.32%                          73.25%                      101.96%             16.814ms
```

## how to use
`python {compiled_module_wrapper.py} -p`
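
For reference, the GPU-busy percentage can be derived roughly like this with `torch.profiler` (a sketch; the wrapper's own implementation may differ):

```
import time
import torch
from torch.profiler import profile, ProfilerActivity

def gpu_busy_percent(fn, iters=10):
    # Wall time is measured without profiling...
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    wall_ms = (time.time() - start) * 1e3 / iters

    # ...while GPU time is measured with profiling enabled, which may skew it a bit.
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn()
        torch.cuda.synchronize()
    gpu_ms = sum(e.self_cuda_time_total for e in prof.key_averages()) / 1e3 / iters
    return 100.0 * gpu_ms / wall_ms
```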

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
2023-04-01 08:04:14 +00:00
Yu Guo
1f71ac785c [RFC][inductor][index_put] fallback to aten in torch deterministic mode (#96898)
Fixes #93537
Falls back to aten for index_put and scatter ops in torch deterministic mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96898
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-29 19:28:37 +00:00
Shunting Zhang
652592efa9 [inductor] use torch.profiler in the triton wrapper (#97405)
I think it's helpful to use torch.profiler to profile the triton wrapper.

E.g., I tried it for nvidia_deeprecommender's inference graph.

Even with max-autotune, we see that the majority of the time the GPU is running 2 mm/addmm ops. That's why max-autotune does not help for this model, since tuning does not affect the external mm ops.

<img width="711" alt="Screenshot 2023-03-22 at 5 49 28 PM" src="https://user-images.githubusercontent.com/52589240/227072474-2f0d7205-4a10-4929-b1b7-551214788c61.png">

Next step: I'll check why the triton mm kernels are not picked.

EDIT: the above screenshot is captured without max-autotune due to a typo. below is the trace with max-autotune enabled:
<img width="712" alt="Screenshot 2023-03-22 at 6 43 26 PM" src="https://user-images.githubusercontent.com/52589240/227077624-fdccf928-be08-4211-871b-a9e3d7b76fbe.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97405
Approved by: https://github.com/ngimel
2023-03-27 21:54:25 +00:00
Christian Puhrsch
9d37cefcb0 Resubmit _int_mm (#96685)
Avoids any changes to gemm_and_bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96685
Approved by: https://github.com/drisspg, https://github.com/ngimel
2023-03-27 16:14:07 +00:00
blzheng
39c8188194 Inductor: fall back bernoulli on cpu (#97002)
data type: float32
Input size: torch.Size([64, 4, 128, 128])
single socket (32 cores):
```
Before: bernoulli 0.001327775239944458 s      dropout 0.0014216173489888509 s
After:  bernoulli 0.0002424612840016683 s     dropout 0.00039757410685221353 s
```

single core:
```
Before: bernoulli 0.04154032731056213 s      dropout 0.04382548745473226 s
After: bernoulli 0.006143261671066284 s      dropout 0.0065830423831939695 s
```
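
A rough sketch of how such timings might be collected (not the exact benchmark script used above):

```
import time
import torch

x = torch.randn(64, 4, 128, 128)  # float32, same shape as above

@torch.compile
def drop(t):
    return torch.nn.functional.dropout(t, p=0.5, training=True)

drop(x)  # warm up / trigger compilation
start = time.time()
for _ in range(100):
    drop(x)
print("dropout:", (time.time() - start) / 100, "s")
```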

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97002
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-24 22:13:51 +00:00
Jason Ansel
9370f253e3 [inductor] Rewrite convolution triton templates (#95556)
Fixes #95775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95556
Approved by: https://github.com/Chillee, https://github.com/ngimel
2023-03-22 18:12:23 +00:00
Horace He
6dded5d63e Fixes warning to refer to SMs instead of Cuda Cores (#97224)
Fixes https://github.com/pytorch/pytorch/issues/97179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97224
Approved by: https://github.com/eellison, https://github.com/voznesenskym
2023-03-21 22:37:31 +00:00
Bin Bao
ea9194a4f2 [inductor] Make the original ATen info dumped in alphabetical order (#97261)
Summary: To avoid a lot of noise when comparing output_code.py from two runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97261
Approved by: https://github.com/Chillee
2023-03-21 20:34:49 +00:00
Shunting Zhang
13398d8b95 [inductor] improve bandwidth computation (#97057)
When we compute bandwidth for a kernel, we should double the memory usage for in-place arguments, since we need to read them once and write them once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97057
Approved by: https://github.com/Chillee
2023-03-20 20:30:46 +00:00
Shunting Zhang
8ce296ae2c [ez][inductor] show kernel category in kernel benchmark result (#96991)
I feel it's useful to show whether a kernel is pointwise/reduction/persistent_reduction in the benchmark output. Only print the uppercase first 3 letters to avoid wrapping the line:
- POI for pointwise
- RED for reduction
- PER for persistent_reduction

<img width="1091" alt="Screenshot 2023-03-16 at 5 10 21 PM" src="https://user-images.githubusercontent.com/52589240/225780546-07b8d345-2bbe-40bd-9e65-185e9294743e.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96991
Approved by: https://github.com/Chillee
2023-03-17 17:02:43 +00:00
Christian Puhrsch
0a53c9624a Back out "Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)" (#96885)
Summary:
Backing out  _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96885
Approved by: https://github.com/drisspg
2023-03-16 05:32:55 +00:00
Zachary DeVito
3162f71787 [memory debugging] Extract frame information from inductor (#95753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95753
Approved by: https://github.com/Chillee
2023-03-16 04:12:54 +00:00
Edward Z. Yang
3606f59366 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-16 02:54:18 +00:00
Yanbo Liang
e7d795dccd [Inductor] aten.{avg_pool2d/max_pool2d_with_indices} arguments can be 1 element tuple (#96727)
Fixes failure from 14k github models: ```pytest ./generated/test_ProGamerGov_neural_dream.py -k test_000```
Error:
```
......
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/graph.py", line 357, in call_function
    raise LoweringException(e, target, args, kwargs).with_traceback(
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/graph.py", line 354, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/lowering.py", line 228, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/lowering.py", line 3124, in avg_pool2d
    assert len(padding) == 2
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AssertionError:
  target: aten.avg_pool2d.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda', torch.float32, size=[4, 4, 64, 64], stride=[16384, 4096, 64, 1]))
  ))
  args[1]: [7, 7]
  args[2]: [1, 1]
  args[3]: [0]
  args[4]: False
  args[5]: False

```
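
A rough repro of the failing shape of call (eager mode accepts a 1-element padding list, but the lowering asserted `len(padding) == 2`):

```
import torch
import torch.nn.functional as F

@torch.compile
def pool(x):
    # padding given as a 1-element list, matching args[3] in the trace above
    return F.avg_pool2d(x, kernel_size=7, stride=1, padding=[0])

pool(torch.randn(4, 4, 64, 64, device="cuda"))
```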

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96727
Approved by: https://github.com/jansel
2023-03-14 21:34:30 +00:00
PyTorch MergeBot
ba4fb9b6ad Revert "Default specialize_int to False (#96624)"
This reverts commit 1ac8782db2.

Reverted https://github.com/pytorch/pytorch/pull/96624 on behalf of https://github.com/kit1980 due to Broke inductor/test_torchinductor_dynamic_shapes.py
2023-03-14 19:43:47 +00:00
Edward Z. Yang
1ac8782db2 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-14 18:37:47 +00:00
Horace He
2a08a62777 Add extra metadata (as comments) to Inductor generated code (#96581)
New output
<img width="942" alt="image" src="https://user-images.githubusercontent.com/6355099/224794006-a993a2a8-d6ff-49da-8891-7b2373030a3d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96581
Approved by: https://github.com/ngimel, https://github.com/shunting314, https://github.com/voznesenskym
2023-03-14 03:59:59 +00:00
Nicolas Macchioni
f673ad6d5c Add a new knob to separately enable the autotuning in Triton. (#96440)
Summary: separate triton pointwise autotune from matmul autotune, work done by ckluk

Test Plan: sandcastle + CI

Differential Revision: D43955699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96440
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-13 19:09:27 +00:00
Shunting Zhang
9aa216cb46 reland #96249: [inductor] show more kernel specific metrics in the benchmark result (#96461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96461
Approved by: https://github.com/ngimel
2023-03-10 06:18:21 +00:00
Shunting Zhang
cc699c56dc reland #96248 [inductor] show performance for each autotune config for a kernel (#96458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96458
Approved by: https://github.com/ngimel
2023-03-10 01:40:04 +00:00
Natalia Gimelshein
05b679ce6a [inductor] don't match indirect indexing in fusion (#96273)
Fixes #96064

When deciding whether to fuse nodes, we match indexing like `c0 + 5 * tmp0`, but `tmp0` in the different nodes can refer to totally different values. Even when `tmp0` is the same (like in the added test), inductor still generates wrongly ordered loads and stores (loads come before stores), so it's better to just disable this fusion altogether. We should also fix the wrong ordering:
```
@pointwise(size_hints=[8], filename=__file__, meta={'signature': {0: '*i64', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': ['out_ptr0'], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, out_ptr1, xnumel, XBLOCK : tl.constexpr):
    xnumel = 5
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0_load = tl.load(in_ptr0 + (0))
    tmp0 = tl.broadcast_to(tmp0_load, [XBLOCK])
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tl.load(out_ptr0 + (x0 + (5*tmp0)), xmask)
    tl.store(out_ptr0 + (x0 + (5*tmp0) + tl.zeros([XBLOCK], tl.int32)), tmp1, xmask)
    tl.store(out_ptr1 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)
```
Note: we are loading from `out_ptr0` here (which shouldn't happen), and we load from it before storing to it.
After this PR, the kernel above is split in 2.
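
At the user level, the problematic shape of program is roughly an indirect (data-dependent) store followed by a use of the same buffer, e.g. (a sketch, not the exact test added in the PR):

```
import torch

@torch.compile
def f(buf, idx, src):
    # The store indexes buf through a loaded value (indirect indexing); fusing it
    # with the subsequent read can emit the load before the store.
    buf[idx] = src
    return buf + 1
```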

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96273
Approved by: https://github.com/jansel
2023-03-09 23:03:46 +00:00
Horace He
5bbec680d7 Fix usages of contextmanager without finally (#96170)
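For context, the bug class being fixed looks like this generic pattern (a sketch, not the specific call sites): a `@contextmanager` whose cleanup is skipped if the body raises, unless the `yield` is wrapped in `try`/`finally`.

```
import contextlib

@contextlib.contextmanager
def patched(obj, attr, value):
    saved = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        # Without the finally, an exception inside the with-block would skip
        # this restore and leave global state permanently patched.
        setattr(obj, attr, saved)
```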
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96170
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-03-08 20:59:27 +00:00
Horace He
30237e7aec Provide more informative kernel names in Inductor (#95940)
Before: `triton_fused_add_83_add_84_relu_13_squeeze_46_var_mean_15_14`
After: `triton_fused__native_batch_norm_legit_functional_convolution_relu_14`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95940
Approved by: https://github.com/SherlockNoMad, https://github.com/ngimel, https://github.com/jansel
2023-03-07 18:02:10 +00:00
Jason Ansel
95d17dc93d [inductor] Reland #95567 part 1 (#96023)
This is the non-problematic part of #95567.  The errors were coming from
IR printing changes which will be next in the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96023
Approved by: https://github.com/ngimel, https://github.com/mlazos
2023-03-06 22:57:22 +00:00
Shunting Zhang
962b3f78bd [inductor] run all kernel benchmarks individually in a compiled module (#95845)
This is a follow-up to PR #95506 to run all the triton kernels in a compiled module individually, as suggested by Horace.

Here are the steps:
1. Run the model as usual with a benchmark script and with TORCHINDUCTOR_BENCHMARK_KERNEL enabled. e.g.
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --disable-cudagraphs --training
```
2. From the output we will see 3 lines like
```
Compiled module path: /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py
```
That's because we have one graph module each for fwd/bwd/optimizer. Each graph module will have one such output corresponding to its compiled module.

3. We can run the compiled module directly. Without any extra arguments, we just maintain the previous behavior of running the call function -- which does what the original graph module does but in a more efficient way. But if we add the '-k' argument, we will run a benchmark for each individual kernel in the file.

```
python /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py -k
```

Example output:
<img width="430" alt="Screenshot 2023-03-01 at 4 51 06 PM" src="https://user-images.githubusercontent.com/52589240/222302996-814a85be-472b-463c-9e85-39d2c9d20e1a.png">

Note: I use the first 10 characters of the hash to identify each kernel since
1. the hash is easier to get in the code :)
2. a name like `triton__3` only makes sense within a compiled module, but a hash can make sense even without specifying the compiled module (assuming we have enough bytes for the hash)

If we find a triton kernel with a hash like c226iuf2wi having poor performance, we can look it up in the original compiled module file. This works since we comment each compiled triton kernel with its full hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95845
Approved by: https://github.com/Chillee
2023-03-06 21:30:33 +00:00
Horace He
e8cd173aae Fix node provenance tracking (#95901)
Before:
```
triton_fused_add_83_add_84_convolution_15_relu_12_relu_13_squeeze_46_var_mean_15_14
```

After:
```
triton_fused_add_83_add_84_relu_13_squeeze_46_var_mean_15_14
```

For this kernel
```
@persistent_reduction(
    size_hints=[512, 64],
    reduction_hint=ReductionHint.INNER,
    filename=__file__,
    meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32', 6: '*fp32', 7: '*fp32', 8: '*fp32', 9: '*fp32', 10: 'i32', 11: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': ['in_out_ptr0'], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10), equal_to_1=())]}
)
@triton.jit
def triton_(in_out_ptr0, in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, out_ptr0, out_ptr2, out_ptr3, out_ptr4, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    xnumel = 512
    rnumel = 49
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (49*x0)), rmask & xmask, other=0)
    tmp8 = tl.load(in_ptr1 + (x0), xmask)
    tmp22 = tl.load(in_ptr2 + (x0), xmask)
    tmp24 = tl.load(in_ptr3 + (x0), xmask)
    tmp30 = tl.load(in_ptr4 + (x0), xmask)
    tmp2 = tl.where(rmask & xmask, tmp0, 0)
    tmp3 = tl.sum(tmp2, 1)[:, None]
    tmp4 = 49.0
    tmp5 = tmp3 / tmp4
    tmp6 = 0.1
    tmp7 = tmp5 * tmp6
    tmp9 = 0.9
    tmp10 = tmp8 * tmp9
    tmp11 = tmp7 + tmp10
    tmp12 = tmp0 - tmp5
    tmp13 = tmp12 * tmp12
    tmp15 = tl.where(rmask & xmask, tmp13, 0)
    tmp16 = tl.sum(tmp15, 1)[:, None]
    tmp17 = tmp16 / tmp4
    tmp18 = 1e-05
    tmp19 = tmp17 + tmp18
    tmp20 = tl.libdevice.rsqrt(tmp19)
    tmp21 = tmp12 * tmp20
    tmp23 = tmp21 * tmp22
    tmp25 = tmp23 + tmp24
    tmp26 = tl.where(0 != 0, 0, tl.where(0 > tmp25, 0, tmp25))
    tmp27 = 1.0208333333333333
    tmp28 = tmp17 * tmp27
    tmp29 = tmp28 * tmp6
    tmp31 = tmp30 * tmp9
    tmp32 = tmp29 + tmp31
    tl.store(in_out_ptr0 + (x0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp5, xmask)
    tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp11, xmask)
    tl.store(out_ptr2 + (r1 + (49*x0) + tl.zeros([XBLOCK, RBLOCK], tl.int32)), tmp26, rmask & xmask)
    tl.store(out_ptr3 + (x0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp20, xmask)
    tl.store(out_ptr4 + (x0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp32, xmask)
```

Tbh this still isn't super great provenance tracking, since ops like layernorms are decomposed. I might add some extra provenance tracking during decompositions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95901
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-03-05 21:52:48 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00