Commit Graph

44 Commits

Author SHA1 Message Date
PyTorch MergeBot
def4959662 Revert "[inductor] allow mm template to accumulate with float16 dtype (#117479)"
This reverts commit a7fbbc2a4a.

Reverted https://github.com/pytorch/pytorch/pull/117479 on behalf of https://github.com/PaliC due to breaking tests internally ([comment](https://github.com/pytorch/pytorch/pull/117479#issuecomment-1899032973))
2024-01-18 18:53:37 +00:00
Guoliang He
a7fbbc2a4a [inductor] allow mm template to accumulate with float16 dtype (#117479)
Fixes #108621

Replaces #108637 and #108982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117479
Approved by: https://github.com/jansel
2024-01-17 21:01:14 +00:00
Jiong Gong
715d663794 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 21:21:10 +00:00
PyTorch MergeBot
66994bca5f Revert "[inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)"
This reverts commit 653acd8fe1.

Reverted https://github.com/pytorch/pytorch/pull/115479 on behalf of https://github.com/desertfire due to will cause land race in fbcode because https://github.com/pytorch/pytorch/pull/115831 is already landed internally ([comment](https://github.com/pytorch/pytorch/pull/115479#issuecomment-1857979948))
2023-12-15 14:35:40 +00:00
Jiong Gong
653acd8fe1 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 04:04:08 +00:00
Bin Bao
5a96a42cea [AOTI] Improve the two-pass wrapper codegen (#114067)
Summary: For the second-pass, we don't have to rerun the whole inductor flow again. This PR moves that second-pass to the codegen time. This change not only speeds up the compilation, but also removes kernel scheduling inconsistency between the two passes. Another future improvement is to make the second-pass reuse the scheduler and do the wrapper codegen only.

This is a copy of https://github.com/pytorch/pytorch/pull/113762 to land in github first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114067
Approved by: https://github.com/chenyang78
2023-11-19 23:30:36 +00:00
PyTorch MergeBot
1e60174891 Revert "[dynamo] Add run_inductor_tests entrypoint (#113278)"
This reverts commit b00311ce9e.

Reverted https://github.com/pytorch/pytorch/pull/113278 on behalf of https://github.com/huydhn due to Sorry for reverting your stack, but it is failing to list test internally with buck2 ([comment](https://github.com/pytorch/pytorch/pull/113278#issuecomment-1811646325))
2023-11-15 01:19:48 +00:00
Jason Ansel
b00311ce9e [dynamo] Add run_inductor_tests entrypoint (#113278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113278
Approved by: https://github.com/yanboliang
2023-11-11 08:54:43 +00:00
Sam Larsen
4a09ed5459 [inductor] Parallelize Max Autotune step 2: Use multiple GPUs (#109127)
Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109127
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #109126
2023-09-14 00:37:39 +00:00
Ying Zhang
b2d764ece0 [Inductor CUTLASS backend] Step 3: autotune_process, and CUDABenchmarkRequest (#107901)
This is step 3 of adding CUTLASS as an alternative inductor backend.
Full tests can be found in the last PR in the stack.

Feature request: https://github.com/pytorch/pytorch/issues/106991.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107901
Approved by: https://github.com/jansel, https://github.com/aakhundov, https://github.com/kadeng
ghstack dependencies: #107802, #107847
2023-09-12 17:44:36 +00:00
PyTorch MergeBot
c36c2bfcb2 Revert "[inductor] Parallelize Max Autotune step 2: Use all GPUs (#107983)"
This reverts commit 2c61313ff3.

Reverted https://github.com/pytorch/pytorch/pull/107983 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/107983#issuecomment-1714816358))
2023-09-12 01:08:08 +00:00
Sam Larsen
2c61313ff3 [inductor] Parallelize Max Autotune step 2: Use all GPUs (#107983)
Summary: Step 2 in revamping subprocess autotune to support multiple GPUs: use a pool of subprocesses and distribute benchmark calls across them.
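
As a rough illustration of the "pool of workers, one per device" idea described above, here is a minimal sketch. It is not the code from this PR: `fake_benchmark` and `benchmark_all` are hypothetical names, the timing is a deterministic placeholder, and threads stand in for the subprocesses so the sketch stays self-contained.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def fake_benchmark(choice: str, device: int) -> float:
    # Hypothetical placeholder: a real autotuner would compile and time
    # the kernel choice on the given device inside a subprocess.
    return float(len(choice))


def benchmark_all(choices, device_ids):
    """Distribute benchmark calls across one worker per device."""
    free_devices = queue.Queue()
    for d in device_ids:
        free_devices.put(d)

    def run(choice):
        device = free_devices.get()      # claim an idle device
        try:
            return choice, device, fake_benchmark(choice, device)
        finally:
            free_devices.put(device)     # hand the device back to the pool

    # Threads stand in for subprocesses (assumption made for brevity).
    with ThreadPoolExecutor(max_workers=len(device_ids)) as pool:
        return list(pool.map(run, choices))


results = benchmark_all(["mm_a", "mm_bb", "mm_ccc"], device_ids=[0, 1])
timings = {choice: t for choice, _, t in results}
best = min(timings, key=timings.get)
```

Each worker claims a device from the queue before benchmarking and returns it afterwards, so no more than one benchmark runs per device at a time.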

Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107983
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #107982
2023-09-10 15:43:03 +00:00
constroy
0578732bc3 [inductor] fix duplicate arg handling in triton templates (#105315)
Fixes #105212

De-duplicate kernel args in codegen and autotuning of `torch.mm` and `torch.bmm`.
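
A minimal sketch of what de-duplicating kernel arguments can look like, for instance when the same buffer is passed as both operands of `torch.mm(x, x)`. The helper `dedupe_args` and the buffer names are hypothetical and only illustrate the idea; they are not taken from the PR.

```python
def dedupe_args(arg_names):
    """Return the unique args in first-seen order, plus the index into
    that unique list for every original argument position."""
    unique = []
    index_of = {}
    positions = []
    for name in arg_names:
        if name not in index_of:
            index_of[name] = len(unique)
            unique.append(name)
        positions.append(index_of[name])
    return unique, positions


# The same input buffer appears twice, followed by an output buffer.
unique, positions = dedupe_args(["arg0_1", "arg0_1", "buf0"])
```

Keeping the per-position indices lets the caller still map each original operand back to the single kernel parameter that now represents it.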

Refer to https://github.com/pytorch/pytorch/issues/105212#issuecomment-1637168866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105315
Approved by: https://github.com/jansel
2023-07-20 07:46:46 +00:00
Bin Bao
528ab477ce [reland][inductor] Register an op for mm_plus_mm (#105153)
Summary: Reland https://github.com/pytorch/pytorch/pull/104835 after fixing internal build issues

Test Plan: CI

Differential Revision: D47442849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105153
Approved by: https://github.com/clee2000
2023-07-14 14:35:29 +00:00
Catherine Lee
c36dca7bc5 Revert "[inductor] Register an op for mm_plus_mm (#104835)" (#105150)
This reverts commit 9c46a1620c.

Actual revert referenced in https://github.com/pytorch/pytorch/pull/105149

#104835 is causing internal builds to fail

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105150
Approved by: https://github.com/atalman
2023-07-13 17:13:45 +00:00
Bin Bao
9c46a1620c [inductor] Register an op for mm_plus_mm (#104835)
Summary: Currently the aten version of mm_plus_mm has no cpp
implementation, and thus cpp_wrapper can not generate the correct cpp
function call for it.

Differential Revision: [D47372057](https://our.internmc.facebook.com/intern/diff/D47372057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104835
Approved by: https://github.com/jansel, https://github.com/SherlockNoMad
2023-07-12 02:34:02 +00:00
Jack Taylor
c9a806be28 [ROCm] enable additional inductor/dynamo UTs (#104624)
Enables additional inductor UTs on ROCm and removes outdated skips.

I have also removed a group of failures in `test_torchinductor_opinfo` that now pass on CUDA and ROCm:

```
-    # The following 3 tests fail on CUDA with AssertionError: expected size 5==5, stride 5==1 at dim=0
-    # linalg._svd's return value has different strides on CUDA vs CPU which causes this
-    # In test_meta.py there is a mechanism to skipping strides checks for some ops
-    # (including _linalg_svd), possibly we should have something similar here
-    "linalg.cond": {f32, f64},
-    "linalg.svdvals": {f32, f64},
-    "linalg.matrix_rank": {f32, f64},
-    "linalg.svd": {f32, f64},
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104624
Approved by: https://github.com/malfet
2023-07-11 20:44:02 +00:00
Adnan Akhundov
4911b80b8e [inductor] addmm + ReLU / GELU fusion pass (#104132)
Summary:

Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only on `max_autotune_gemm=False` and when the activation is fusible.
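
The shape of the rewrite can be sketched on a toy expression tree. This is an assumption-laden illustration, not the pattern-matcher code in `post_grad.py`: nodes are plain tuples, `fuse_addmm_activation` is a hypothetical helper, and the "activation is fusible" check is omitted.

```python
def fuse_addmm_activation(node, max_autotune_gemm=False):
    """Rewrite ("relu" | "gelu", ("addmm", bias, a, b)) into a single
    ("_addmm_activation", bias, a, b, use_gelu) node when allowed."""
    if max_autotune_gemm:
        return node  # the replacement only applies when this flag is off
    if (
        isinstance(node, tuple)
        and len(node) == 2
        and node[0] in ("relu", "gelu")
        and isinstance(node[1], tuple)
        and len(node[1]) == 4
        and node[1][0] == "addmm"
    ):
        _, (_, bias, a, b) = node
        use_gelu = node[0] == "gelu"
        return ("_addmm_activation", bias, a, b, use_gelu)
    return node  # anything else is left untouched


relu_graph = ("relu", ("addmm", "bias", "a", "b"))
gelu_graph = ("gelu", ("addmm", "bias", "a", "b"))
fused_relu = fuse_addmm_activation(relu_graph)
fused_gelu = fuse_addmm_activation(gelu_graph)
```

The last element of the fused node plays the role of `use_gelu`: `False` for ReLU and `True` for GELU.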

Test Plan:

$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v

(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
  warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok

----------------------------------------------------------------------
Ran 1 test in 13.415s

OK

Reviewers: @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
2023-07-10 16:44:14 +00:00
Jack Taylor
ede1965f5d Enable additional inductor test suites on ROCm (#102270)
Enables additional inductor UTs on ROCm, following from https://github.com/pytorch/pytorch/pull/100981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102270
Approved by: https://github.com/malfet
2023-06-22 00:36:35 +00:00
Edward Z. Yang
bc6ec97e02 Switch dynamic_shapes to True by default (#103597)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103597
Approved by: https://github.com/voznesenskym
2023-06-15 15:16:20 +00:00
Bin Bao
fbbde8df69 [inductor] fix a numel expr codegen issue (#103005)
Summary: Correctly use pexpr or cexpr for generating symbolic expression
during wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103005
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
Bin Bao
49577c7e47 [inductor] Turn off autotune_cublasLt for cpp_wrapper (#103004)
Summary: bias_addmm is not backed by a cpp function, so turn off
autotune_cublasLt for cpp_wrapper + max_autotune. We can add a cpp
function implementation if there is a performance need.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103004
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
Bin Bao
44fdfd3222 [inductor] Support select_algorithm with cpp_wrapper (#103003)
Summary: This is one step towards getting cpp_wrapper to work with max_autotune.
Switch to using a unique kernel name to cache the generated cubin file.

This is a copy of https://github.com/pytorch/pytorch/pull/102738 to solve a ghstack issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103003
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
Jason Ansel
0c6f409cda [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-20 03:43:33 +00:00
PyTorch MergeBot
5f07c589b0 Revert "[inductor] Refactor RNG operators (#100064)"
This reverts commit 3bbf0683a1.

Reverted https://github.com/pytorch/pytorch/pull/100064 on behalf of https://github.com/izaitsevfb due to breaks inductor tests, see D45936056 ([comment](https://github.com/pytorch/pytorch/pull/100064#issuecomment-1552093728))
2023-05-17 21:16:41 +00:00
Jason Ansel
3bbf0683a1 [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-17 01:29:31 +00:00
Jack Taylor
187eb7ca88 Enable default workflow PyT 2.0 UTs on ROCm stack (#100981)
PR to enable default workflow PyTorch 2.0 unit tests for the ROCm stack.

- Enables all the dynamo unit test suites
- Enables some of the inductor unit test suites
  - `test_config`
  - `test_cpp_wrapper` (cpu only)
  - `test_minifier`
  - `test_standalone_compile`
  - `test_torchinductor_dynamic_shapes`
  - `test_torchinductor_opinfo`
  - `test_torchinductor`
  - `test_triton_wrapper`
- Introduces TEST_WITH_ROCM conditions for unit test skip/fail dictionaries in test_torchinductor_dynamic_shapes.py and test_torchinductor_opinfo.py

Note that this PR follows on from the discussions on the previous UT enablement PR https://github.com/pytorch/pytorch/pull/97988. We have opted to enable only a few inductor suites for now to ease the upstreaming effort, as these files are changing very quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100981
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-05-15 23:45:04 +00:00
Jason Ansel
5079bf3df6 [inductor] Add variable names to MemoryDep (#100308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100308
Approved by: https://github.com/eellison
2023-05-08 20:08:58 +00:00
PyTorch MergeBot
629377ea8b Revert "Replace _dynamo.config with an object instead of module (#96455)"
This reverts commit 420104a886.

Reverted https://github.com/pytorch/pytorch/pull/96455 on behalf of https://github.com/jansel due to BC breaking, was landed prematurely
2023-04-12 15:06:14 +00:00
Han Qi
420104a886 Replace _dynamo.config with an object instead of module (#96455)
Summary:
    Replace _dynamo.config with an object instead of module

    Current usage patterns of setting and reading fields on config will work
    unchanged.

    Only changes needed going forward:
    1. import torch._dynamo.config will not work. However, just doing
       import torch._dynamo is sufficient to access dynamo config
       as torch._dynamo.config.

    2. Files inside the _dynamo folder need to access config via
       from torch._dynamo.config_util import config instead of
       from torch._dynamo import config, because _dynamo/__init__.py
       imports some of those files and that would create a circular import.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455
Approved by: https://github.com/williamwen42
2023-04-11 21:23:32 +00:00
Shunting Zhang
c681c52e01 [inductor] fix TritonTemplateCaller.__str__ (#97578)
We previously removed TritonTemplateCaller.to_callable, but the method is still called in `TritonTemplateCaller.__str__`. As a result, the base class's to_callable is used and raises an exception.

This PR fixes TritonTemplateCaller.__str__ to return the string representation without calling to_callable.
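
The failure mode and the fix can be sketched with two toy classes. The names `ChoiceCaller`, `TemplateCaller`, and `module_path` below are hypothetical stand-ins chosen for illustration; the real classes carry more state and may format their string differently.

```python
class ChoiceCaller:
    def __init__(self, name):
        self.name = name

    def to_callable(self):
        # Base-class default: no callable form is available here.
        raise NotImplementedError

    def __str__(self):
        # Depends on to_callable, so it fails for the base class.
        return f"ChoiceCaller({self.to_callable()})"


class TemplateCaller(ChoiceCaller):
    def __init__(self, name, module_path):
        super().__init__(name)
        self.module_path = module_path

    def __str__(self):
        # Build the string from stored fields only; to_callable is not used.
        return f"TemplateCaller({self.module_path})"


description = str(TemplateCaller("mm", "/tmp/kernel.py"))
```

Because the subclass formats itself from data it already holds, printing or logging it no longer depends on a method the subclass does not implement.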

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97578
Approved by: https://github.com/nmacchioni, https://github.com/ngimel
2023-03-30 20:23:02 +00:00
Christian Puhrsch
9d37cefcb0 Resubmit _int_mm (#96685)
Avoids any changes to gemm_and_bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96685
Approved by: https://github.com/drisspg, https://github.com/ngimel
2023-03-27 16:14:07 +00:00
Natalia Gimelshein
f09347a9f1 [inductor] Fix broadcast of random seed in mm epilogue (#97591)
Fixes #96468, #97553

In matmul codegen epilogue we use `mask` shape to infer the broadcasted shape in case we need to broadcast a scalar value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97591
Approved by: https://github.com/jansel
2023-03-26 03:35:03 +00:00
Jason Ansel
9370f253e3 [inductor] Rewrite convolution triton templates (#95556)
Fixes #95775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95556
Approved by: https://github.com/Chillee, https://github.com/ngimel
2023-03-22 18:12:23 +00:00
Christian Puhrsch
0a53c9624a Back out "Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)" (#96885)
Summary:
Backing out "Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)"

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96885
Approved by: https://github.com/drisspg
2023-03-16 05:32:55 +00:00
Christian Puhrsch
1fe2a9d122 Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)
Add an _int_mm primitive that binds the cuBLAS int8@int8 -> int32 matmul and translates to Triton-based mm templates under max autotune. This is a very useful first step towards better supporting quantization on the GPU. This is not a user-facing API, but an internal primitive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94339
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-27 20:27:25 +00:00
Jason Ansel
6e61629f10 [inductor] Refactors/improvements to max-autotune (#95554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95554
Approved by: https://github.com/ngimel, https://github.com/nmacchioni
2023-02-26 22:39:04 +00:00
Natalia Gimelshein
8e391c735f use 4 warps for small block config in mm (#95339)
Temporary Fix for #95312
In Triton, one warp computes a 16x16 tile of output, so a 32x32 block needs only 4 warps. Using 8 warps triggers an illegal memory access (IMA), which is a bug, but it is not a good config anyway.
Triton main is supposed to behave better for these pathological configs, but we are not on main yet.
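
The arithmetic behind the 4-warp figure, taking the 16x16-per-warp tile size stated above as given, can be written out as a small helper. `warps_needed` is a hypothetical name used only for this illustration.

```python
def warps_needed(block_m: int, block_n: int, tile: int = 16) -> int:
    """Warps required if each warp covers one tile x tile patch of output."""
    return (block_m // tile) * (block_n // tile)


small_block_warps = warps_needed(32, 32)  # (32 // 16) * (32 // 16) == 4
```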

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95339
Approved by: https://github.com/ezyang, https://github.com/Chillee
2023-02-23 03:03:42 +00:00
Nicolas Macchioni
17d0b7f532 [pt2][inductor]global autotuning cache (#94922)
Summary:
this diff adds logic to handle a global autotuning cache, stored in json format at config.global_cache_path.

what is changing from `DiskCache`:
* `DiskCache` is renamed to `PersistentCache`
* the local cache is now stored as a single file in json format, located at `/tmp/torchinductor_{$USER}/local_cache`. the file contains a dictionary structure like `local_cache[name][inputs][choice]` where `name` is the type of operation, like `addmm`, `inputs` is the repr of the inputs, and `choice` is the hash of a `ChoiceCaller`. the stored value is the benchmark time for that `ChoiceCaller`.
* a global cache is added, initially stored at `fbcode/caffe2/torch/_inductor/global_cache`, with almost identical format as the local cache. since the global cache exists over different machines, there is an additional `dinfo` field, such that `global_cache[dinfo] = local_cache` (at least structure wise, there is no guarantee that the global cache and local cache share the same values). `dinfo` is just a repr of the cuda device properties.
* the autotuner will prioritize the global cache, and return values from there first, before looking in the local cache
* the autotuner will look in both the global cache and the local cache even when `max_autotune=False`, but will still only generate values if `max_autotune=True`.
* the autotuner will log global cache hits and misses to a scuba table (inductor_autotuning_cache) which will be used to update the global cache at regular intervals
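
The nested layout and the lookup order described above can be sketched with plain dictionaries. This is only an illustration of the structure `local_cache[name][inputs][choice]` and `global_cache[dinfo] = local_cache`; the `lookup` helper and all keys and timings below are made up.

```python
def lookup(global_cache, local_cache, dinfo, name, inputs, choice):
    """Return a cached benchmark time, checking the global cache first."""
    for cache in (global_cache.get(dinfo, {}), local_cache):
        timing = cache.get(name, {}).get(inputs, {}).get(choice)
        if timing is not None:
            return timing
    return None  # miss in both caches


# Hypothetical contents: op name -> repr of inputs -> choice hash -> time.
local_cache = {"addmm": {"[(8, 8), (8, 8)]": {"hash_a": 0.021}}}
global_cache = {
    "device_props_repr": {"addmm": {"[(8, 8), (8, 8)]": {"hash_b": 0.017}}}
}

hit_global = lookup(global_cache, local_cache, "device_props_repr",
                    "addmm", "[(8, 8), (8, 8)]", "hash_b")
hit_local = lookup(global_cache, local_cache, "device_props_repr",
                   "addmm", "[(8, 8), (8, 8)]", "hash_a")
miss = lookup(global_cache, local_cache, "device_props_repr",
              "addmm", "[(8, 8), (8, 8)]", "hash_c")
```

On a miss in both caches, a new value would be generated only when `max_autotune=True`, as the summary states.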

Test Plan: D43285472

Differential Revision: D42785435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94922
Approved by: https://github.com/jansel
2023-02-19 05:35:18 +00:00
Jason Ansel
45eadc2c4d ConfigModule for _{dynamo,inductor}.config (#93252)
This refactors the way dynamo/inductor configs are handled to check for invalid configs and add options like patching and serialization.
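
To make the three features concrete (rejecting invalid options, temporary patching, serialization), here is a minimal standalone sketch. The `Config` class and its option names are hypothetical and written for illustration; the ConfigModule introduced by this PR may be structured quite differently.

```python
import contextlib
import json


class Config:
    """Hypothetical config object that rejects unknown option names."""

    _defaults = {"max_autotune": False, "debug": False}

    def __init__(self):
        # Bypass our own __setattr__ when creating the backing dict.
        object.__setattr__(self, "_values", dict(self._defaults))

    def __getattr__(self, name):
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(f"unknown config option: {name}") from None

    def __setattr__(self, name, value):
        if name not in self._values:
            raise AttributeError(f"unknown config option: {name}")
        self._values[name] = value

    @contextlib.contextmanager
    def patch(self, **changes):
        # Temporarily override options, restoring the old values on exit.
        saved = {key: getattr(self, key) for key in changes}
        try:
            for key, value in changes.items():
                setattr(self, key, value)
            yield
        finally:
            self._values.update(saved)

    def dumps(self):
        return json.dumps(self._values, sort_keys=True)


config = Config()
with config.patch(max_autotune=True):
    inside = config.max_autotune
outside = config.max_autotune
```

Routing every write through one validation point is what lets a misspelled option fail loudly instead of silently creating a new attribute.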

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93252
Approved by: https://github.com/voznesenskym
2023-02-01 19:38:05 +00:00
Jason Ansel
8c09a005c5 [inductor] Pattern matching engine (copy) (#93291)
This is an exact duplicate of https://github.com/pytorch/pytorch/pull/90739

The fbcode workflow for landing that diff seems buggy. The github-export-checks task is failing with credentials errors. The plan is to try landing it using GH1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93291
Approved by: https://github.com/desertfire
2023-01-31 04:51:00 +00:00
Jason Ansel
7c1c239db1 [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91575)
This reverts commit 94262efc7d to reland #91105 / #90738.

Fixes https://github.com/pytorch/torchdynamo/issues/2015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91575
Approved by: https://github.com/ngimel
2023-01-11 00:08:03 +00:00
PyTorch MergeBot
94262efc7d Revert "[inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)"
This reverts commit d6dd2e97da.

Reverted https://github.com/pytorch/pytorch/pull/91105 on behalf of https://github.com/atalman due to Broke internal builds
2022-12-21 00:02:38 +00:00
Jason Ansel
d6dd2e97da [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)
https://github.com/pytorch/pytorch/pull/90738 seems a bit borked. ghimport fails on it, and I unlinked it from the Phabricator diff, but it still won't land. This is an exact copy of that PR without using ghstack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91105
Approved by: https://github.com/ngimel
2022-12-20 02:38:23 +00:00