Summary:
Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only when `max_autotune_gemm=False` and the activation is fusible.
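For illustration only, a pure-Python sketch of the equivalence this replacement relies on, shown for the ReLU case: applying the activation after addmm yields the same values as applying it while each output element is produced. The helper names (`addmm`, `relu`, `addmm_relu_fused`) are hypothetical and operate on nested lists; this is not the `post_grad.py` pattern or the ATen kernel.

```python
def addmm(bias, mat1, mat2):
    """Reference addmm: bias + mat1 @ mat2, with bias broadcast over rows."""
    inner, cols = len(mat2), len(mat2[0])
    return [
        [bias[j] + sum(row[k] * mat2[k][j] for k in range(inner)) for j in range(cols)]
        for row in mat1
    ]


def relu(matrix):
    """Elementwise ReLU applied as a separate step."""
    return [[max(0.0, x) for x in row] for row in matrix]


def addmm_relu_fused(bias, mat1, mat2):
    """Hypothetical fused form: ReLU is applied as each output element is computed."""
    inner, cols = len(mat2), len(mat2[0])
    return [
        [
            max(0.0, bias[j] + sum(row[k] * mat2[k][j] for k in range(inner)))
            for j in range(cols)
        ]
        for row in mat1
    ]


bias = [1.0, -10.0]
mat1 = [[1.0, 2.0], [3.0, 4.0]]
mat2 = [[1.0, 0.0], [0.0, 1.0]]

unfused = relu(addmm(bias, mat1, mat2))
fused = addmm_relu_fused(bias, mat1, mat2)
```

Both forms produce the same matrix here; the fused form simply avoids materializing the intermediate addmm result.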
Test Plan:
$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v
(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok
----------------------------------------------------------------------
Ran 1 test in 13.415s
OK
Reviewers: @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
Summary: bias_addmm is not backed by a cpp function, so turn off
autotune_cublasLt for cpp_wrapper + max_autotune. We can add a cpp
function implementation if there is a performance need.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103004
Approved by: https://github.com/jansel
PR to enable default workflow PyTorch 2.0 unit tests for the ROCm stack.
- Enables all the dynamo unit test suites
- Enables some of the inductor unit test suites
  - `test_config`
  - `test_cpp_wrapper` (cpu only)
  - `test_minifier`
  - `test_standalone_compile`
  - `test_torchinductor_dynamic_shapes`
  - `test_torchinductor_opinfo`
  - `test_torchinductor`
  - `test_triton_wrapper`
- Introduces TEST_WITH_ROCM conditions for unit test skip/fail dictionaries in test_torchinductor_dynamic_shapes.py and test_torchinductor_opinfo.py
Note: this PR follows on from the discussions in the previous UT enablement PR https://github.com/pytorch/pytorch/pull/97988. We have opted to enable only a few inductor suites for now to ease the upstreaming effort, as these files are changing very quickly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100981
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
Summary:
Replace _dynamo.config with an object instead of a module.
Current usage patterns of setting and reading fields on config will work
unchanged.
Only changes needed going forward:
1. import torch._dynamo.config will not work. However, just doing
import torch._dynamo is sufficient to access dynamo config
as torch._dynamo.config.
2. Files inside the _dynamo folder need to access config via
from torch._dynamo.config_util import config instead of
from torch._dynamo import config, because _dynamo/__init__.py
imports some of those files, so the latter would be a circular import.
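A minimal sketch of point 1, using an in-memory stand-in package rather than the real `torch._dynamo`. The names `fake_dynamo` and `ConfigObject` are hypothetical. The sketch shows that setting and reading fields works unchanged when config is an object attribute of the package, while importing it as a submodule fails.

```python
import sys
import types


class ConfigObject:
    """Hypothetical stand-in for a config that is an object, not a module."""

    def __init__(self, **defaults):
        self.__dict__.update(defaults)


# Build a fake package in memory; `fake_dynamo` is an illustrative name only.
fake_dynamo = types.ModuleType("fake_dynamo")
fake_dynamo.__path__ = []  # mark it as a package with no submodules to find
fake_dynamo.config = ConfigObject(verbose=False)
sys.modules["fake_dynamo"] = fake_dynamo

import fake_dynamo  # importing the package is enough to reach its config

fake_dynamo.config.verbose = True  # setting a field works as before
verbose_after_set = fake_dynamo.config.verbose  # reading a field works as before

try:
    import fake_dynamo.config  # config is an attribute, not a submodule
    submodule_import_ok = True
except ImportError:
    submodule_import_ok = False
```

The `import fake_dynamo.config` statement fails because the import system looks for a submodule named `config` and never consults the attribute of the same name.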
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455
Approved by: https://github.com/williamwen42
We removed `TritonTemplateCaller.to_callable` previously, but this method is still used in `TritonTemplateCaller.__str__`. As a result, the `to_callable` method in the base class is called instead and raises an exception.
This PR fixes `TritonTemplateCaller.__str__` to return the string representation without calling `to_callable`.
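A minimal sketch of the shape of this fix, using hypothetical classes (`ChoiceCaller`, `TemplateCaller`) and a hypothetical `module_path` field rather than the real inductor code: `__str__` builds its output from stored fields, so it no longer depends on a `to_callable` that the subclass does not provide.

```python
class ChoiceCaller:
    """Hypothetical base class whose default to_callable is not usable."""

    def __init__(self, name):
        self.name = name

    def to_callable(self):
        raise NotImplementedError(f"{type(self).__name__} has no to_callable")


class TemplateCaller(ChoiceCaller):
    """Hypothetical template caller that does not override to_callable."""

    def __init__(self, name, module_path):
        super().__init__(name)
        self.module_path = module_path  # assumed field, for illustration

    def __str__(self):
        # Build the description from stored fields only; never call to_callable().
        return f"TemplateCaller({self.module_path})"


caller = TemplateCaller("mm_template", "/tmp/example_kernel.py")
description = str(caller)

try:
    caller.to_callable()
    to_callable_raised = False
except NotImplementedError:
    to_callable_raised = True
```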
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97578
Approved by: https://github.com/nmacchioni, https://github.com/ngimel
Add an `_int_mm` primitive that binds the cuBLAS int8@int8 -> int32 matmul and translates to Triton-based mm templates under max autotune. This is a very useful first step towards better supporting quantization on the GPU. This is not a user-facing API, but an internal primitive.
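To illustrate why the output dtype is int32, here is a pure-Python reference of an int8 @ int8 matmul with wide accumulation. It is a sketch on nested lists with a hypothetical helper name (`int8_matmul_int32`), not the cuBLAS binding or the Triton template.

```python
INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def int8_matmul_int32(a, b):
    """Reference int8 @ int8 -> int32 matmul on nested lists of Python ints."""
    for matrix in (a, b):
        for row in matrix:
            for value in row:
                if not INT8_MIN <= value <= INT8_MAX:
                    raise ValueError(f"{value} does not fit in int8")
    inner = len(b)
    out = []
    for row in a:
        out_row = []
        for j in range(len(b[0])):
            acc = sum(row[k] * b[k][j] for k in range(inner))
            # Each product is at most 128 * 128 = 16384 in magnitude, so the sum
            # fits in int32 whenever the inner dimension is below 131072.
            if not INT32_MIN <= acc <= INT32_MAX:
                raise OverflowError("accumulator does not fit in int32")
            out_row.append(acc)
        out.append(out_row)
    return out


product = int8_matmul_int32([[127, 127]], [[127], [127]])
```

Even this 1x2 by 2x1 product is far outside the int8 range, which is why the accumulator and result need a wider type.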
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94339
Approved by: https://github.com/ngimel, https://github.com/jansel
Temporary Fix for #95312
In Triton, 1 warp computes a 16x16 tile of output, so for a 32x32 block we only need 4 warps. With 8 warps we hit an IMA (illegal memory access), which is a bug, but it's not a good config anyway.
Triton main is supposed to have better behavior for these pathological cases, but we are not on main yet.
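The warp count above follows from simple arithmetic, sketched here with a hypothetical helper (`warps_needed`). The 16x16-per-warp tile size is taken from the description above and is an assumption about this Triton version, not a general rule.

```python
def warps_needed(block_m, block_n, tile_m=16, tile_n=16):
    """Number of warps so that each warp covers exactly one tile of the output block."""
    if block_m % tile_m or block_n % tile_n:
        raise ValueError("block dimensions must be multiples of the tile dimensions")
    return (block_m // tile_m) * (block_n // tile_n)


warps_for_32x32 = warps_needed(32, 32)  # (32 // 16) * (32 // 16)
```

Under this model, a config that asks for 8 warps on a 32x32 block has more warps than tiles to compute.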
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95339
Approved by: https://github.com/ezyang, https://github.com/Chillee
Summary:
this diff adds logic to handle a global autotuning cache, stored in json format at config.global_cache_path.
what is changing from `DiskCache`:
* `DiskCache` is renamed to `PersistentCache`
* the local cache is now stored as a single file in json format, located at `/tmp/torchinductor_{$USER}/local_cache`. the file contains a dictionary structure like `local_cache[name][inputs][choice]` where `name` is the type of operation, like `addmm`, `inputs` is the repr of the inputs, and `choice` is the hash of a `ChoiceCaller`. the stored value is the benchmark time for that `ChoiceCaller`.
* a global cache is added, initially stored at `fbcode/caffe2/torch/_inductor/global_cache`, with an almost identical format to the local cache. since the global cache is shared across different machines, there is an additional `dinfo` field, such that `global_cache[dinfo] = local_cache` (at least structure-wise; there is no guarantee that the global cache and local cache share the same values). `dinfo` is just a repr of the cuda device properties.
* the autotuner will prioritize the global cache, and return values from there first, before looking in the local cache
* the autotuner will look in both the global cache and the local cache even when `max_autotune=False`, but will still only generate values if `max_autotune=True`.
* the autotuner will log global cache hits and misses to a scuba table (inductor_autotuning_cache) which will be used to update the global cache at regular intervals
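The lookup order described above can be sketched as follows. This is a minimal illustration on plain dicts, not the `PersistentCache` implementation: the function names, the sample keys, and the choice to store freshly benchmarked values in the local cache are all assumptions.

```python
import json


def lookup_timing(global_cache, local_cache, dinfo, name, inputs, choice):
    """Return (source, time), checking the global cache before the local one."""
    try:
        return "global", global_cache[dinfo][name][inputs][choice]
    except KeyError:
        pass
    try:
        return "local", local_cache[name][inputs][choice]
    except KeyError:
        return None, None


def get_timing(global_cache, local_cache, dinfo, name, inputs, choice,
               max_autotune, benchmark):
    """Look up a timing; benchmark on a miss only when max_autotune is set."""
    source, timing = lookup_timing(global_cache, local_cache, dinfo, name, inputs, choice)
    if timing is None and max_autotune:
        timing = benchmark()
        # Assumption: newly generated values are written to the local cache.
        local_cache.setdefault(name, {}).setdefault(inputs, {})[choice] = timing
        source = "benchmarked"
    return source, timing


# Sample caches in the json layout described above (all keys are made up).
local_cache = json.loads('{"addmm": {"inputs_repr": {"choice_a": 0.5}}}')
global_cache = json.loads('{"device_repr": {"addmm": {"inputs_repr": {"choice_b": 0.25}}}}')

hit_global = get_timing(global_cache, local_cache, "device_repr", "addmm",
                        "inputs_repr", "choice_b", False, lambda: 1.0)
hit_local = get_timing(global_cache, local_cache, "device_repr", "addmm",
                       "inputs_repr", "choice_a", False, lambda: 1.0)
miss_no_autotune = get_timing(global_cache, local_cache, "device_repr", "addmm",
                              "inputs_repr", "choice_c", False, lambda: 1.0)
miss_autotune = get_timing(global_cache, local_cache, "device_repr", "addmm",
                           "inputs_repr", "choice_c", True, lambda: 1.0)
```

Note that both caches are consulted even with `max_autotune=False`; only the benchmarking step is gated.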
Test Plan: D43285472
Differential Revision: D42785435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94922
Approved by: https://github.com/jansel