If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.
Profiling detach() with Linux perf and code similar to the benchmark from #160580, I observed the time spent in SymInt functions inside computeStorageNBytes drop after this change, without the cost shifting elsewhere in the function.
Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161586
Approved by: https://github.com/ezyang
ghstack dependencies: #161466
Summary:
This is a re-land of [PR161040](https://github.com/pytorch/pytorch/pull/161040), which had previously caused test failures on AMD GPUs. The tests are now configured to target only NVIDIA GPUs.
This diff removes configurations whose shared-memory requirement exceeds the hardware limit and would otherwise cause the following compilation error:
```
No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
```
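For intuition, the shared-memory footprint of a Triton mm config scales with its block sizes and pipeline depth. Below is a minimal sketch of such a filter, assuming the usual A-tile/B-tile pipelining scheme; the helper names and exact formula are illustrative, not the actual inductor heuristic:
```
def estimated_smem_bytes(block_m, block_n, block_k, num_stages, dtype_bytes=2):
    # Assume each pipeline stage keeps one A tile (BLOCK_M x BLOCK_K) and one
    # B tile (BLOCK_K x BLOCK_N) resident in shared memory.
    return (block_m * block_k + block_k * block_n) * dtype_bytes * num_stages

def filter_configs(configs, hw_smem_limit):
    # configs: iterable of dicts with block_m/block_n/block_k/num_stages keys.
    # Drop candidate configs whose estimated footprint exceeds the hardware limit.
    return [c for c in configs if estimated_smem_bytes(**c) <= hw_smem_limit]
```
For example, BLOCK_M = BLOCK_N = BLOCK_K = 128 with num_stages = 5 and 2-byte tiles gives (128*128 + 128*128) * 2 * 5 = 327680 bytes, matching the "Required" value in the error above and well past the reported 232448-byte hardware limit.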
Test Plan:
```
pytest test/inductor/test_max_autotune.py
pytest test/inductor/test_triton_heuristics.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161996
Approved by: https://github.com/coconutruben
As the title stated.
As the document grows, its content keeps expanding. To make it easier for users to read and for developers to maintain, we have split this file into several separate files and placed them in a dedicated directory called "accelerator".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161845
Approved by: https://github.com/albanD
Adds fallback commands for the following:
* python setup.py install
* python setup.py develop
Ideally these should just work and provide backwards compatibility.
The thought process here is that many people rely on these commands, and just because setuptools wants to drop support for them doesn't mean our downstream users who build from source expect them to be gone.
This should give developers some room to move away from these commands until we have a unified frontend that abstracts most of this away.
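For illustration only, a fallback shim along these lines could sit at the top of setup.py and forward the legacy commands to their pip equivalents; the exact flags and messaging here are assumptions, not the actual implementation:
```
import subprocess
import sys

# Hypothetical mapping from legacy setuptools commands to pip invocations.
_FALLBACKS = {
    "install": [sys.executable, "-m", "pip", "install", "."],
    "develop": [sys.executable, "-m", "pip", "install", "-e", "."],
}

if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in _FALLBACKS:
    # Forward the legacy command and exit with pip's return code.
    print(f"'python setup.py {sys.argv[1]}' is deprecated; forwarding to pip.")
    sys.exit(subprocess.call(_FALLBACKS[sys.argv[1]]))
```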
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162009
Approved by: https://github.com/clee2000, https://github.com/atalman
# why
- some templates, e.g. scaled_mm, need to unsqueeze/squeeze the nodes for codegen and heuristics
- this gives a unified place where we can adjust them for the template
# what
- inside get_mm_configs, do not just return the passed-in kernel inputs; instead let the template heuristic adjust them if necessary (see the sketch below)
- the default implementation right now just passes them back unchanged
this diff only adds the functionality; nothing exercises it beyond the default (passthrough)
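A minimal sketch of the passthrough behavior described above, using hypothetical class and method names rather than the real inductor API:
```
class TemplateHeuristic:
    def adjust_kernel_inputs(self, kernel_inputs):
        # Default: hand the inputs back unchanged. A template such as
        # scaled_mm could override this to unsqueeze/squeeze nodes before
        # codegen and heuristics see them.
        return kernel_inputs
```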
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520572](https://our.internmc.facebook.com/intern/diff/D81520572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161339
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125, #161126, #161336, #161338
# why
- another step towards get_mm_configs providing all the kwargs needed to add a choice from a template. This in turn will allow us to send all templates through one single call and handle modifications in one place
# what
- use the template-heuristic infrastructure for extra kwargs that are fixed for a template/op pair to supply the suffix args and epilogue function for scaled_mm
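As a rough illustration of the idea, the fixed kwargs a heuristic might hand back for scaled_mm could look like this; the key names, values, and helper are assumptions for the sketch, not the real template interface:
```
def scaled_mm_epilogue(acc, scale_a, scale_b):
    # Placeholder epilogue: apply both scales to the accumulator.
    return acc * scale_a * scale_b

def extra_kwargs_for_scaled_mm():
    # Fixed for the template/op pair: every scaled_mm choice gets the same
    # suffix-arg count and epilogue function.
    return {
        "suffix_args": 2,                   # e.g. the scale inputs appended after the main operands
        "epilogue_fn": scaled_mm_epilogue,  # applied to the accumulator during codegen
    }
```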
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D80670914](https://our.internmc.facebook.com/intern/diff/D80670914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161126
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124, #161125
# why
- another step towards get_mm_configs providing all the kwargs needed to add a choice from a template. This in turn will allow us to send all templates through one single call and handle modifications in one place
# what
- use the template-heuristic infrastructure for extra kwargs that are fixed for a template/op pair to supply the prefix args and epilogue function for addmm/baddbmm
- expand KernelInputs so it can also shuttle around non-tensor inputs (scalars), as needed for alpha and beta (see the sketch below)
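A small sketch of what carrying scalars alongside tensor inputs could look like; the class shape here is illustrative, not the actual KernelInputs definition:
```
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class KernelInputsSketch:
    nodes: List[Any]                                       # tensor-like inputs for the template
    scalars: Dict[str, Any] = field(default_factory=dict)  # e.g. {"alpha": 1.0, "beta": 1.0}

    def scalar(self, name, default=None):
        # Fetch a non-tensor input such as alpha/beta for addmm/baddbmm.
        return self.scalars.get(name, default)
```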
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k addmm
```
Differential Revision: [D80670912](https://our.internmc.facebook.com/intern/diff/D80670912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161125
Approved by: https://github.com/jansel
ghstack dependencies: #161123, #161124
# why
- another step towards get_mm_configs providing all the kwargs needed to add a choice from a template. This in turn will allow us to send all templates through one single call and handle modifications in one place
# what
- use the template-heuristic infrastructure for extra kwargs that are fixed for a template/op pair to supply the workspace_arg for all the TMA templates
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v -k tma
```
Differential Revision: [D80670915](https://our.internmc.facebook.com/intern/diff/D80670915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161124
Approved by: https://github.com/jansel
ghstack dependencies: #161123
# why
- some kwargs are not choice dependent but rather always the same for a specific op or template
- this lets us track those separately from the per-choice ones, which makes intercepting them cleaner
- maybe_append_choices can then be simplified to just pass through the kwargs
# what
- hook up template heuristics so they can have per template/op extra kwargs that are always the same for all choices
- hook up the caller of get_mm_configs so it can provide template/op kwargs that override some of the template/choice kwargs (see the sketch below)
this PR does not use the new machinery yet; everything is empty for now. Subsequent PRs start using it to simplify ops
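A compact sketch of the intended layering, with hypothetical names (the real hookup lives in the template heuristics and get_mm_configs):
```
def build_choice_kwargs(heuristic, op_name, choice_kwargs, caller_overrides=None):
    # Start from the per-choice kwargs, add the kwargs that are fixed for this
    # template/op pair, then let the caller of get_mm_configs override some of
    # them. In this PR the extra dicts are still empty, so this is a no-op.
    kwargs = dict(choice_kwargs)
    kwargs.update(heuristic.get_extra_kwargs(op_name))  # hypothetical accessor
    kwargs.update(caller_overrides or {})
    return kwargs
```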
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D80670916](https://our.internmc.facebook.com/intern/diff/D80670916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161123
Approved by: https://github.com/jansel
The functions here are expected to return pointers but currently don't return anything; make them return NULL.
The properties array needs an extra set of braces: one pair for the array, another for the first item in the array.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161792
Approved by: https://github.com/Skylion007
As titled: make the `getmem` calls in the loop non-blocking so that we max out the issuance rate.
Also add a single `nvshmem_quiet()` at the end to make sure all the getmem calls complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162006
Approved by: https://github.com/ngimel
Previously, without calling `torch.empty` before NVSHMEM init, we saw the error below:
```
src/host/init/init.cu:nvshmemi_check_state_and_init:1117: nvshmem initialization failed, exiting
src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
```
Fix it by calling `cudaFree(nullptr)` to make sure the CUDA runtime is initialized before NVSHMEM init.
Remove all `torch.empty(1)` calls from the tests.
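For reference, the workaround the tests had to carry looked roughly like this (illustrative, not the exact test code):
```
import torch

# Old workaround: allocate a dummy CUDA tensor purely to force CUDA runtime
# initialization before NVSHMEM init; with cudaFree(nullptr) called in the
# extension, this is no longer needed.
_ = torch.empty(1, device="cuda")
```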
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161232
Approved by: https://github.com/ngimel
ghstack dependencies: #161214
We previously assumed AOT precompile should only work on non-closures. This is hard to enforce in practice because we see a lot of cases with decorators (e.g. Hugging Face models):
```
def check_inputs(fn):
    def _fn(*args, **kwargs):
        for arg in args:
            assert arg.shape[0] > 1
        return fn(*args, **kwargs)
    return _fn

@check_inputs
def foo(x, y):
    a = x + x
    b = y + y
    c = a + b
    return c
```
It doesn't make sense not to support these cases since they are straightforward to handle.
This PR adds the logic to handle closures and makes sure they can be precompiled properly.
Differential Revision: [D81509535](https://our.internmc.facebook.com/intern/diff/D81509535/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161990
Approved by: https://github.com/angelayi