This PR adds the autotune infrastructure for CPU. It generalizes and extends `BenchmarkRequest` with CPU support and a C++ module loader. A `do_bench_cpu` utility function is added for benchmarking functions on CPU; it performs warmup runs and returns the median time from multiple trials.
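As a rough illustration of the benchmarking idea (the actual `do_bench_cpu` signature and defaults may differ), the utility boils down to something like:
```python
import time
import statistics

def do_bench_cpu_sketch(fn, warmup=5, rep=20):
    # Warm up to stabilize caches and any lazy initialization.
    for _ in range(warmup):
        fn()
    # Time each trial and report the median, in milliseconds.
    timings = []
    for _ in range(rep):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.median(timings)
```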
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125159
Approved by: https://github.com/jansel
The scheduler searches for fusion opportunities by looking for common memory accesses. Two memory accesses are considered common not only when the buffer names match; it also requires that (see the sketch after this list):
- index formula matches
- var_ranges matches
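A minimal sketch of that matching rule (attribute names here are illustrative, not the scheduler's actual fields):
```python
def accesses_match(a, b):
    # Two memory accesses only count as "common" for fusion purposes if the
    # buffer name, the symbolic index formula, and the var_ranges all agree.
    return (
        a.buffer_name == b.buffer_name
        and a.index == b.index
        and a.var_ranges == b.var_ranges
    )
```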
In this PR, I want to log all the fusion failures due to mismatched index formulas or var_ranges. I also want to further categorize the failures. Right now I have found the following failure categories:
- rand_seed: the index for the rand seed access is an integer, and different accesses use different integer offsets
- different numel: this happens for the cat operation
- broadcast: e.g. kernel A writes a buffer which is broadcast and read by kernel B
- different loop orders: the major category we want inductor to be able to fuse
- different offset: happens when using a concatenated linear layer to project Q/K/V and then splitting the result. Each split points to the same buffer with a different offset (see the sketch after this list).
- unknown
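For instance, the "different offset" category can be reproduced conceptually with a fused QKV projection (purely illustrative; not taken from the logged models):
```python
import torch
import torch.nn as nn

hidden = 64
qkv_proj = nn.Linear(hidden, 3 * hidden)
x = torch.randn(8, hidden)
# The three splits are views into the same output buffer at different offsets,
# so their accesses share a buffer name but not an index formula.
q, k, v = qkv_proj(x).split(hidden, dim=-1)
```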
My hope is to make sure that, for the models I tested, there are no fusion failures falling into the unknown category, so all the failures are well understood and categorized. Right now that is true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2 ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 )
For BertForMaskedLM, we found 82 instances of fusion failures, and the majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatches come from in real models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986
Approved by: https://github.com/eellison, https://github.com/jansel
This diff implements a remote caching strategy (memcache for internal use and Redis for external use) for caching the mapping from an Inductor FX graph to the Inductor-generated wrapper file.
It uses the same idea as the autotuning result cache that is currently live.
This will land turned off. Before turning it on by default, I will do more testing, including looking at the dynamic shape guards added by Inductor.
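Conceptually, the remote path looks like the following sketch (hypothetical helper names, shown with the Redis backend; the real key derivation and value serialization are not shown):
```python
import redis  # external/OSS backend; internally memcache plays the same role

def get_or_compile(key: str, compile_fn, client: redis.Redis):
    cached = client.get(key)
    if cached is not None:
        return cached.decode()       # remote hit: reuse the generated wrapper
    wrapper_source = compile_fn()    # miss: run Inductor and populate the cache
    client.set(key, wrapper_source)
    return wrapper_source
```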
Differential Revision: [D56441624](https://our.internmc.facebook.com/intern/diff/D56441624/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124669
Approved by: https://github.com/jansel, https://github.com/eellison
On macOS 14.4, the system Python is configured to point to a non-existent include dir
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('include'))"
/Library/Python/3.9/include
```
Work around the issue by composing the path to the include folder from the `stdlib` config, which points to
```
% /usr/bin/python3 -c "import sysconfig;print(sysconfig.get_path('stdlib'))"
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9
```
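A sketch of the workaround (the exact path composition in the PR may differ):
```python
import os
import sysconfig

include_dir = sysconfig.get_path("include")
if not os.path.exists(include_dir):
    # Derive the include dir from the stdlib path, which does exist, e.g.
    # .../Versions/3.9/lib/python3.9 -> .../Versions/3.9/include/python3.9
    stdlib = sysconfig.get_path("stdlib")
    version = f"python{sysconfig.get_python_version()}"
    include_dir = os.path.join(
        os.path.dirname(os.path.dirname(stdlib)), "include", version
    )
```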
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125285
Approved by: https://github.com/kit1980
Summary: Discovered breakages by enabling codecache by default and doing a CI run. I'll commit these fixes first and eventually enabling caching by default will (hopefully) be a one-liner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125258
Approved by: https://github.com/eellison
This doesn't introduce any new behavior, but sets up a basic cache key generation mechanism that I can test (see the sketch after this list). From here I will:
- Add checks on the ops in an input FXGraph to make sure they are safe to cache. We'll be conservative in the first version here.
- Add serialization for FX graphs
- Save these FX graphs to disk in the cache
- Support graphs with more complicated ops like higher order ops and specialized nn modules
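A hypothetical sketch of what such a cache key could look like (the actual key generation covers more state than this):
```python
import hashlib

def fx_graph_cache_key(gm, example_inputs, config_repr: str) -> str:
    # Hash the printed FX graph, the input metadata, and a config string.
    h = hashlib.sha256()
    h.update(str(gm.graph).encode())
    for t in example_inputs:
        h.update(f"{t.dtype}:{tuple(t.shape)}:{t.device}".encode())
    h.update(config_repr.encode())
    return h.hexdigest()
```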
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124642
Approved by: https://github.com/aorenste
In #124640 I see the error
```
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 887, in load
compiled_graph = FxGraphCache._lookup_graph(key, example_inputs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 776, in _lookup_graph
write_atomic(artifact_path, graph.source_code)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 412, in write_atomic
with tmp_path.open(write_mode) as f:
File "/opt/conda/envs/py_3.10/lib/python3.10/pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp02wlik2v/iu/.28383.139931139675904.tmp'
```
Which is fixed by creating the parent directory first. Since this is what you
want to do in most cases, I add an argument to `write_atomic` to do so itself.
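A minimal sketch of the fixed helper (names and signature are assumptions based on the traceback above):
```python
import os
from pathlib import Path

def write_atomic(path_str: str, content: str, make_dirs: bool = False) -> None:
    path = Path(path_str)
    if make_dirs:
        # The fix: make sure the parent directory exists before writing.
        path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.parent / f".{os.getpid()}.tmp"
    tmp_path.write_text(content)
    tmp_path.rename(path)  # atomic replace on POSIX filesystems
```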
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124646
Approved by: https://github.com/lezcano
Two changes:
- In epilogue benchmark fusion, only take the top 6 choices. There were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with the same key (see the sketch below).
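A sketch of the sharing idea (hypothetical names; the real code keys off the choice/benchmark key):
```python
_precompile_fns: dict = {}

def get_shared_precompile_fn(key, make_precompile_fn):
    # Matmuls that hash to the same key reuse one precompilation function
    # instead of each triggering its own compile.
    if key not in _precompile_fns:
        _precompile_fns[key] = make_precompile_fn()
    return _precompile_fns[key]
```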
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
Fixes #104729
This improves the compiled-mode performance of Softmax (by 20%) and of other operations (like batchnorm) that invoke the reduce_all function, and thereby also improves BERT inference by around 8%.
Tested on a Graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner.
Script attached below.
Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`
[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)
```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)

with torch.set_grad_enabled(False):
    for _ in range(50):
        compiled_model(inputs)  # Warmup
    print("Warmup over")

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(100):
                compiled_model(inputs)
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))

    # Check if the compiled model inference and the eager model inference are similar using torch.allclose
    print(torch.allclose(compiled_model(inputs), model(inputs)))
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584
Approved by: https://github.com/jgong5, https://github.com/malfet
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the top-level cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base Inductor TestCase class to use fresh_inductor_cache(). Previously, that TestCase was only mocking the subdirectory within the top-level cache dir designated for the FX graph cache artifacts.
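An illustrative, standalone sketch of the registration pattern (not the actual PyTorch helpers):
```python
from functools import lru_cache

_registered_caches = []

def clear_on_fresh_cache(cached_fn):
    # Register any object exposing cache_clear() so a fresh cache dir also
    # resets the in-process cached state.
    _registered_caches.append(cached_fn)
    return cached_fn

def clear_registered_caches():
    for c in _registered_caches:
        c.cache_clear()

@clear_on_fresh_cache
@lru_cache(None)
def expensive_lookup(key):
    return key
```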
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
Before this PR, we would pass the generated source code over a pipe to the compile worker, and then the compile worker would write out the file. The new approach is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case).
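A hypothetical sketch of the flow (names are illustrative): the parent writes the source file itself and hands the worker only a small message such as the path.
```python
from pathlib import Path

def compile_file(path: str):
    # Stand-in for the worker-side compile step.
    return path

def submit_compile(pool, cache_dir: str, key: str, source_code: str):
    path = Path(cache_dir) / f"{key}.py"
    path.write_text(source_code)                  # parent writes the file up front
    return pool.submit(compile_file, str(path))   # worker receives just the path
```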
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409
Approved by: https://github.com/desertfire