In #124640 I see the error
```
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 887, in load
compiled_graph = FxGraphCache._lookup_graph(key, example_inputs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 776, in _lookup_graph
write_atomic(artifact_path, graph.source_code)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 412, in write_atomic
with tmp_path.open(write_mode) as f:
File "/opt/conda/envs/py_3.10/lib/python3.10/pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp02wlik2v/iu/.28383.139931139675904.tmp'
```
Which is fixed by creating the parent directory first. Since this is what you
want to do in most cases, I add an argument to `write_atomic` to do so itself.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124646
Approved by: https://github.com/lezcano
Two changes:
- in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with same key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
Two changes:
- in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF.
- Share a single precompilation function among matmuls with same key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642
Approved by: https://github.com/shunting314
ghstack dependencies: #124030
Fixes#104729
This improves the compiled mode performance of Softmax (by 20%) and other operations (like batchnorm) that invoke the reduce_all function. Thereby also improves BERT inference by around 8%.
Tested on a graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner.
Script attached below.
Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py`
[TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt)
```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity
model = nn.Softmax().eval()
compiled_model = torch.compile(model)
inputs = torch.randn(1024, 1024)
with torch.set_grad_enabled(False):
for _ in range(50):
compiled_model(inputs) #Warmup
print("Warmup over")
with profile(activities=[ProfilerActivity.CPU]) as prof:
with record_function("model_inference"):
for _ in range(100):
compiled_model(inputs)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
# Check if the compiled model inference and the eager model inference are similar using torch.allclose
print(torch.allclose(compiled_model(inputs), model(inputs)))
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584
Approved by: https://github.com/jgong5, https://github.com/malfet
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file. Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409
Approved by: https://github.com/desertfire
Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file. Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409
Approved by: https://github.com/desertfire
But appending them to the end of the shared library and mmaping afterwards
Disabled by default, but overridable by `config.aot_inductor.force_mmap_weights`
Implemented by adding `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h` which is defined when weights are appended to the binary. In that case, shared library name is determined by calling `dladdr`, mmaped and finally checked against random magic number embedded at the end of the weights as well as in const section of the library in question
Added unites to validate that it works as expected
TODO:
- Extend support to CUDA
- munmap region if the same library is reused
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb
https://github.com/pytorch/pytorch/pull/123164 removed the below code (so that constants are not readonly) to support module buffer mutation:
a9a9ce6d9c/torch/_inductor/codecache.py (L1685-L1691)
However, it may cause relocation overflow when the `.data` section is large.
Below is part of the output from `ld --versbose` (`GNU ld (GNU Binutils for Ubuntu) 2.38`). `.data` is in between `.text` and `.bss`. When `.data` is too large, during the linking, the relocation of `.text` against `.bss` may overflow. Rename it to `.ldata` (perhaps that's why previously `.lrodata` instead of `.rodata` is used) so that it won't be in between the `.text` and `.bss` section
```
.text
.rodata
.data
.bss
.lrodata
.ldata
```
We met this issue when fixing https://github.com/pytorch/pytorch/issues/114450 and running the below models on CPU:
- AlbertForMaskedLM
- AlbertForQuestionAnswering
- BlenderbotForCausalLM
- DebertaV2ForMaskedLM
- DebertaV2ForQuestionAnswering
- XGLMForCausalLM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123639
Approved by: https://github.com/jgong5, https://github.com/desertfire
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
But appending them to the end of the shared library and mmaping afterwards
Disabled by default, but overridable by `config._force_mmap_aoti_weights`
Implemented by adding `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h` which is defined when weights are appended to the binary. In that case, shared library name is determined by calling `dladdr`, mmaped and finally checked against random magic number embedded at the end of the weights as well as in const section of the library in question
Added unites to validate that it works as expected
TODO:
- Extend support to CUDA
- munmap region if the same library is reused
Co-authored-by: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb
Summary: In `codecache.py` pass the device_interface directly to `_worker_compile()` instead of calling `get_device_interface()` from inside the function.
If the device_interface is registered by an out-of-tree module then it will only be registered inside the main process and not inside the worker process. This fixes this issue. Happy to add a test if required.
Test plan:
No tests added
Co-authored-by: brothergomez <brothergomez@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122492
Approved by: https://github.com/ezyang
Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts.
Test Plan:
- New unit test
- All existing inductor tests will exercise fresh_inductor_cache()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661
Approved by: https://github.com/oulgen
Python 3.12 changed a few things with how `_PyInterpreterFrame`s are allocated and freed:
- Frames are now required to be placed on the Python frame stack. In 3.11, we could allocate frames anywhere in memory. In 3.12, we now need to use `THP_PyThreadState_BumpFramePointerSlow`/`push_chunk`/`allocate_chunk`. This method of allocating/freeing frames is also compatible with 3.11.
- The eval frame function is now responsible for clearing the frame (see https://docs.python.org/3/whatsnew/changelog.html#id128, the point about "...which now clear the frame.")
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122146
Approved by: https://github.com/jansel
This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 but focusing on enabling it on MacOS, but quickly turned into landing very limited platform-specific acceleration at this time (I.e. this PR does not add any NEON accelerated code at all, just enables vectorized compilation for the existing abstractions)
Enabling the test harness, uncovered number of latent issues in CPU inductor that were fixed in the following PRS:
- https://github.com/pytorch/pytorch/pull/122511
- https://github.com/pytorch/pytorch/pull/122513
- https://github.com/pytorch/pytorch/pull/122580
- https://github.com/pytorch/pytorch/pull/122608
Following was added/changed to enable vectorization code to work on MacOS
- Added VecNEON class to `_inductor/codecache.py` that is supported on all AppleSilicon Macs
- Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types
- Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details)
See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro:
| dtype | Eager | Compile (before) | Compile (after) |
| ------ | ------ | --------- | --------- |
| bfloat16 | 120 tokens/sec | 130 tokens/sec | 156 tokens/sec |
| float32 | 158 tokens/sec | 140 tokens/sec | 236 tokens/sec |
| float16 | 235 tokens/sec | 81 tokens/sec | 58 tokens/sec |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217
Approved by: https://github.com/jansel
Summary:
We're getting errors that Python.h is not found because we didn't have
the proper include path set up for it.
bypass-github-export-checks
Test Plan: I can only get this to show up in Bento: N5106134
Reviewed By: hl475, chenyang78
Differential Revision: D55133110
Co-authored-by: Bert Maher <bertrand@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122363
Approved by: https://github.com/bertmaher
Our prior approach to epilogue fusion was to select from a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in following ways:
- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.
In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice we finalize the MultiTemplateBuffer as a specific template. If no fusion occurs we'll finalize the MultiTemplateBuffer after fusion.
Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time.
Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996