Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify usage of the runtime; scrutiny of the interface itself will come in later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
**This PR is a 99% copy-paste of Sam Gross's** (@colesbury) work at https://github.com/pytorch/pytorch/pull/100642. Copied from there:
--------
The NN_MODULE guard now subsumes guards on Module attributes. The check_fn will fail if module attributes are changed (such as Module.training), if parameters, submodules, or buffers are added or removed, or if fields are changed on the type itself.
This gives up specificity in the guard check -- if any field is changed the check_fn fails -- for faster overall checks.
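A minimal sketch of the behavior this implies (the model and calls below are illustrative, not from the PR): any mutation of the module fails the coarse guard and triggers a recompile on the next call.
```
# Illustrative only: shows the effect of the coarser NN_MODULE guard.
# Mutating the module (an attribute like .training, or adding/removing a
# parameter, buffer, or submodule) fails check_fn and forces a recompile.
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

m = M()
compiled = torch.compile(m)
compiled(torch.randn(2, 4))   # first call: compile and install the NN_MODULE guard

m.training = False            # mutate a module attribute
compiled(torch.randn(2, 4))   # guard check fails, so this call recompiles
```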
-----
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108528
Approved by: https://github.com/ezyang
Move detectron2_fcos_r_50_fpn to amp. The minifier showed the following snippet as causing the divergence, where inductor has better numerics than eager:
```
import torch

def foo(x):
    return x > .2

inp = torch.tensor([.2002], device="cuda", dtype=torch.bfloat16)
print(foo(inp))
print(torch.compile(foo)(inp))
```
doctr_reco_predictor had very minimal divergence (.002 vs .001 required), bumping tolerance here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108598
Approved by: https://github.com/shunting314
Summary: Move AOTInductor runtime header files into their own subdirectory, to separate them from the to-be-added libtorch C interface.
Reviewed By: frank-wei
Differential Revision: D48905038
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108564
Approved by: https://github.com/frank-wei
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985
### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao.
### Changes Made
The majority of the changes in this pull request involve:
- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.
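For context, a small usage sketch (not part of this PR's diff): `torch.nn.functional.scaled_dot_product_attention`, with the flash backend selected via `torch.backends.cuda.sdp_kernel`, is the public path that ends up dispatching to these kernels, assuming an sm80+ GPU, fp16/bf16 inputs, and a supported head dim.
```
# Usage sketch only; requires an sm80+ CUDA GPU and fp16/bf16 inputs.
import torch
import torch.nn.functional as F

q, k, v = (
    torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
    for _ in range(3)
)

# Restrict dispatch to the flash backend so the FlashAttention kernels are used.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```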
### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the changed files are in the flash_attn/ folder. The only files of interest here, IMO, are:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py (this has been incorporated upstream into the flash-attention GitHub repo)
There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.
### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108
### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update tests to exercise the above codepath
- [x] Still need to disable backward when seq_len % 128 != 0 (Tri beat me to it: a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes
### Results
#### Forward only
The TFLOPs reported here are on an A100 that is underclocked.

#### Forward+Backward
Ran a sweep, and for large compute-bound sizes we see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
Fixes inference accuracy for `doctr_reco_predictor` and `pyhpc_turbulent_kinetic_energy`.
For the `same(float, float)` comparison we weren't going through the more rigorous tensor comparison path, which takes into account the fp64 base results.
Also, we now return True when the fp64 base results are not well formed (NaN).
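A hedged sketch of the idea (names and tolerances below are illustrative, not the benchmark's actual `same` implementation):
```
# Illustrative sketch: scalar floats are routed through the same rigorous
# comparison logic as tensors, using the fp64 baseline; a NaN fp64 baseline
# means there is no usable reference, so the check passes.
import math

def same_float(actual: float, expected: float, fp64_ref: float, tol: float = 1e-3) -> bool:
    if math.isnan(fp64_ref):
        return True  # fp64 base result is not well formed; accept
    # Compare against the fp64 baseline rather than doing a bare float compare
    # against the (possibly low-precision) eager result.
    return math.isclose(actual, fp64_ref, rel_tol=tol, abs_tol=tol)
```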
I debugged these models and the sources of divergence were innocuous:
`doctr_reco_predictor` - can be fixed by turning off layout optimization or the batch norm decomposition
`pyhpc_turbulent_kinetic_energy` - divergence caused by the fused kernel keeping precision in fp32 instead of casting back and forth between fp32 and bf16. The fused kernel has better precision anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108202
Approved by: https://github.com/jansel
Summary:
Include the constants into AOTInductor .so file.
We do not modify existing API signatures; instead we create the necessary format with the weights lifted out.
Test Plan:
test/inductor/test_aot_inductor.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.
I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)
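For reference, the kind of pattern RUF017 flags (my own illustrative example, not code from this PR): building a flat list with `sum()` is accidentally quadratic, while reducing with in-place extension stays linear.
```
import functools
import operator

lists = [[1, 2], [3, 4], [5, 6]]

# Quadratic: sum() allocates a brand-new list at every step.
flat_slow = sum(lists, [])

# Linear: extend a single accumulator in place (itertools.chain also works).
flat_fast = functools.reduce(operator.iadd, lists, [])
```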
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
Generate diagnostic reports to monitor the internal stages of the export process. This tool aids in unblocking model exports and debugging the exporter.
#### Settings
~~1. Choose if you want to produce a .sarif file and specify its location.~~
1. Updated: saving the .sarif file should be done via `export_output.save_sarif_log(dst)`, similar to saving the exported ONNX model via `export_output.save(model_dst)` (see the sketch below).
2. Customize diagnostic options:
- Set the desired verbosity for diagnostics.
- Treat warnings as errors.
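A minimal usage sketch; only `export_output.save(...)` and `export_output.save_sarif_log(...)` are taken from this description, while the `torch.onnx.dynamo_export` call and the toy model are assumptions about the surrounding exporter API.
```
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

export_output = torch.onnx.dynamo_export(TinyModel(), torch.randn(2, 3))
export_output.save("model.onnx")                    # exported ONNX model
export_output.save_sarif_log("diagnostics.sarif")   # SARIF diagnostic report
```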
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106741
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/malfet
RFC: https://github.com/pytorch/rfcs/pull/54
First commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/
We have already been using this in core for the last few months as an external dependency. This PR pulls all of it into core.
In the next commits, I do a number of things, in this order:
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number of dynamo tests that were passing before (they were not testing anything, I think) and are not passing now.
Missing from this PR (but not blocking):
- Have a flag that deactivates tracing NumPy functions and simply breaks. There used to be one but after the merge stopped working and I removed it. @lezcano to investigate.
- https://github.com/pytorch/pytorch/pull/106431#issuecomment-1667079543. @voznesenskym to submit a fix after we merge.
All the tests in `tests/torch_np` take about 75s to run.
This was work by @ev-br, @rgommers, @honno, and me. I did not create this PR via ghstack (which would have been convenient), as this is a collaboration and ghstack doesn't allow for shared contributions.
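As a small illustration (mine, not from the PR) of what this enables once upstreamed: a function written against the NumPy API can be traced by `torch.compile` and executed through the `torch_np`-derived implementation.
```
import numpy as np
import torch

def numpy_fn(x, y):
    # Plain NumPy code; under torch.compile it is traced via the upstreamed layer.
    return np.sum(x * y, axis=0)

compiled_fn = torch.compile(numpy_fn)
out = compiled_fn(np.ones((3, 4)), np.ones((3, 4)))
print(type(out), out)  # still a NumPy array from the caller's point of view
```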
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106211
Approved by: https://github.com/ezyang
https://github.com/pytorch/pytorch/issues/105555
The existing flow first exports and then calls torch._inductor.aot_compile. However, export calls aot_autograd with the core ATen decomposition table, and then torch._inductor.aot_compile calls aot_autograd again with the inductor decomposition table. The second call to aot_autograd is supposedly causing some problems and seems excessive, so instead we will create a new function, torch._export.aot_compile, which exports using the inductor decomposition table, passes the result to inductor's compile_fx_aot, and, because the program has already been exported, avoids calling aot_autograd again.
```
def aot_compile(
    f: Callable,
    args: Tuple[Any],
    kwargs: Optional[Dict[str, Any]] = None,
    constraints: Optional[List[Constraint]] = None,
) -> Tuple[str, ExportedProgram]:
```
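A hypothetical usage sketch that follows the signature above; the exact module path (`torch._export.aot_compile`) and the toy function and inputs are assumptions for illustration.
```
import torch

def f(x, y):
    return x + y

so_path, exported_program = torch._export.aot_compile(
    f,
    args=(torch.randn(8, device="cuda"), torch.randn(8, device="cuda")),
)
# so_path: path to the .so produced via inductor's compile_fx_aot
# exported_program: the ExportedProgram captured with the inductor decomposition table
```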
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105977
Approved by: https://github.com/desertfire, https://github.com/zhxchen17, https://github.com/eellison
Includes stable diffusion, whisper, llama7b and clip
To get this to work I had to pass the HF auth token to all CI jobs; GitHub does not pass secrets from parent to child automatically. There's a likelihood HF will rate limit us; in that case please revert this PR and I'll work on adding a cache next - cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet
Something also changed upstream in torchbench: `hf_Bert` and `hf_Bert_large` are both failing with a dynamic-shape-looking error which I'm not sure how to debug yet, so for now (it felt a bit gross, but) I added a skip, since others are building on top of this work @ezyang
`llamav2_7b_16h` cannot pass accuracy checks because it OOMs when deep-cloning the extra inputs; this seems to mean it does not need to show up in the expected-numbers CSV. We'll figure this out when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
### Description
As an alternative to PR #105774, which provides a standalone, end-to-end minification script that covers all types of failures and has more functionality, this PR adds the ability to minify models when they fail the eval loop (accuracy checks). Both this PR and the other one can be merged without issue.
### Purpose
The goal is to leverage the minifier to minify models that fail accuracy checks, allowing failed models to be debugged more easily. The ideal use-case is trying to run a model suite on a backend where operator coverage is not known or is limited. If a model compiles but fails the eval loop, having a repro script for it is valuable for any developer trying to fix the issue.
### Functionality
- Create minify flag that minifies models when they fail accuracy check
- Produce minified graph for each model, and save it into repro script
- Move repro script to output directory/base Dynamo directory
- Enable functionality for running an entire model suite (Hugging Face, timm, and TorchBench) by prepending model name to repro script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106201
Approved by: https://github.com/ezyang