Did some easy fixes from enabling TRY200 (re-raising inside an `except` block without specifying the exception cause). Most of these seem like oversights rather than intentional. The proper way to silence intentional ones is `raise ... from None`, which records that you thought about whether the exception should carry its cause and decided against it.
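A minimal sketch of both resolutions (the function names here are just illustrative):
```
def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        # The common fix: explicitly chain the original cause.
        raise RuntimeError(f"could not load config from {path}") from exc

def parse_port(value):
    try:
        return int(value)
    except ValueError:
        # Intentional suppression: `from None` records that the cause was
        # considered and deliberately omitted.
        raise ValueError(f"invalid port: {value!r}") from None
```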
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
Summary:
- Remove onnx bench related scripts and `_onnx` folder.
- Update `common.py` to include onnx related patches previously under `_onnx` folder.
- Update `merge_rules.json` to include bench files.
- Add a quick sanity onnx bench test to onnx CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103983
Approved by: https://github.com/kit1980
Fix https://github.com/pytorch/pytorch/issues/109736.
The HF pin move causes an accuracy-check regression for HF models on the dashboard. Manually reverting the HF PR ( https://github.com/huggingface/transformers/pull/24696/files ) could recover, but that may hide a real issue. I happened to find that using a warm matmul max-autotune cache can work around the issue. Put another way:
- making all calls to check_cache miss the cache reproduces the issue
- making all calls to check_cache hit the cache works around the issue
I did a sort of 'bisect', halving the number of cache misses each time while still making sure we could repro. Luckily, reducing to a single cache miss still reproduced the issue. With more debugging, it turned out that the call to `torch.randn` on a cuda device was causing the problem.
The fix is to make sure we restore the rng state when we generate random inputs for max-autotune benchmarking.
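A minimal sketch of the idea, with a hypothetical helper name (the actual change lives in inductor's max-autotune benchmarking path):
```
import torch

def gen_autotune_inputs(shapes, device="cuda"):
    # Snapshot the device RNG state before creating random benchmarking
    # inputs, then restore it so the rest of the program sees an unchanged
    # RNG stream.
    rng_state = torch.cuda.get_rng_state(device)
    try:
        return [torch.randn(shape, device=device) for shape in shapes]
    finally:
        torch.cuda.set_rng_state(rng_state, device)
```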
TBH, I cannot fully explain the root cause, although I know it's caused by the rng state change. AOTAutograd already has some logic to preserve rng state, and I cannot repro the issue in unit tests. I have a few guesses as to why the RNG state is not restored in the first place after we generate random inputs for max-autotune:
- maybe AOTAutograd misses some corner case to preserve the rng state
- maybe for the failing models, there are some eager fallbacks that are not handled by inductor, and if those fallbacks call random-number-related APIs, we see the issue. But again, I don't have a good way to simulate this.
Repro:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 CUDA_VISIBLE_DEVICES=3 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only PLBartForCausalLM --training --cold-start-latency
```
We always repro the issue without the PR but pass the accuracy check with the PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109828
Approved by: https://github.com/eellison
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to fill in with the results. This gives several benefits:
* It makes sure AOTInductor manages output tensors the same way as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases where we still have to codegen extra copy_ ops to fill the pre-allocated output tensors, which hurts performance.
* With the upcoming enhanced memory planning, this also makes sure the memory planning logic is the same between AOTInductor and Inductor, which will greatly simplify the problem and improve reliability.
This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.
Differential Revision: D49502318
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
`python benchmarks/dynamo/torchbench.py --multiprocess` currently fails due to initializing distributed multiple times:
```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:6789 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:6789
(errno: 98 - Address already in use).
```
Because torchbench calls itself via mp.spawn, there are a parent run (with `--multiprocess`) and child runs (with `--multiprocess --only <model>`).
This PR addresses this by fixing two issues:
1) distributed was being initialized once in the parent run and again in the child runs; it should be initialized only in the child runs, where we have accurate rank and world size info (see the sketch after this list)
2) torchbench sometimes overrides CUDA_VISIBLE_DEVICES/world_size, but it shouldn't for distributed use cases where we want to use all available GPUs
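A minimal sketch of the first fix, using hypothetical argument names (assumes MASTER_ADDR/MASTER_PORT are already set for the default env:// rendezvous):
```
from typing import Optional

import torch.distributed as dist

def maybe_init_distributed(only_model: Optional[str], rank: int, world_size: int) -> None:
    # Parent run (no --only): it only spawns children, so skip process-group setup.
    if only_model is None:
        return
    # Child run (--only <model>): rank and world size are accurate here.
    if not dist.is_initialized():
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
```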
I am also adding a CI test to cover this type of issue in #109311
### Test plan
parent run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess`
child run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess --only simple_gpt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109657
Approved by: https://github.com/H-Huang
Previously, the code for passing inputs to exported program was:
```
if kwargs:
    return (args, kwargs)
else:
    return args
```
However, this causes an inconsistency: if the original input contains both args and kwargs, the treespec is a tuple containing a tuple of arguments and a dictionary of keyword arguments, but if the original input contains only args, the treespec is just a tuple of arguments. This inconsistency causes some inconvenience at runtime.
So I updated the code to always keep the kwargs around.
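A sketch of the updated behavior (the helper name is just illustrative):
```
def combine_inputs(args, kwargs):
    # Always return an (args, kwargs) pair so the treespec has a consistent
    # shape whether or not keyword arguments were passed.
    return (args, kwargs or {})
```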
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109160
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify the usage of the runtime. The interface scrutiny will come in later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
**This PR is a 99% copy-paste of Sam Gross's** (@colesbury) work at https://github.com/pytorch/pytorch/pull/100642. Copied from there:
--------
The NN_MODULE guard now subsumes guards on Module attributes. The check_fn will fail if module attributes are changed (such as Module.training), if parameters, submodules, or buffers are added or removed, or if fields are changed on the type itself.
This trades specificity in the guard check (if any field is changed, the check_fn fails) for faster overall checks.
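For illustration, a small example of the resulting behavior (any attribute mutation, such as `eval()`, fails the coarser check and forces a recompile):
```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1 if self.training else x - 1

m = M()
opt_m = torch.compile(m)
opt_m(torch.randn(2))  # first call compiles and installs the NN_MODULE guard
m.eval()               # mutates Module.training, so the check_fn now fails
opt_m(torch.randn(2))  # guard failure triggers a recompilation
```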
-----
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108528
Approved by: https://github.com/ezyang
Summary: Move AOTInductor runtime header files into their own subdirectory, to separate them from the to-be-added libtorch C interface.
Reviewed By: frank-wei
Differential Revision: D48905038
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108564
Approved by: https://github.com/frank-wei
Summary:
Include the constants into AOTInductor .so file.
We do not modify existing API signatures; instead we create the necessary format with the weights lifted out.
Test Plan:
test/inductor/test_aot_inductor.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
Generate diagnostic reports to monitor the internal stages of the export process. This tool aids in unblocking model exports and debugging the exporter.
#### Settings
~~1. Choose if you want to produce a .sarif file and specify its location.~~
1. Updated: saving the .sarif file should be done via `export_output.save_sarif_log(dst)`, similar to saving the exported onnx model with `export_output.save(model_dst)` (see the usage sketch after this list).
2. Customize diagnostic options:
- Set the desired verbosity for diagnostics.
- Treat warnings as errors.
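A hypothetical usage sketch based on the names above (assumes `torch.onnx.dynamo_export` as the entry point; exact option and method names may differ):
```
import torch

model = torch.nn.Linear(4, 2)
example_input = torch.randn(1, 4)

export_output = torch.onnx.dynamo_export(model, example_input)
export_output.save("model.onnx")              # the exported onnx model
export_output.save_sarif_log("report.sarif")  # the diagnostic report described above
```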
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106741
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/malfet
RFC: https://github.com/pytorch/rfcs/pull/54
The first commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/
We have already been using this in core for the last few months as an external dependency. This PR pulls all of it into core.
In the next commits, I do a number of things in this order:
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number of dynamo tests that were passing before (they were not testing anything, I think) and are not passing now.
Missing from this PR (but not blocking):
- Have a flag that deactivates tracing NumPy functions and simply breaks. There used to be one, but it stopped working after the merge and I removed it. @lezcano to investigate.
- https://github.com/pytorch/pytorch/pull/106431#issuecomment-1667079543. @voznesenskym to submit a fix after we merge.
All the tests in `tests/torch_np` take about 75s to run.
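For context, a minimal example of the kind of NumPy program this interop layer lets `torch.compile` trace (illustrative only; the exact supported surface is defined by the tests):
```
import numpy as np
import torch

@torch.compile
def normalize(x):
    # NumPy calls inside the compiled region are traced through the
    # upstreamed compatibility layer instead of falling back to eager.
    return (x - np.mean(x)) / np.std(x)

print(normalize(np.arange(8.0)))
```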
This was work by @ev-br, @rgommers, @honno, and me. I did not create this PR via ghstack (which would have been convenient) because this is a collaboration, and ghstack doesn't allow for shared contributions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106211
Approved by: https://github.com/ezyang
https://github.com/pytorch/pytorch/issues/105555
The existing flow first exports and then calls torch._inductor.aot_compile. However, export calls aot_autograd with the core aten decomposition table, and then torch._inductor.aot_compile calls aot_autograd again with the inductor decomposition table. The second call to aot_autograd is reportedly causing some problems and seems excessive, so instead we create a new function, torch._export.aot_compile, which exports using the inductor decomposition table and passes the result to inductor's compile_fx_aot; because the program has already been exported, this avoids calling aot_autograd again.
```
def aot_compile(
f: Callable,
args: Tuple[Any],
kwargs: Optional[Dict[str, Any]] = None,
constraints: Optional[List[Constraint]] = None,
) -> Tuple[str, ExportedProgram]:
```
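A hypothetical usage sketch based on the signature above (module path, device, and return handling are assumptions from this description, not a confirmed API):
```
import torch
from torch._export import aot_compile  # path as described above

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

# Per the Tuple[str, ExportedProgram] return type above: a path to the
# compiled .so plus the exported program. Assumes a CUDA device is available.
so_path, exported_program = aot_compile(M(), (torch.randn(8, device="cuda"),))
```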
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105977
Approved by: https://github.com/desertfire, https://github.com/zhxchen17, https://github.com/eellison
Includes stable diffusion, whisper, llama7b and clip
To get this to work I had to pass the HF auth token to all CI jobs; GitHub does not pass secrets from parent to child workflows automatically. There's a chance HF will rate limit us; if that happens, please revert this PR and I'll work on adding a cache next. cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet
Something upstream also changed in torchbench: `hf_Bert` and `hf_Bert_large` are now both failing with what looks like a dynamic-shape error that I'm not sure how to debug yet, so (it feels a bit gross, but) I added a skip for now since others are building on top of this work @ezyang
`llamav2_7b_16h` cannot pass the accuracy checks because it OOMs when deepcloning the extra inputs; this seems to mean it does not need to show up in the expected-numbers csv. Will figure this out when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
### Description
As an alternative to PR #105774, which provides a standalone, end-to-end minification script that covers all types of failures and has more functionality, this PR adds the ability to minify models when they fail the eval loop (accuracy checks). Both this PR and the other one can be merged without issue.
### Purpose
The goal is to leverage the minifier to minify models that fail accuracy checks, allowing failed models to be debugged more easily. The ideal use case is running a model suite on a backend whose operator coverage is unknown or limited. If a model compiles but fails the eval loop, having a repro script for each model is valuable for any developer trying to fix the issue.
### Functionality
- Add a minify flag that minifies models when they fail the accuracy check
- Produce a minified graph for each model and save it into a repro script
- Move the repro script to the output directory/base Dynamo directory
- Enable running an entire model suite (Hugging Face, timm, and TorchBench) by prepending the model name to the repro script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106201
Approved by: https://github.com/ezyang