Commit Graph

506 Commits

Author SHA1 Message Date
angelayi
a565f1bee6 [aotinductor] Skip benchmarks with control flow (#109661)
Since AOTInductor doesn't support control flow yet, we will skip over tests that are currently failing due to containing control flow in the code. Logs taken from https://hud.pytorch.org/benchmark/compilers?startTime=Tue%2C%2012%20Sep%202023%2022%3A56%3A40%20GMT&stopTime=Tue%2C%2019%20Sep%202023%2022%3A56%3A40%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=main&lCommit=2c1554a0323107d821be3ff13df7833b9f0b960d&rBranch=main&rCommit=47be61e12bd51df27182343d312dc3df485d5559

Errors documented in https://github.com/pytorch/pytorch/issues/105217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109661
Approved by: https://github.com/desertfire
2023-09-25 18:49:06 +00:00
PyTorch MergeBot
d9627c4264 Revert "[inductor] fix a max-autotune rng state related bug (#109828)"
This reverts commit 3663436db3.

Reverted https://github.com/pytorch/pytorch/pull/109828 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the rocm failure looks legit. There is also another numpy import error when running dynamo test on CPU ([comment](https://github.com/pytorch/pytorch/pull/109828#issuecomment-1732423883))
2023-09-23 22:35:37 +00:00
Shunting Zhang
3663436db3 [inductor] fix a max-autotune rng state related bug (#109828)
Fix https://github.com/pytorch/pytorch/issues/109736 .

HF pin move causes regression on accuracy check for HF models on the dashboard. Manually reverting the HF PR ( https://github.com/huggingface/transformers/pull/24696/files ) could recover, but this may hide some real issue. I happen to found that using a warm matmul max-autotune cache can work around the issue. Or putting it in another way:
- make all calls to check_cache cache miss repro the issue
- make all cals to check_cache cache hit works around the issue

I did some sort of 'bisect' to force halving the amount of cache miss each time while still make sure we can repro. Luckily reducing to a single cache miss still repro the issue. With more debugging, it turns out that it's the call to `torch.randn` on cuda device causing the problem.

The fix is to make sure  we restore the rng state when we generate random inputs for max-autotune benchmarking.

TBH, I can not fully explain the root cause although I know it's caused by rng state change.  AOTAutograd already has some logic to preserve rng state. And I can not repro the issue in unit tests. I have a few guess why the RNG state is not restored in the first place after we generate random inputs for max-autotune:
- maybe AOTAutograd misses some corner case to preserve the rng state
- maybe for the failed models, there are some eager fallback that's not handled by inductor. And if those fallback calles random number related APIs, we will see the issue. But again I don't find a good way to simulate this.

Repro:

```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 CUDA_VISIBLE_DEVICES=3 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only PLBartForCausalLM --training --cold-start-latency
```

We always repro the issue without the PR but pass the accuracy check with the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109828
Approved by: https://github.com/eellison
2023-09-23 00:58:10 +00:00
Bin Bao
8856c1628e [inductor] Change AOTInductor to return output tensors (#109790)
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits:

* It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases we still have to codegen extra copy_ ops to fill the pre-allocated output tensors which doesn't make sense for performance.
* With the coming enhanced memory planning, this again will make sure the memory planning logic is the between AOTInductor and Inductor, which will greatly simplify the problem and improve the reliability.

This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.

Differential Revision: D49502318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
2023-09-22 02:31:52 +00:00
Angela Yi
f7ddc54503 [aotinductor] Update performance benchmark code (109560) (#109820)
Summary: Same as #109560, made a new PR because we need to land from internal

Previously during performance benchmark testing, we would create an AOTInductorModelContainerHandle every time the compiled function is run with new inputs. However after https://github.com/pytorch/pytorch/pull/108473 we now load the constants needed in the runtime when initializing the AOTInductorModelContainerHandle. This resulted in our benchmarks displaying a ~0.4x speedup.

This diff moves the initialization of AOTInductorModelContainerHandle outside of the code where we run the compiled function with different inputs.

For example,
```
python benchmarks/dynamo/huggingface.py --performance --cold-start-latency --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda --total-partitions 3 --partition-id 0 --only AlbertForMaskedLM
```
results in `1.359x` speedup.

Specifically, this adds a `create_container_handle` and `delete_container_handle` function which need to called before `run`. We call `create_container_handle` to initialize the AOTInductorModelContainerHandle, call `run` to run the compiled .so with different inputs, and then `delete_container_handle` to delete it.

[Updated dashboard results](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2013%20Sep%202023%2021%3A03%3A55%20GMT&stopTime=Wed%2C%2020%20Sep%202023%2021%3A03%3A55%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/aot_inductor_benchmark&lCommit=f9aa49c4c9a1a140b6f0c4520d1d6d99b57e12fa&rBranch=main&rCommit=015be4cedba357eb931e24bf188479235db7c5c8)

Test Plan: CI

Differential Revision: D49513934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109820
Approved by: https://github.com/desertfire
2023-09-21 20:49:41 +00:00
Simon Fan
ef8d461b09 Fix torchbench --multiprocess (#109657)
`python benchmarks/dynamo/torchbench.py --multiprocess` currently fails due to initializing distributed multiple times:

```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:6789 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:6789
 (errno: 98 - Address already in use).
```

Because torchbench calls itself via mp.spawn, there is the parent run (with `--multiprocess`) and child runs (with `--multiprocess --only <model>`).

This PR addresses this by fixing two issues:
1) distributed is initialized once in parent run and once in child runs, it should be initialized only in child runs where we have accurate rank and world size info
2) torchbench overrides CUDA_VISIBLE_DEVICES/world_size sometimes, but it shouldn't for distributed use cases where we want to use all available gpus

I am also adding a CI test to cover this type of issue in #109311

### Test plan
parent run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess`
child run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess --only simple_gpt`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109657
Approved by: https://github.com/H-Huang
2023-09-21 16:53:07 +00:00
Mark Saroufim
0ec9f59f70 Loudly Error in dynamo bench if eager fails (#109536)
Helps debug https://github.com/pytorch/benchmark/issues/1901

I will wait until the ONNX beartype sev is fixed before merging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109536
Approved by: https://github.com/xuzhao9
2023-09-19 00:40:42 +00:00
angelayi
5b13f74e9b [export] Update how we input kwargs (#109160)
Previously, the code for passing inputs to exported program was:
```
if kwargs:
    return (args, kwargs)
else:
    return args
```

However, this causes some inconsistency where if the original input contains args and kwargs, the treespec would be a tuple containing a tuple of arguments, and a dictionary of keyword arguments. But if the original input only contained args, the treespec would just be a tuple of arguments. This inconsistency causes some inconveniences in the runtime.

So I updated the code to just always keep the kwargs around.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109160
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2023-09-19 00:04:32 +00:00
Animesh Jain
f786fbdebd Reland 3rd try [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#109323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109323
Approved by: https://github.com/huydhn, https://github.com/voznesenskym
2023-09-15 08:44:14 +00:00
Simon Fan
54c5f474a7 Forward rank and world size info to Torchbench models when using dynamo runner (#108438)
Adding support to pass rank and world_size to torchbench model, via its extra_args parameter: https://github.com/pytorch/benchmark/blob/main/torchbenchmark/util/model.py#L83C80-L83C90

This is used for models which distribute over multiple GPUs e.g. simple_gpt https://github.com/pytorch/benchmark/pull/1867

Also add an option to skip multiprocess only gpu models

Testing via `python benchmarks/dynamo/torchbench.py -d cuda --output=benchmark_logs/performance.csv --inference --performance --timing --print-memory --multiprocess --only simple_gpt`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108438
Approved by: https://github.com/Chillee
2023-09-14 21:01:20 +00:00
angelayi
c3945b5f84 Update HF version to commit hash (6c26faa) (#107400)
Some [errors](https://ossci-raw-job-status.s3.amazonaws.com/log/15968424899) in the [torchinductor hf benchmarks](https://hud.pytorch.org/benchmark/huggingface/inductor_aot_inductor?startTime=Thu,%2010%20Aug%202023%2018:05:47%20GMT&stopTime=Thu,%2017%20Aug%202023%2018:05:47%20GMT&granularity=hour&mode=inference&dtype=bfloat16&lBranch=main&lCommit=384e0d104fd077d31efafc564129660e9b7a0f25&rBranch=main&rCommit=03414081ff7ee011e17ee10f9ddb2584811bf965) should be fixed in the most recent release (for example, this [line](c036c814f4/src/transformers/models/opt/modeling_opt.py (L688)) no longer exists). Additionally, I landed a [commit (6c26faa)](6c26faa159) to the HF transformers repro to fix one of the graph breaks. This PR results in [76% pass rate for the export + aot inductor HF benchmark!](https://hud.pytorch.org/benchmark/compilers?startTime=Thu%2C%2010%20Aug%202023%2022%3A45%3A09%20GMT&stopTime=Thu%2C%2017%20Aug%202023%2022%3A45%3A09%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/hf_version&lCommit=0accaaca2fa70ca2f78c1a587dd4b6750448dd90&rBranch=main&rCommit=03414081ff7ee011e17ee10f9ddb2584811bf965)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107400
Approved by: https://github.com/ezyang, https://github.com/desertfire, https://github.com/malfet
2023-09-12 15:25:28 +00:00
PyTorch MergeBot
56c2386157 Revert "reland [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108883)"
This reverts commit d4230e5574.

Reverted https://github.com/pytorch/pytorch/pull/108883 on behalf of https://github.com/huydhn due to Per the discussion thread on D49122208, reverting this change ([comment](https://github.com/pytorch/pytorch/pull/108883#issuecomment-1712707853))
2023-09-10 04:40:02 +00:00
Animesh Jain
d4230e5574 reland [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108883
Approved by: https://github.com/voznesenskym, https://github.com/huydhn
2023-09-09 03:12:31 +00:00
Bin Bao
e91f66471c [reland][inductor] Switch to use the runtime interface for AOTInductor testing (#108878)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/108663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108878
Approved by: https://github.com/muchulee8
2023-09-08 17:58:35 +00:00
PyTorch MergeBot
428f5f9e7e Revert "[inductor] Switch to use the runtime interface for AOTInductor testing (#108663)"
This reverts commit 366ce589d0.

Reverted https://github.com/pytorch/pytorch/pull/108663 on behalf of https://github.com/Chillee due to Sorry :'( Need to revert to resolve merge conflict for another revert ([comment](https://github.com/pytorch/pytorch/pull/108663#issuecomment-1711076411))
2023-09-08 05:01:27 +00:00
PyTorch MergeBot
72f24d0001 Revert "[dynamo][finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108528)"
This reverts commit 34bb74c4cf.

Reverted https://github.com/pytorch/pytorch/pull/108528 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it has some nasty merge conflicts after the revert of D48910794. I need to revert this so the conflict could be resolved. Please help rebase this tomorrow and reland the change ([comment](https://github.com/pytorch/pytorch/pull/108528#issuecomment-1711034781))
2023-09-08 03:49:41 +00:00
Bin Bao
366ce589d0 [inductor] Switch to use the runtime interface for AOTInductor testing (#108663)
Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify the usage of the runtime. The interface scrutiny will come in later PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
2023-09-07 23:38:11 +00:00
Animesh Jain
34bb74c4cf [dynamo][finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108528)
**This PR is a 99% copy paste of Sam Gross** (@colesbury) work at https://github.com/pytorch/pytorch/pull/100642. Copied from there

--------
The NN_MODULE guard now subsumes guards on Module attributes. The check_fn will fail if the module attributes are changed (such as Module.training), parameters, submodules, and buffers are added or removed, and if fields are changed on the type itself.

This gives up specificity in the guard check -- if any field is changed the check_fn fails -- for faster overall checks.

-----

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108528
Approved by: https://github.com/ezyang
2023-09-07 01:45:47 +00:00
JackCaoG
e73ec92ad2 Minor fixs to make torchbench runable on torch/xla (#107919)
`import torch_xla.core.xla_model as xm` no longer trigger the xla runtime to init, hence explictly create the device here. This is a workaround for https://github.com/pytorch/xla/issues/4174.

`is_correct` reference has been deleted, I think it is a deadcode.

After this patch, I am able to run

```
python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=openxla --only resnet50
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107919
Approved by: https://github.com/shunting314, https://github.com/wconstab
2023-09-06 22:35:53 +00:00
Bin Bao
60bd30ee0b [inductor] Move AOTInductor runtime headers (#108564)
Summary: Move AOTInductor runtime header files into its own subdirectory, to separate them from to-be-added libtorch C interface.

Reviewed By: frank-wei

Differential Revision: D48905038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108564
Approved by: https://github.com/frank-wei
2023-09-06 11:50:41 +00:00
Bin Bao
28c5b62210 [inductor] Use empty_strided to create output tensors when testing AOTInductor (#108364)
Summary: This will fix 3 fail_accuracy failures in HF.

Test Plan:
```
python benchmarks/dynamo/huggingface.py --bfloat16 --accuracy --inference --device cuda --export-aot-inductor --only  T5Small
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108364
Approved by: https://github.com/angelayi
ghstack dependencies: #108412
2023-09-06 02:04:32 +00:00
Yanbo Liang
ff28b4b908 Fix dynamo benchmark config --print-graph-breaks (#108584)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108584
Approved by: https://github.com/anijain2305
2023-09-05 23:31:43 +00:00
Mark Saroufim
5f5caed25a do not cast all inputs in benchmarks (#108456)
Fixes why stable diffusion is not showing up in inference dashboard even though it shows up in training dashboard

The reason is stable diffusion in torchbench has a line like `input_tensor = input_tensor.long().to(self.device)` and if you cast this to a bfloat16 the inference will fail

<img width="1705" alt="Screenshot 2023-09-01 at 4 37 49 PM" src="https://github.com/pytorch/pytorch/assets/3282513/ada0d381-1af0-4378-8e8b-2375b39c3713">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108456
Approved by: https://github.com/cpuhrsch
2023-09-02 03:13:17 +00:00
Bin Bao
06d74e6b24 Revert "[AOTInductor] Include constants in AOTInductor .so file. (#10… (#108349)
This reverts commit c3239442a3 due to internal test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108349
Approved by: https://github.com/aakhundov, https://github.com/zhxchen17
2023-08-31 16:26:02 +00:00
Mu-Chu Lee
c3239442a3 [AOTInductor] Include constants in AOTInductor .so file. (#107718)
Summary:
Include the constants into AOTInductor .so file.
We do not modify existing API signatures but create necessary format with weight lifted out instead.

Test Plan:
test/inductor/test_aot_inductor.py

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718
Approved by: https://github.com/angelayi, https://github.com/eellison
2023-08-29 22:37:30 +00:00
PyTorch MergeBot
2f226804a0 Revert "Minor fixs to make torchbench runable on torch/xla (#107919)"
This reverts commit ed8f21282f.

Reverted https://github.com/pytorch/pytorch/pull/107919 on behalf of https://github.com/izaitsevfb due to Conflicts with the revert of 106914 ([comment](https://github.com/pytorch/pytorch/pull/107919#issuecomment-1696662453))
2023-08-29 02:18:07 +00:00
JackCaoG
ed8f21282f Minor fixs to make torchbench runable on torch/xla (#107919)
`import torch_xla.core.xla_model as xm` no longer trigger the xla runtime to init, hence explictly create the device here. This is a workaround for https://github.com/pytorch/xla/issues/4174.

`is_correct` reference has been deleted, I think it is a deadcode.

After this patch, I am able to run

```
python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=openxla --only resnet50
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107919
Approved by: https://github.com/shunting314, https://github.com/wconstab
2023-08-26 03:34:54 +00:00
blzheng
1ea83f04d2 benchmark: convert output of fp64 to torch.float64 (#107375)
This PR adds converting the output of fp64 to torch.float64 before checking for accuracy.

Why we need this change?
For llama of torchbench, it converts output to float before returning it.
bad4e9ac19/torchbenchmark/models/llama/model.py (L241)

While in the correctness checker, it will not compare the res results with fp64_ref if the fp64_ref.dtype is not torch.float64. So llama fails the accuracy check in the low-precision case, even though res is closer to fp64_ref than ref.
e108f33299/torch/_dynamo/utils.py (L1025)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107375
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/jansel
2023-08-21 04:34:23 +00:00
Edward Z. Yang
5b9b816b17 WAR by avoid querying device before env mutation (#107301)
We should probably fix https://github.com/pytorch/pytorch/issues/107300
properly but this works around the problem

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107301
Approved by: https://github.com/bdhirsh, https://github.com/H-Huang, https://github.com/albanD
2023-08-17 00:31:16 +00:00
BowenBao
19a76290d8 [ONNX] Public diagnostic options for 'dynamo_export' (#106741)
Generate diagnostic reports to monitor the internal stages of the export process. This tool aids in unblocking model exports and debugging the exporter.

#### Settings

~~1. Choose if you want to produce a .sarif file and specify its location.~~
1. Updated: saving .sarif file should be done by `export_output.save_sarif_log(dst)`, similar to saving exported onnx model `export_output.save(model_dst)`.
2. Customize diagnostic options:
    - Set the desired verbosity for diagnostics.
    - Treat warnings as errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106741
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/malfet
2023-08-15 17:46:15 +00:00
Edward Z. Yang
5b04e9b6ce Install torchrec/fbgemm from source in CI (#106808)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106808
Approved by: https://github.com/malfet, https://github.com/xuzhao9
2023-08-12 02:08:44 +00:00
Howard Huang
656412f0cb Add multiprocess option to dynamo benchmarks (#106394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106394
Approved by: https://github.com/XilunWu
2023-08-11 18:34:09 +00:00
lezcano
a9dca53438 NumPy support in torch.compile (#106211)
RFC: https://github.com/pytorch/rfcs/pull/54
First commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/

We have already been using this in core for the last few months as a external dependency. This PR pulls all these into core.

In the next commits, I do a number of things in this order
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number dynamo tests that were passing before (they were not tasting anything I think) and are not passing now.

Missing from this PR (but not blocking):
- Have a flag that deactivates tracing NumPy functions and simply breaks. There used to be one but after the merge stopped working and I removed it. @lezcano to investigate.
- https://github.com/pytorch/pytorch/pull/106431#issuecomment-1667079543. @voznesenskym to submit a fix after we merge.

All the tests in `tests/torch_np` take about 75s to run.

This was a work by @ev-br, @rgommers @honno and I. I did not create this PR via ghstack (which would have been convenient) as this is a collaboration, and ghstack doesn't allow for shared contributions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106211
Approved by: https://github.com/ezyang
2023-08-11 00:39:32 +00:00
angelayi
5b13c779d4 [AOTInductor] Remove call to aot_autograd when receiving ExportedProgram (#105977)
https://github.com/pytorch/pytorch/issues/105555

Existing flow first exports and then calls torch._inductor.aot_compile. However, export calls aot_autograd with the core aten decomposition table, and then torch._inductor.aot_compile calls aot_autograd again with the inductor decomposition table. The 2nd calling of aot_autograd is supposedly causing some problems, and seems excessive, so instead we will create a new function, torch._export.aot_compiler which will export using the inductor decomposition table, pass it to inductor's compile_fx_aot, and because it has already been exported, avoid recalling aot_autograd.

```
def aot_compile(
    f: Callable,
    args: Tuple[Any],
    kwargs: Optional[Dict[str, Any]] = None,
    constraints: Optional[List[Constraint]] = None,
) -> Tuple[str, ExportedProgram]:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105977
Approved by: https://github.com/desertfire, https://github.com/zhxchen17, https://github.com/eellison
2023-08-04 15:35:23 +00:00
angelayi
b2d3a2f433 [inductor] Remove ReinterpretView copy_ for AOT Inductor outputs (#106564)
Running benchmark on HF models result in 71% pass rate now: P802905571
Updated [dashboard](https://hud.pytorch.org/benchmark/compilers?startTime=Fri%2C%2028%20Jul%202023%2005%3A02%3A20%20GMT&stopTime=Fri%2C%2004%20Aug%202023%2005%3A02%3A20%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/bench&lCommit=e35a655e59b2038c0395f972a1f567f862093d9c&rBranch=main&rCommit=3e5a52cedd2d586fc6cb40a73a098252b9edc2a1)

Originally, a lot of the HF export-aot-inductor tests are failing with the error message:
```
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

I looked at the result of one of the models, AlbertForMaskedLM, and the error is due to an additional [`copy_`](https://www.internalfb.com/phabricator/paste/view/P802043305?lines=1460%2C1426%2C1438%2C1451%2C1428) being inserted at the end. Looking at the [exported graph](https://www.internalfb.com/phabricator/paste/view/P802908243?lines=1124), `buf237` in the cpp program corresponds to the `view_269` node. During inductor lowering, this `view_269` node will result in a `ir.ReinterpretView` node, and when generating code for the outputs, this [line](https://fburl.com/code/epola0di) will add an additional `copy_`.

I'm unsure if removing this case will result in other errors, but it seems to raise the HF model benchmark pass rate :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106564
Approved by: https://github.com/jansel
2023-08-04 07:51:29 +00:00
Mark Saroufim
6268ab2c2d torchbench pin upd: hf auth token, clip, whisper, llamav2, sd (#106009)
Includes stable diffusion, whisper, llama7b and clip

To get this to work I had to Pass in hf auth token to all ci jobs, github does not pass in secrets from parent to child automatically. There's a likelihood HF will rate limit us in case please revert this PR and I'll work on adding a cache next - cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet

Something upstream changed in torchbench too where now `hf_Bert` and `hf_Bert_large` are both failing on some dynamic shape looking error which I'm not sure how to debug yet so for now felt a bit gross but added a skip since others are building on top this work @ezyang

`llamav2_7b_16h` cannot pass through accuracy checks cause it OOMs on deepcloning extra inputs this seems to make it not need to show up in expected numbers csv, will figure this when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
2023-08-03 16:28:40 +00:00
Ubuntu
77e369b363 Run minification for TorchDynamo benchmark models that fail evaluation (#106201)
### Description
As an alternative to PR #105774, which provides a standalone, end-to-end minification script that covers all types of failures and has more functionality, this PR adds the ability to minify models when they fail the eval loop (accuracy checks). Both this PR and the other one can be merged without issue.

### Purpose
The goal is to leverage the minifier to minify models that fail accuracy checks, allowing failed models to be debugged more easily. The ideal use-case is trying to run a model suite on a backend where operator coverage is not known or is limited. If models can compile but fails the eval loop, having the repro script for each model is valuable for any developer that's trying to fix the issue.

### Functionality
- Create minify flag that minifies models when they fail accuracy check
- Produce minified graph for each model, and save it into repro script
- Move repro script to output directory/base Dynamo directory
- Enable functionality for running an entire model suite (Hugging Face, timm, and TorchBench) by prepending model name to repro script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106201
Approved by: https://github.com/ezyang
2023-08-03 03:34:04 +00:00
angelayi
6339f57fae Update export/export-aot-inductor benchmark code (#106323)
Update export/export-aot-inductor benchmark code to use recent changes related to kwarg inputs and dataclass outputs.

Updated [dashboard](https://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2031%20Jul%202023%2017%3A28%3A05%20GMT&stopTime=Tue%2C%2001%20Aug%202023%2017%3A28%3A05%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/benchmark&lCommit=f0987867a88b0b9510fcaf33307150e61517e7a1&rBranch=main&rCommit=f23d755e1f835485b8fef5661e7f983b520d844e)

80% pass rate on HF for export: P801372961
20% pass rate on HF for export-aot-inductor: [link](https://hud.pytorch.org/benchmark/huggingface/inductor_aot_inductor?startTime=Mon,%2031%20Jul%202023%2017:08:02%20GMT&stopTime=Tue,%2001%20Aug%202023%2017:08:02%20GMT&granularity=hour&mode=inference&dtype=bfloat16&lBranch=angelayi/benchmark&lCommit=f0987867a88b0b9510fcaf33307150e61517e7a1&rBranch=main&rCommit=f23d755e1f835485b8fef5661e7f983b520d844e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106323
Approved by: https://github.com/desertfire
2023-08-02 20:18:37 +00:00
Edward Z. Yang
0b8fbfe9de automatic_dynamic_shapes is on by default (#106188)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106188
Approved by: https://github.com/albanD
2023-07-28 13:26:54 +00:00
Mark Saroufim
c759a57003 Skip deterministic mode for SAM (#105615)
SAM uses cumsum which doesnt have a deterministic mode enabled so this the onl way I can work around this https://github.com/pytorch/pytorch/issues/89492

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105615
Approved by: https://github.com/eellison, https://github.com/cpuhrsch
2023-07-21 01:52:08 +00:00
Elias Ellison
024d26208c Add Freezing Option to Benchmarking (#105616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105616
Approved by: https://github.com/desertfire
2023-07-20 22:50:51 +00:00
Michael Lazos
690ea933ca Enable more e2e foreach optimizer compilation tests (#105438)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105438
Approved by: https://github.com/jansel
2023-07-20 02:41:19 +00:00
Justin Chu
5ef023b05a [BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429
Approved by: https://github.com/malfet
2023-07-19 04:46:37 +00:00
Bin Bao
b10de43c0a Add aot_inductor as a test backend for benchmarking (#105221)
Summary:
Original PR at https://github.com/pytorch/pytorch/pull/104977. Landing from fbcode instead.

Add an aot_inductor backend (Export+AOTInductor) in the benchmarking harness. Note it is not a dynamo backend.

Moved files from torch/_inductor/aot_inductor_include to torch/csrc/inductor as a more standard way for exposing headers
Created a caching function in benchmarks/dynamo/common.py for compiling, loading and caching the .so file, as a proxy for a pure C++ deployment, but easier for benchmarking.

Differential Revision: D47452591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105221
Approved by: https://github.com/jansel
2023-07-18 13:16:36 +00:00
Edward Z. Yang
10cbc9a063 Enable cuda graphs for dynamic shapes (#105064)
The general idea is to do a separate CUDA graph for each size. Because of cuda graph trees, these graphs will all share the same memory pool, so your memory usage will only be the worst case memory usage of the biggest dynamic size you want. This requires an extra dispatch in the cudagraphified callable. You must pay for a CUDA graph recording for every dynamic size you encounter, but this is MUCH cheaper than running the entire PT2 compile stack, so I expect you to still see benefits.

This was surprisingly easy to do.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105064
Approved by: https://github.com/voznesenskym
2023-07-14 16:13:50 +00:00
Yukio Siraichi
6abe0b2ee8 Disable translation validation on performance runs. (#104887)
This PR disables translation validation (TV) when running the benchmark suits on
performance workflows: inductor with A100s.

In summary, the changes are:

- Add flag for turning TV on and off on _benchmarks/dynamo/common.py_
- Turn TV on only on CI accuracy builds
- Add `--no-translation-validation` target flag to _.ci/pytorch/test.sh_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104887
Approved by: https://github.com/ezyang
2023-07-11 17:30:40 +00:00
Yukio Siraichi
85cbe7e6fd Add timeout for translation validation instances. (#104654)
As of now, translation validation runs to its completion. However, Z3 is time
consuming. PR #104464, for example, disables translation validation for a few benchmarks.

Instead, this PR introduces a timeout for translation validation. In that case, Z3 will
return `unknown`, since it wasn't able to prove or disprove the assertions. Then, we log
it as a warning, but don't stop execution.

Here's a summary of the changes:

- Added an environment variable for turning translation validation on and off
- Added an environment variable for setting the translation validation timeout
- Possibly reverts the changes in #104464
- ~~Move from "QF_NRA" to "QF_NIRA" logic~~
    - ~~It makes more sense, given the nature of the problems~~
    - "QF_NRA" seems to solve more instances of _dynamo/test_dynamic_shapes.py_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104654
Approved by: https://github.com/ezyang
2023-07-08 19:19:00 +00:00
willfengg
202fb95c68 [benchmark][export] Add torch.export passrate for TB/TIMM benchmarks (#104382)
issues resolved: https://github.com/pytorch/pytorch/issues/104294

local test on TB and TIMM
* python benchmarks/dynamo/torchbench.py -d cuda --inference --accuracy --progress --export --print-dataframe-summary
* python benchmarks/dynamo/timm_models.py -d cuda --inference --accuracy --progress --export --print-dataframe-summary

why not HF
* huggingface use kwargs (dict) to torch.nn.module
* we will need to support kwargs in torch._export.export, which is in progress

local test result

timm 95% pass rate (58 ouf of 61 passed) P781702926
* 1 x [export specific]1 x ERROR:common:Mutating module attribute rel_indices during export
* 1 x[not relevant to export] Unknown model (SelecSls42b)
* 1 x [not relevant to export] Failed to load model: HTTP Error 409: Public access is not permitted on this storage account

torchbench 54% pass rate (41 out of 75 passed) P781690552
* 7 x ERROR:common:Dynamo input and output is a strict subset of traced input/output
* 3 x ERROR:common:call_method NNModuleVariable() / UserDefinedObjectVariable
* 3 x ERROR:common:Mutating module attribute {xx} during export.
* 2 x ERROR:common:inline in skipfiles
* 2 x ERROR:common:Consider annotating your code using constrain_as_*(). It appears that you're trying
* 1 x ERROR:common:guard on data-dependent symbolic int/float
* 1 x ERROR:common:Tensor.tolist
* 1 x ERROR:common:Tensor.numpy. Turn on config.numpy_ndarray_as_tensor and install torch_np to support tensor.numpy().  [may be dev * env?]
* 1 x ERROR:common:missing: BUILD_SET
* 1 x ERROR:common:whole graph export entails exactly one guard export
* 1 x ERROR:common:call_function BuiltinVariable(str) [GetAttrVariable(UserMethodVariable(<function
* 1 x ERROR:common:Dynamic slicing on data-dependent value is not supported
  * 1 x ERROR:common:Failed running call_function <function interpolate at 0x7f60a8361ea0>(*(FakeTensor(..., device='cuda:0', size=(1, 3, * 427,
* 1 x ERROR:common:Dynamo attempts to add additional input during export: value=0.6177528500556946, source=RandomValueSource(random_call_index=0)
* 1 x Found following user inputs located at [16, 17, 18, 19, 20, 21, 22] are mutated. This is currently banned in the aot_export workflow.
* 1 x RuntimeError: cumsum_cuda_kernel does not have a deterministic implementation
* 4 x pass_due_to_skip
* 1 x eager_2nd_run_OOM
* 1 x fail_accuracy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104382
Approved by: https://github.com/zhxchen17
2023-07-06 17:16:07 +00:00
Yukio Siraichi
0cee4e3c32 Turn translation validation off on timeouts. (#104464)
Follow-up to PR: #97964

After the introduction of translation validation, (TV) a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104464
Approved by: https://github.com/malfet
2023-07-05 19:01:50 +00:00
Yukio Siraichi
40b8d10d5e Re-land: Turn translation validation on for tests and accuracy runs by default. (#104467)
Re-landing: #103611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104467
Approved by: https://github.com/malfet
2023-07-05 19:01:50 +00:00
Edward Z. Yang
2385dad4b3 Enable automatic_dynamic_shapes by default (#103623)
Some notes:

* I now manually turn off `_generate` jobs from running with cudagraphs, as it is unrealistic to expect to cudagraph autoregressive generation up to max sequence length, this would imply compiling the entire unrolled sequence generation. Concretely, cm3leon_generate was timing out post this change, likely due to the compile time slowdown of dynamic shapes ON TOP OF accidentally unrolling all the loops
* A few torch._dynamo.reset tactically inserted to force recompiles on tests that expected it
* expectedFailureAutomaticDynamic flip into patching automatic_dynamic_shapes=False

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103623
Approved by: https://github.com/voznesenskym
2023-07-05 00:25:02 +00:00
PyTorch MergeBot
a2a8b4d415 Revert "Turn translation validation on for tests and accuracy runs by default. (#103611)"
This reverts commit e311bed2a8.

Reverted https://github.com/pytorch/pytorch/pull/103611 on behalf of https://github.com/malfet due to Broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/103611#issuecomment-1614850276))
2023-06-30 15:54:18 +00:00
Yukio Siraichi
e311bed2a8 Turn translation validation on for tests and accuracy runs by default. (#103611)
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.

The main changes are:

- Add `--no-translation-validation` as an option in _test/run_tests.py_
    - Set `PYTORCH_TEST_WITH_TV` environment variable
- Add `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
2023-06-30 01:32:21 +00:00
BowenBao
c1a49823cd [ONNX] Bench torch.onnx.dynamo_export and torch.onnx.export under dynamo bench (#103135)
- Extend dynamo bench interface with '--compilers onnx' and '--compilers dynamo-onnx'
- ONNX bench exports model to onnx and runs in ONNX Runtime.
- Introduce error aggregation and report.
- Scripts to build ONNX deps and running ONNX bench.
- Huggingface accuracy check workaround for ONNX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103135
Approved by: https://github.com/thiagocrepaldi, https://github.com/jansel
2023-06-22 01:21:09 +00:00
Bin Bao
a2988c9e6a [CI] Switch inference accuracy and performance tests to bfloat16 (#103535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103535
Approved by: https://github.com/eellison
2023-06-17 00:24:37 +00:00
Edward Z. Yang
bc6ec97e02 Switch dynamic_shapes to True by default (#103597)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103597
Approved by: https://github.com/voznesenskym
2023-06-15 15:16:20 +00:00
Edward Z. Yang
5211fad738 cm3leon_generate is at edge of timeout, so bump it up (#103607)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103607
Approved by: https://github.com/malfet
2023-06-15 03:40:42 +00:00
PyTorch MergeBot
a60f6dbe69 Revert "Add groups to dynamo benchmarking output data (#103268)"
This reverts commit 455f542ed9.

Reverted https://github.com/pytorch/pytorch/pull/103268 on behalf of https://github.com/drisspg due to no longer needed ([comment](https://github.com/pytorch/pytorch/pull/103268#issuecomment-1591732331))
2023-06-14 17:50:34 +00:00
chuanqiw
3c5ac4baa4 [CI] Enable inductor dynamic accuracy test on cpu device (#103387)
Enable inductor dynamic accuracy test on cpu in ci workflow to capture issue early.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103387
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire
2023-06-14 06:12:41 +00:00
BowenBao
45104cb67f Different csv headers by bench mode on infra error (#103134)
As title. The headers are different for distinct bench mode. This PR is a supplement
to https://github.com/pytorch/pytorch/pull/100372 to respect `performance` mode where numerical speedup is expected
instead of status text.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103134
Approved by: https://github.com/thiagocrepaldi, https://github.com/ezyang
2023-06-13 03:40:22 +00:00
Driss Guessous
455f542ed9 Add groups to dynamo benchmarking output data (#103268)
# Summary
Ads the required information to enable this issue:
https://github.com/pytorch/test-infra/issues/4268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103268
Approved by: https://github.com/huydhn
2023-06-12 21:09:42 +00:00
Edward Z. Yang
54daf870bc CUDA graphs overrides dynamic shapes and forces specialization (#103290)
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs.  This new strategy
I think is better.  The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input.  When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.

This obsoletes https://github.com/pytorch/pytorch/pull/101813

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym, https://github.com/eellison
2023-06-12 20:26:55 +00:00
Bin Bao
141828498c [CI] Update inference accuracy test (#103361)
Summary:
1) Switch inference accuracy test from fp32 to amp (consistent with dashboard run, https://github.com/pytorch/pytorch/pull/103220)
2) GoogleFnet fails in eager with amp or fp16, so fallback to always using fp32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103361
Approved by: https://github.com/eellison
2023-06-12 19:34:18 +00:00
Edward Z. Yang
c3fdfca5da Always create ShapeEnv, always apply unspec logic (#103302)
Originally, my goal for this PR was to remove the `dynamic_shapes` tests in torch/_dynamo/variables/builder.py. However, one thing lead to another, and it turns out that it was easiest to do all of the following in one go:

* Unconditionally allocate a ShapeEnv, no matter if dynamic_shapes is enabled or not (torch/_dynamo/output_graph.py). There is a small adjustment to export torch/_dynamo/eval_frame.py to account for the fact that a ShapeEnv always exists, even if you're not doing symbolic export.
* Remove dynamic_shapes test from unspec logic (torch/_dynamo/variables/builder.py), the original goal
* Specialize strides and storage offset if all sizes are dynamic (torch/fx/experimental/symbolic_shapes.py). This is required to deal with unconditional ShapeEnv: if a ShapeEnv exist, fake tensor-ification may choose to allocate symbols. The idea is that with `automatic_dynamic_shapes == False`, Dynamo should never request dynamic sizes, but this invariant was not upheld for nontrivial strides/offset.

The rest are just auxiliary fixups from the above:

* Workaround bug in FakeTensorProp where sometimes it doesn't return a FakeTensor (torch/fx/passes/fake_tensor_prop.py), see https://github.com/pytorch/pytorch/pull/103395 for follow up
* Make ShapeProp correctly handle int inputs (torch/fx/passes/shape_prop.py)
* Disable indexing strength reduction if `assume_static_by_default` is False (torch/_inductor/codegen/triton.py)
* Fix hf_T5_generate to NOT toggle `assume_static_by_default` if dynamic shapes is not enabled (benchmarks/dynamo/common.py); technically this is not necessary anymore but it's in for safety.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103302
Approved by: https://github.com/voznesenskym
2023-06-12 12:48:28 +00:00
Edward Z. Yang
414ec6ce97 Turn off automatic_dynamic_shapes in prep for dynamic-by-default (#103320)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103320
Approved by: https://github.com/Skylion007
2023-06-10 02:49:59 +00:00
PyTorch MergeBot
d89dd05e4d Revert "CUDA graphs overrides dynamic shapes and forces specialization (#103290)"
This reverts commit c760f0e4dd.

Reverted https://github.com/pytorch/pytorch/pull/103290 on behalf of https://github.com/ezyang due to to handle the other cuda graphs case ([comment](https://github.com/pytorch/pytorch/pull/103290#issuecomment-1584977767))
2023-06-09 18:25:28 +00:00
Edward Z. Yang
c760f0e4dd CUDA graphs overrides dynamic shapes and forces specialization (#103290)
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs.  This new strategy
I think is better.  The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input.  When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.

This obsoletes https://github.com/pytorch/pytorch/pull/101813

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym
2023-06-09 17:43:47 +00:00
Will Constable
39201ce025 Make dynamo bench conditionally import DDP/FSDP (#103163)
Avoids hitting importerror for singlenode benchmarks when running on
a non-distributed build of pytorch.

Fixes #102086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103163
Approved by: https://github.com/lezcano, https://github.com/wanchaol
2023-06-08 19:10:49 +00:00
Elias Ellison
18e4a466db fix amp in inference in benchmarking suite (#103220)
Even if you passed in --amp we would run inference in float32.

`AlbertForMaskedLM` goes from 1.305 float32 to 1.724x amp, and then again to 1.910x with freezing. Benchmark numbers for amp are about to go way up lol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103220
Approved by: https://github.com/desertfire
2023-06-08 05:16:22 +00:00
Edward Z. Yang
eeb3c62117 Add Wav2Vec2 HuggingFace support (#103009)
This is not actually enabled in the benchmark suite as you need
https://github.com/pytorch/pytorch/pull/103001 and also training
is broken per https://github.com/pytorch/pytorch/issues/101160
but might as well review this part first.

Contains https://github.com/pytorch/pytorch/pull/102979 but
I will probably rebase past that once it lands.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103009
Approved by: https://github.com/Skylion007
2023-06-06 13:25:06 +00:00
Edward Z. Yang
cca7b38564 Don't allow skipping deepcopy (#102973)
We might mutate it afterwards!  This could lead to hard to understand
bugs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102973
Approved by: https://github.com/albanD
2023-06-05 20:01:16 +00:00
Vinay Kumar Burugu
8215468870 Feature:To add --tolerance option to benchmark scripts (#102218)
The "tolerance" option evaluates the model on the baseline device in eager mode (default: CPU) compared to the test device (e.g., CUDA, XLA, etc.) and compares the output tensors to determine the absolute tolerance value based on the [formula](https://pytorch.org/docs/stable/generated/torch.allclose.html). It then saves the results in a CSV file. This comparison highlights the tolerance/accuracy difference between XLA and GPU/CPU devices and can also be used to evaluate newer accelerators. This feature aims to identify accuracy failures on the test device (e.g., XLA) and facilitate quick bug triaging.

This feature enables the following capabilities:
1. Ability to monitor accuracy issues of backends
2. Provide more informative picture on accuracy beyond pass/ fail status
3. Having a dump of accuracy information will help triage models accordingly

The data generated using this feature is in the [spreadsheet](https://docs.google.com/spreadsheets/d/1A8BAzSqfAw0Q5rgzK5Gk__Uy7qhuynh8tedxKnH-t94/edit#gid=0).

The spreadsheet data can be used to compile the below summary table:

| Suite                     | Max Tolerance                |          | No. of models with high inaccuracy(>=0.005) |          | Mean Tolerance |          |
|------------------ |:-------------:|:--------:|:-------------------------------------------:|:--------:|:--------------:|:--------:|
|                             |      xla           | inductor      |                     xla     | inductor |                                                xla      | inductor |
| huggingface       |        0.1169  |   0.0032      |                            1 |        0 |                                                   0.0022 |   0.0005 |
| timm_models     |        0.0373 |   2.8892      |                          10 |        8 |                                                   0.0028 |   0.7044 |
| torchbench        |         3.013   |   3.0381       |                            6 |        2 |                                                    0.0016 |   0.0016 |
| All models          |         3.013   |   3.0381      |                           17 |       10 |                                                  0.0028 |   0.7044 |

I used PyTorch release/2.0 branch and corresponding [commit_pin](https://github.com/pytorch/pytorch/blob/release/2.0/.github/ci_commit_pins/xla.txt) for XLA to generate the above data.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102218
Approved by: https://github.com/jansel
2023-06-03 06:40:26 +00:00
Edward Z. Yang
624257890e Reenable hf_T5_generate (#102818)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102818
Approved by: https://github.com/albanD
2023-06-02 17:59:53 +00:00
Edward Z. Yang
7c00d45312 Reenable cm3leon_generate (#102793)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102793
Approved by: https://github.com/albanD, https://github.com/awgu
2023-06-02 15:15:26 +00:00
Animesh Jain
65631d4515 [benchmarks] Use train mode for accuracy checks for HF models (#102578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102578
Approved by: https://github.com/desertfire
2023-05-31 19:47:18 +00:00
Bin Bao
47b884a74c [inductor] Revert a CI remedy for Triton compilation error (#102541)
Summary: revert https://github.com/pytorch/pytorch/pull/91634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102541
Approved by: https://github.com/ngimel
2023-05-31 13:13:51 +00:00
Animesh Jain
33a49eeae7 [benchmark] Flag to switch on activation checkpointing for HF models (#102557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102557
Approved by: https://github.com/ngimel, https://github.com/Chillee
2023-05-30 23:46:14 +00:00
Horace He
e71ab21422 update triton pin (#101919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101919
Approved by: https://github.com/ngimel
2023-05-30 17:16:05 +00:00
Animesh Jain
040d2cc969 [dynamo] Some torchrec_dlrm related fixes (#101953)
Issue 1 of https://github.com/pytorch/pytorch/issues/101918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101953
Approved by: https://github.com/jansel
2023-05-28 17:56:08 +00:00
Bin Bao
ee33bae5c7 Fix an issue where checking sameness throw an exception (#102279)
Summary: currently the exception is caught by outside and marked as
infra_error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102279
Approved by: https://github.com/anijain2305
2023-05-25 19:49:23 +00:00
Jason Ansel
5ba16011d7 Suppress profiler spam in dynamo benchmarks (#101942)
Makes this stuff go away:
```
STAGE:2023-05-20 20:49:34 63580:63580 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-20 20:49:34 63580:63580 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-20 20:49:34 63580:63580 ActivityProfilerController.cpp:321] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101942
Approved by: https://github.com/shunting314, https://github.com/desertfire
2023-05-22 18:32:31 +00:00
Edward Z. Yang
22ca1a1124 Partially fix shape mismatch in vision_maskrcnn (#101477)
The bulk of the heavy lifting is happening in
https://github.com/pytorch/vision/pull/7592

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101477
Approved by: https://github.com/voznesenskym
2023-05-21 05:20:08 +00:00
drisspg
6f13d6892a Add meta support for multinomial (#101324)
# Summary
Found this when trying to compile the text gen loop of nanogpt here: b33289942b/torchbenchmark/models/nanogpt_generate/model.py (L322)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101324
Approved by: https://github.com/ngimel
2023-05-19 00:04:26 +00:00
Animesh Jain
794cc3952e adding moco to CI (#101098)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101098
Approved by: https://github.com/desertfire
2023-05-18 10:01:49 +00:00
chuanqiw
b315c9b5ab [CI] Enlarge memory for OOM models in inductor cpu HF accuracy test (#101395)
Change the Inductor CPU HF accuracy test node from `linux.4xlarge` (32GB) to `linux.24xlarge` (192GB) to enlarge the node memory. Also add 3 HF models back to CI test.

Fixes #101390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101395
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/huydhn
2023-05-18 09:23:30 +00:00
Jason Ansel
403ce1a1c9 Fix benchmark model names printouts with tqdm (#101627)
With the TQDM changes in #100969 -- the models names ended up getting hidden from the benchmark printouts.  We would print the model name with no newline, then tqdm would print a `\r` and overwrite the name of the running model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101627
Approved by: https://github.com/ezyang
2023-05-17 15:31:11 +00:00
PaliC
e0fc24cdc5 add retries to inductor benchmark suite (#101019)
This pr accomplishes
1) Enables retries for downloading torchbenchmark and huggingface models in a similar method to how we do it for timm models right now.
2) creates a `_download_model` function for the hugging face and TIMM runners whose output I plan to use to preload the models somewhere if possible (please double check I'll be saving the right thing). Instead of retries, we plan to just add torchbench to a docker image as it is relatively small.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 3361a4c</samp>

> _We're the brave and bold coders of the `common.py` module_
> _We've made a handy function for downloading models_
> _We've shared it with our mates in the other runners_
> _So pull and push and try again, we'll get them all in time_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101019
Approved by: https://github.com/huydhn, https://github.com/desertfire
2023-05-16 21:41:50 +00:00
Edward Z. Yang
41468833fb vision_maskrcnn is now deterministic (#101116)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101116
Approved by: https://github.com/ngimel
2023-05-16 21:32:17 +00:00
Yanbo Liang
e4eaf33346 Re-enable detectron2_maskrcnn on CI (#100791)
#99665 has been fixed, we can re-enable these models on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100791
Approved by: https://github.com/huydhn
2023-05-16 04:25:58 +00:00
Edward Z. Yang
f48718f749 Update torchbench pin (#101365)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101365
Approved by: https://github.com/albanD, https://github.com/awgu
2023-05-15 16:52:31 +00:00
Natalia Gimelshein
49578913fb update timm commit (#100931)
Fixes #100903

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100931
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-05-12 04:22:08 +00:00
Edward Z. Yang
41a4e22015 Update torchbench pin (#101071)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101071
Approved by: https://github.com/malfet
2023-05-11 18:09:40 +00:00
Jason Ansel
036a8d6b4a Remove NullContext() from benchmark runners (#100309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100309
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2023-05-11 06:42:27 +00:00
XiaobingSuper
c84627c2ee benchmarks: make --amp works for cpu path (#101057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101057
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-05-11 02:51:38 +00:00
Edward Z. Yang
c658732950 [RFC] Add tqdm to benchmarking script (#100969)
Here's what it looks like, on a slower running benchmark:

https://github.com/pytorch/pytorch/assets/13564/47c4a5bd-e963-45de-a15c-2fd943de0fa4

There's actually quite a bit of dead time, it's possible there are more spots we should add tqdm to. Looking for opinions on utility of this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100969
Approved by: https://github.com/Skylion007
2023-05-10 15:39:24 +00:00
Bin Bao
76cc3ab4f3 [CI] Delete skips from https://github.com/pytorch/pytorch/issues/93847 (#96049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96049
Approved by: https://github.com/jansel
2023-05-10 01:27:27 +00:00
Edward Z. Yang
9eab13fc90 Reenable llama benchmark (#100877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100877
Approved by: https://github.com/albanD
2023-05-09 01:12:54 +00:00
Natalia Gimelshein
9790f9174a skip lcnet (#100726)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100726
Approved by: https://github.com/voznesenskym
2023-05-05 23:19:42 +00:00
Animesh Jain
3f025c607c summarize graph breaks (#100696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100696
Approved by: https://github.com/yanboliang
2023-05-05 22:27:47 +00:00
Animesh Jain
8994d9e610 [dynamo] Hide guard_fail_hook behind a flag to improve cache lookup time (+10% DebertaV2) (#100590)
For TorchDynamo eager backend, DebertaV2 speedup improves from 0.77x to 0.87x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100590
Approved by: https://github.com/voznesenskym, https://github.com/wconstab
2023-05-04 18:52:21 +00:00
Yanbo Liang
896eb1db26 [Dynamo] Skip TB Background_Matting model eager accuracy check because of non deterministic (#100513)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100513
Approved by: https://github.com/anijain2305
2023-05-03 07:06:50 +00:00
Jason Ansel
fdc853b14c Add --baseline option to benchmark runners (#100266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100266
Approved by: https://github.com/ngimel
2023-05-02 02:35:11 +00:00
Edward Z. Yang
e918fd18e7 Disable densenet121 as it is flaky (#100371)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100371
Approved by: https://github.com/voznesenskym
2023-05-02 01:49:11 +00:00
Edward Z. Yang
5d93265cce Report timeout/infra_error instead of 0.0000 on infra error (#100372)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100372
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-05-01 14:56:01 +00:00
Huy Do
9a69634b28 Skip some failing dynamic shape models on periodic (#99895)
After some recent changes, these tests are failing in periodic trunk.  So let's move them to unstable while waiting for the team to root cause the issue https://github.com/pytorch/pytorch/issues/99893.  Note that a forward fix can use `ciflow/unstable` to run those unstable jobs to confirm that they are fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99895
Approved by: https://github.com/malfet
2023-04-25 07:05:08 +00:00
Edward Z. Yang
04e8df4dd7 Return full accuracy status for printing, not abbreviated version (#99894)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99894
Approved by: https://github.com/jansel
2023-04-25 05:17:10 +00:00
Edward Z. Yang
cd61707167 yolov3 dynamic training accuracy is fixed (#99896)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99896
Approved by: https://github.com/albanD
2023-04-25 01:15:24 +00:00
chuanqiw
e9e5ffe83e Re-enable dynamic shapes test in dynamo benchmark (#99816)
Set `torch._dynamo.config.assume_static_by_default = False` for dynamic shapes flag enabled

Fixes #99815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99816
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-04-24 20:34:52 +00:00
Edward Z. Yang
f602b3a6ae Preserve mark_dynamic when cloning inputs (#99617)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99617
Approved by: https://github.com/ngimel, https://github.com/voznesenskym, https://github.com/anijain2305
2023-04-22 19:46:31 +00:00
Bin Bao
e09f785a72 [CI] Remove inductor skip list for Huggingface (#99375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99375
Approved by: https://github.com/anijain2305
2023-04-21 18:13:22 +00:00
Edward Z. Yang
fc8fa6c356 Require at least one tensor to be marked dynamic with --dynamic-batch-only (#99620)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99620
Approved by: https://github.com/voznesenskym
2023-04-21 00:17:08 +00:00
Huy Do
5315317b7b Skip some detectron2_maskrcnn models with KeyError _ignore_torch_cuda_oom (#99599)
These tests are failing in trunk 233cc34d3b with `KeyError: '_ignore_torch_cuda_oom'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99599
Approved by: https://github.com/malfet
2023-04-20 18:11:35 +00:00
Jason Ansel
3233450d07 Add TorchXLA option to benchmark runner (#99505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99505
Approved by: https://github.com/voznesenskym
2023-04-19 22:44:52 +00:00
Will Constable
9ac2b041c9 Make opacus xfail instead of skip (#99380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99380
Approved by: https://github.com/desertfire, https://github.com/anijain2305
2023-04-19 21:09:06 +00:00
Michael Voznesensky
113bd11cf4 Skip levit (#99491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99491
Approved by: https://github.com/ezyang
2023-04-19 07:41:42 +00:00
Edward Z. Yang
039faf0dbf Add invariant that all symbolic shapes must be bound in graph (#99089)
Previously, we had a problem when partitioning forward-backward dynamic graphs, which is that we could end up with a backward graph that mentions a symbol in an input tensor (e.g., `f32[s0 + s1]`), but without this symbol being otherwise bound elsewhere. When this happens, we have no way of actually deriving the values of `s0` and `s1`. Our fix for this in https://github.com/pytorch/pytorch/pull/93059 was to just retrace the graph, so that s0 + s1 got allocated a new symbol s2 and everything was happy. However, this strategy had other problems, namely (1) we lost all information from the previous ShapeEnv, including guards and (2) we end up allocating a LOT of fresh new symbols in backwards.

With this change, we preserve the same ShapeEnv between forward and backwards. How do we do this? We simply require that every symbol which may be present inside tensors, ALSO be a plain SymInt input to the graph. This invariant is enforced by Dynamo. Once we have done this, we can straightforwardly modify the partitioner to preserve these SymInt as saved for backwards, if they are needed in the backwards graph to preserve the invariant as well.

This apparently breaks yolov3, but since everything else is OK I'm merging this as obviously good and investigating later.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99089
Approved by: https://github.com/voznesenskym
2023-04-16 01:48:19 +00:00
Yanbo Liang
15fe5a0798 [Dynamo] Fix benchmark --verbose error (#99224)
Dynamo benchmark --verbose is broken:
```
Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/torchbench.py", line 400, in <module>
    torchbench_main()
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/torchbench.py", line 396, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 1967, in main
    return maybe_fresh_cache(
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 993, in inner
    return fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 2135, in run
    torch._dynamo.config.log_level = logging.DEBUG
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/config_utils.py", line 67, in __setattr__
    raise AttributeError(f"{self.__name__}.{name} does not exist")
AttributeError: torch._dynamo.config.log_level does not exist
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99224
Approved by: https://github.com/voznesenskym
2023-04-15 20:18:50 +00:00
Bin Bao
34f681c13b [CI] Remove inductor skip list for timm_models (#98840)
Summary: check against the expected csv file instead of skipping tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98840
Approved by: https://github.com/ezyang
2023-04-15 13:54:41 +00:00
Bin Bao
e5501a967e [inductor] Support IndexPutFallback in cpp_wrapper (#98972)
Summary:
1) Make the fallback index_put generate the right cpp code in cpp_wapper
2) Add a --cpp-wrapper option to common.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98972
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-13 15:41:03 +00:00
Edward Z. Yang
b8b840be3d Convert logging f-strings to use % format, part five (#98765)
This does some annoying but simple cases by hand.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98765
Approved by: https://github.com/wanchaol
2023-04-11 13:17:59 +00:00
Edward Z. Yang
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Edward Z. Yang
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
Edward Z. Yang
bdb79a8f52 Turn off divisible_by_16 for dynamic shapes; support ablation (#98471)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98471
Approved by: https://github.com/ngimel, https://github.com/voznesenskym
2023-04-06 12:57:07 +00:00
Edward Z. Yang
cf1bfca2ba Require batch dimensions to be compiled dynamically (#98334)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98334
Approved by: https://github.com/voznesenskym
2023-04-05 19:40:22 +00:00
Edward Z. Yang
b923f84805 Switch accuracy CI to dynamic batch only (#98307)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98307
Approved by: https://github.com/wconstab
2023-04-05 01:20:12 +00:00
Elias Ellison
a3365e1d0d Increment pending forwards after invocation (#98101)
Forwards are only pending following invocation, not before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98101
Approved by: https://github.com/ngimel
2023-04-05 00:04:39 +00:00
Bin Bao
69ff39d2e7 Skip gat, gcn and sage for TorchBench CUDA test (#98244)
Summary: The three models only support CPU for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98244
Approved by: https://github.com/ezyang
2023-04-04 01:06:18 +00:00
Jason Ansel
55afaa46a4 Support functools.partial and itertools.product (#98120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98120
Approved by: https://github.com/anijain2305
2023-04-03 18:23:25 +00:00
Bin Bao
ba7ee00f00 Add a --inference flag to dynamo benchmark script (#98173)
Summary: When calling benchmark scripts, make it a requirement to pass
--inference or --training

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98173
Approved by: https://github.com/huydhn
2023-04-03 17:12:28 +00:00
Jason Ansel
92b46202ef Add --stats option to benchmark scripts (#98109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98109
Approved by: https://github.com/anijain2305
2023-04-02 02:23:13 +00:00
Edward Z. Yang
5df59f957f Fix G001,G002,G003 in logs to % syntax (#97812)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97812
Approved by: https://github.com/Skylion007, https://github.com/kiukchung, https://github.com/malfet, https://github.com/mlazos
2023-04-01 01:43:33 +00:00
Bin Bao
c699ac17df [CI] Bump up torchbench version to fix dynamo graph breaks in transformers (#98003)
Summary: When we bump up the torchbench version pin last time, we found
there were new graph breaks introduced with the trasformers version
upgrade, see https://github.com/pytorch/pytorch/pull/96782. Turns out
they are already fixed upstream, see
https://github.com/huggingface/transformers/pull/21648 and https://github.com/pytorch/benchmark/pull/1511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98003
Approved by: https://github.com/ngimel
2023-03-31 16:52:09 +00:00
Edward Z. Yang
97fc8ea5f4 Run the benchmark suite with dynamic batch only (#97912)
Symbolic shapes compile time on full CI with inductor is horribly long (even though our aot_eager local runs seemed to suggest that the added latency was only 10s per model.) To patch over the problem for now, run the benchmark suite with dynamic batch only.  This should absolve a lot of sins.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97912
Approved by: https://github.com/janeyx99, https://github.com/desertfire
2023-03-30 18:04:48 +00:00
Aaron Gokaslan
47dca20d80 [BE] Enable flake8-comprehension rule C417 (#97880)
Enables flake8-comprehension rule C417. Ruff autogenerated these fixes to the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97880
Approved by: https://github.com/ezyang, https://github.com/kit1980, https://github.com/albanD
2023-03-30 14:34:24 +00:00
William Wen
b93e1f377e [dynamo, benchmarks] Add inductor-mode (for max-autotune) and warm start options to dynamo benchmarks (#97719)
Title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97719
Approved by: https://github.com/shunting314
2023-03-29 21:09:00 +00:00
Edward Z. Yang
f754be897a Disable speedup_experiment_ds (#97806)
It seems to be broken.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97806
Approved by: https://github.com/jansel
2023-03-29 01:27:31 +00:00
Bin Bao
a9a81ab7e3 [CI] Run benchmark test with dynamo_eager in periodic (#97543)
Summary: The idea is to catch any dynamo_eager regression earlier, and also
we can take that off the dashboard run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97543
Approved by: https://github.com/huydhn
2023-03-28 01:02:49 +00:00
Shunting Zhang
652592efa9 [inductor] use torch.prifiler in the triton wrapper (#97405)
I think it's helpful to use torch.profiler to profile the triton wrapper.

E.g., I tried it for nvidia_deeprecommender's infernece graph.

Even with max-autotune, we see the majority of the time the GPU is running 2 mm/addmm op. That's why max autotune does not help for this model since tuning does not affect the external mm ops.

<img width="711" alt="Screenshot 2023-03-22 at 5 49 28 PM" src="https://user-images.githubusercontent.com/52589240/227072474-2f0d7205-4a10-4929-b1b7-551214788c61.png">

next step I'll check why the triton mm kernels are not picked.

EDIT: the above screenshot is captured without max-autotune due to a typo. below is the trace with max-autotune enabled:
<img width="712" alt="Screenshot 2023-03-22 at 6 43 26 PM" src="https://user-images.githubusercontent.com/52589240/227077624-fdccf928-be08-4211-871b-a9e3d7b76fbe.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97405
Approved by: https://github.com/ngimel
2023-03-27 21:54:25 +00:00
Edward Z. Yang
cff4826f28 pytorch_unet is now passing (#97309)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97309
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-03-22 13:55:05 +00:00
Bin Bao
be49d3b170 [CI] Turn on debug logging for dla102 and gernet_l (#97307)
Summary: Log the generated code for those two flaky tests to see if
there is any codegen difference when they fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97307
Approved by: https://github.com/ezyang
2023-03-22 13:42:13 +00:00
Natalia Gimelshein
e7d9331688 [inductor] hoist symbolic padding expressions (#97099)
Towards fixing pnasnet5large, see #96709. The generated kernel looks much better
```
@pointwise(size_hints=[1048576], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32', 3: 'i32', 4: 'i32', 5: 'i32', 6: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 6), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, out_ptr0, ks0, ks1, ks2, ks3, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x1 = (xindex // ks0) % ks0
    x0 = xindex % ks0
    x2 = (xindex // ks3)
    x4 = xindex
    tmp0 = x1 + ((-1)*ks1)
    tmp1 = 0
    tmp2 = tmp0 >= tmp1
    tmp3 = ks2
    tmp4 = tmp0 < tmp3
    tmp5 = x0 + ((-1)*ks1)
    tmp6 = tmp5 >= tmp1
    tmp7 = tmp5 < tmp3
    tmp8 = tmp2 & tmp4
    tmp9 = tmp8 & tmp6
    tmp10 = tmp9 & tmp7
    tmp11 = tl.load(in_ptr0 + (x0 + ((-1)*ks1) + (ks2*x1) + (x2*(ks2*ks2)) + ((-1)*ks1*ks2) + tl.zeros([XBLOCK], tl.int32)), tmp10 & xmask, other=0)
    tmp12 = tl.where(tmp10, tmp11, 0.0)
    tl.store(out_ptr0 + (x4 + tl.zeros([XBLOCK], tl.int32)), tmp12, xmask)
 ```
Interestingly, removing `expand` in in index `simplify` function makes `load` expression a little bit better, but `store` fails to simplify to flat store in this case, so I'm leaving `expand` in.
 Full pnasnet still chokes on `ceiling` in batch_norm kernels, additionally, it looks like shape propagation goofs in inductor and generates overly complicated expressions, we should switch to meta data from fx graph.
 I'm still not adding `ceil` print to triton, because we should be able to hoist all indexing expression (and just printing ceil without converting to int64 doesn't work)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97099
Approved by: https://github.com/jansel
2023-03-21 21:43:32 +00:00
Edward Z. Yang
e74c5e5637 rexnet_100 is disabled for static, does not need dynamic listing (#97100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97100
Approved by: https://github.com/Skylion007
2023-03-19 20:57:49 +00:00
Bin Bao
577d930c39 [CI] Revert https://github.com/pytorch/pytorch/pull/96195 (#96897)
Summary: https://github.com/pytorch/pytorch/pull/96195 was an experiment
for debugging flaky failures on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96897
Approved by: https://github.com/ngimel
2023-03-16 06:28:18 +00:00
Edward Z. Yang
3606f59366 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-16 02:54:18 +00:00
Will Constable
54cd4a67d0 Output peak memory stats from dynamo torchbench perf CI (#95666)
Adds absolute memory usage numbers (in addition to compression ratio) to performance jobs.

Example output:
<img width="1211" alt="image" src="https://user-images.githubusercontent.com/4984825/225419950-500908c5-00ce-4711-afa2-c995bf90d35d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95666
Approved by: https://github.com/ezyang, https://github.com/williamwen42
2023-03-15 19:24:47 +00:00
Bin Bao
33c7be360f [reland][CI] switch torchbench to a pinned version (#96782)
Summary: This is reland of https://github.com/pytorch/pytorch/pull/96553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96782
Approved by: https://github.com/huydhn
2023-03-15 12:46:36 +00:00
Edward Z. Yang
037acd5a22 Update CI skips (#96745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96745
Approved by: https://github.com/wconstab
2023-03-14 22:19:10 +00:00
PyTorch MergeBot
be4eaa69c2 Revert "[CI] switch torchbench to a pinned version (#96553)"
This reverts commit 61d6ccd29a.

Reverted https://github.com/pytorch/pytorch/pull/96553 on behalf of https://github.com/desertfire due to land race
2023-03-14 21:39:45 +00:00
PyTorch MergeBot
ba4fb9b6ad Revert "Default specialize_int to False (#96624)"
This reverts commit 1ac8782db2.

Reverted https://github.com/pytorch/pytorch/pull/96624 on behalf of https://github.com/kit1980 due to Broke inductor/test_torchinductor_dynamic_shapes.py
2023-03-14 19:43:47 +00:00
Bin Bao
61d6ccd29a [CI] switch torchbench to a pinned version (#96553)
Summary: Previously we were using a branch on torchbench which skips
torchaudio. We should switch to make sure a good test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96553
Approved by: https://github.com/huydhn, https://github.com/ezyang
2023-03-14 18:42:22 +00:00
Edward Z. Yang
1ac8782db2 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-14 18:37:47 +00:00
David Berard
6e3d51b08a [inductor][CI] also skip rexnet_100 on non-dynamic shapes (#96691)
Recent failures show rexnet_100 accuracy is flaky also on non-dynamic shapes (was already disabled for dynamic shapes in #96474). The failure occurs for the same reason (stem.bn.weight.grad).
e.g. https://github.com/pytorch/pytorch/actions/runs/4402868441/jobs/7710977874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96691
Approved by: https://github.com/desertfire
2023-03-14 18:11:59 +00:00
Edward Z. Yang
ff7e510d1e Correctly use PythonPrinter for generating wrapper code referencing sympy (#96710)
Otherwise you get stuff like ceiling(s0) which is not valid Python code. Fixes volo_d1_224

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96710
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-14 14:35:52 +00:00
Wang, Eikan
3cad8d23d0 [Inductor] Skip the hf_T5_base due to intermittent failure on CI (#96649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96649
Approved by: https://github.com/desertfire
2023-03-14 07:40:20 +00:00
Edward Z. Yang
507feb805f Don't specialize torch.Size with specialize_int = False (#96419)
Fixes https://github.com/pytorch/pytorch/issues/95868

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96419
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-03-14 01:32:58 +00:00
Edward Z. Yang
c7f39c0820 Update CI skips (#96554)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96554
Approved by: https://github.com/janeyx99
2023-03-13 13:40:45 +00:00
David Berard
29cd60dfb7 [CI] handle more dynamo benchmark models that are not expected to be deterministic (#96324)
Follow-up to #96245. alexnet, Background_Matting, vision_maskrcnn, and vgg16 all have the same problem; but on float32 they were also failing on the previous day so I missed this. Once the amp jobs became available I could see that these have the same issue (on both float32 and amp).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96324
Approved by: https://github.com/desertfire
2023-03-10 18:15:34 +00:00
Bin Bao
a651e6253a [CI] Change compile_threads to 1 when running benchmark accuracy test on CI (#96195)
Summary: This is not a pretty solution, but it a way to verify if the flakiness is coming from parallel compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96195
Approved by: https://github.com/ngimel
2023-03-10 17:39:38 +00:00
Edward Z. Yang
ff2e14f200 Skip rexnet_100 in dynamic CI (#96474)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96474
Approved by: https://github.com/yanboliang, https://github.com/msaroufim
2023-03-10 01:23:19 +00:00
Edward Z. Yang
c988de1040 [EASY] Update inductor training dynamic skips (#96298)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96298
Approved by: https://github.com/Chillee, https://github.com/janeyx99
2023-03-08 19:31:46 +00:00
Bin Bao
b3a079810e [CI] Add a workflow for quick perf comparison (#96166)
Summary: ciflow/inductor-perf-test-nightly now contains full dashboard
run which takes a very long time. Ed proposed a simplification of the
perf run there, but it is still worth to have a set of fast perf test
which only includes one configuration (--training --amp).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96166
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
2023-03-08 19:09:04 +00:00
Bin Bao
664381b293 [CI] Avoid calling torch.use_deterministic_algorithms for some models (#96245)
tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96245
Approved by: https://github.com/davidberard98
2023-03-08 03:35:32 +00:00
Edward Z. Yang
d0641ed247 [TEST] Turn on unspecialize int dynamic training inductor CI (#96058)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96058
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-03-07 16:08:45 +00:00
Edward Z. Yang
a6e3e7905e Turn on unspecialize int dynamic inductor CI (#96034)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96034
Approved by: https://github.com/voznesenskym
2023-03-07 12:39:55 +00:00
Jason Ansel
95d17dc93d [inductor] Reland #95567 part 1 (#96023)
This is the non-problematic part of #95567.  The errors were coming from
IR printing changes which will be next in the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96023
Approved by: https://github.com/ngimel, https://github.com/mlazos
2023-03-06 22:57:22 +00:00
Edward Z. Yang
1fd7ea1ba8 Update skips for RecursionError (#96109)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96109
Approved by: https://github.com/huydhn
2023-03-06 17:55:38 +00:00
Bin Bao
60cf95610d [CI] Skip xcit_large_24_p8_224 in TIMM (#96048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96048
Approved by: https://github.com/jansel
2023-03-05 14:54:46 +00:00
Bin Bao
1359d16fe8 [CI] Further tighten the checking of two eager runs (#95902)
Summary: To catch nondeterminism in eager if there is any.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95902
Approved by: https://github.com/jansel
2023-03-05 14:53:02 +00:00
Edward Z. Yang
c7c4a20321 Update dynamic skips (#95966)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95966
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-03-04 23:01:58 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Edward Z. Yang
d303665d33 Make int unspecialization actually work (#95621)
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.

The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.

* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.

Fixes https://github.com/pytorch/pytorch/issues/95469

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
2023-03-04 01:22:08 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Bin Bao
879f0c3fee [CI] Increate the timeout limit for benchmark test (#95787)
Summary: xcit_large_24_p8_224 occasionally hits TIMEOUT on CI. Bump up
the limit to reduce flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95787
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
2023-03-01 19:54:25 +00:00
Bin Bao
e79b2b7792 [CI] Force clear triton cache between running each test (#95729)
Summary: The idea is to see if this reduces some of the flakiness
we have seen on CI. If it does help, then we have a problem in our
caching implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95729
Approved by: https://github.com/ngimel
2023-03-01 04:10:03 +00:00
Will Constable
1a72712645 Add dynamo graph break stats to CI (#95635)
Adds columns to csv produced by accuracy job including dynamo graph break stats.

Example output from torchbench CI job:
<img width="771" alt="image" src="https://user-images.githubusercontent.com/4984825/221716236-9276684e-1be8-43e1-837e-f41671d4e0e3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95635
Approved by: https://github.com/ezyang
2023-02-28 16:17:46 +00:00
Edward Z. Yang
3762e801ba Update dynamic skips (#95587)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95587
Approved by: https://github.com/voznesenskym
2023-02-28 03:26:55 +00:00
Bin Bao
fa5a4b0dfc [CI] Do not compare two eager run results against fp64 result (#95616)
Summary: When running the benchmark test with --accuracy, two eager runs
should return the same result. If not, we want to detect it early, but
comparing against fp64_output may hide the non-deterministism in eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95616
Approved by: https://github.com/ZainRizvi
2023-02-27 20:11:21 +00:00
Bin Bao
ab1ab3ab19 [CI] Specify more torch.backends.cudnn options to reduce non-determinism (#95478)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95478
Approved by: https://github.com/ezyang
2023-02-25 18:54:12 +00:00
Bin Bao
4c8ad93a7c [Inductor][CI] Remove hf_GPT2_large from CPU inference test (#95473)
Summary: hf_GPT2_large shows random failure on CI for the CPU inference. Created https://github.com/pytorch/pytorch/issues/95474 for the Intel team to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95473
Approved by: https://github.com/anijain2305
2023-02-24 18:21:36 +00:00
Will Constable
8de4238a31 Add dynamo bench arg --per_process_memory_fraction (#95260)
Simply pipes the arg to the existing torch.cuda API by the same name.

Useful for locally debugging OOMs that happened on a smaller GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95260
Approved by: https://github.com/davidberard98
2023-02-22 05:11:18 +00:00
Edward Z. Yang
08370ddad8 Update model skips (#95089)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95089
Approved by: https://github.com/albanD
2023-02-20 13:24:49 +00:00
Wang, Eikan
954c767bc6 [Inductor] Enable accuracy test for CPPBackend (#94898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94898
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-20 05:02:15 +00:00
Edward Z. Yang
a2f44d82f8 Flag guard unbacked SymInt/SymFloat support (#94987)
I believe this fixes the AllenaiLongformerBase problem in periodic.

The longer version of the problem is here is we are currently optimistically converting all item() calls into unbacked SymInt/SymFloat, but sometimes this results in a downstream error due to a data-dependent guard. Fallbacks for this case are non-existent; this will just crash the model. This is bad. So we flag guard until we get working fallbacks.

What could these fallbacks look like? One idea I have is to optimistically make data-dependent calls unbacked, but then if it results in a crash, restart Dynamo analysis with the plan of graph breaking when the item() call immediately happened.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94987
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-17 00:25:05 +00:00
Edward Z. Yang
7aaebe00ee Fail dynamic_aot_eager AllenaiLongformerBase model (#94986)
```
GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is Eq(i3, -1).  Scroll up to see where each of these data-dependent accesses originally occurred.

While executing %as_strided : [#users=1] = call_method[target=as_strided](args = (%pad,), kwargs = {size: (12, %add, 768, 64), stride: (%getitem, %mul, %getitem_1, %getitem_2)})
Original traceback:
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/models/longformer/modeling_longformer.py", line 928, in <graph break in _sliding_chunks_matmul_attn_probs_value>
    chunked_value = padded_value.as_strided(size=chunked_value_size, stride=chunked_value_stride)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94986
Approved by: https://github.com/albanD
2023-02-16 20:02:46 +00:00
Aaron Gokaslan
0444a6c90a [BE] Remove deprecated logging warn method (#94708)
Swaps all logging.warn calls to logging.warning since the former is deprecated and even raises a deprecation warning now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94708
Approved by: https://github.com/ezyang
2023-02-13 18:24:52 +00:00
Edward Z. Yang
ae7a628b03 Dynamic shapes CI updates (#94690)
Data from https://github.com/pytorch/pytorch/pull/94683

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94690
Approved by: https://github.com/cpuhrsch
2023-02-13 18:20:12 +00:00
PyTorch MergeBot
10c430ba0a Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 2a5851735a.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/desertfire due to TIMM models start to show flaky failures after this PR, need more investigation
2023-02-10 04:40:32 +00:00
Bin Bao
2a5851735a Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-09 23:43:13 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also typing `_` need to press the shift or the caps-lock key than `-`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
Edward Z. Yang
c028fc4e25 Decouple PT2 dynamic shapes from the functorch setting (#94469)
The functorch setting still exists, but now it is no longer necessary:
we infer use of Python dispatcher by checking if the ambient
FakeTensorMode has a ShapeEnv or not.  The setting still exists,
but it is for controlling direct AOTAutograd use now; for PT2,
it's sufficient to use torch._dynamo.config.dynamic_shapes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94469
Approved by: https://github.com/Chillee, https://github.com/voznesenskym, https://github.com/jansel
2023-02-09 06:41:41 +00:00
PyTorch MergeBot
ca63040d2b Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 7bfc59993d.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/huydhn due to This change fails in trunk 7bfc59993d running out of memory.  Mark this as weird because it was green in PR
2023-02-09 01:24:35 +00:00
Bin Bao
7bfc59993d Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-08 23:30:10 +00:00
Jason Ansel
eb1aca162e Re-enable cudagraphs for benchmark scripts (#94192)
Related to https://github.com/pytorch/pytorch/pull/93253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94192
Approved by: https://github.com/albanD, https://github.com/desertfire
2023-02-08 16:38:32 +00:00
chuanqiw
94394e568e change the dynamo benchmark timeout as a parameter (#94284)
Change the dynamo benchmark timeout from hard code to a parameter with default value 1200ms, cause the hard code 1200ms timeout led some single thread mode model crashed on CPU platform. With the parameter, users can specify the timeout freely.

Fixes #94281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94284
Approved by: https://github.com/malfet
2023-02-08 00:45:08 +00:00
Bin Bao
db011e11ea Skip sebotnet33ts_256 on CI (#94067)
Summary: Random failure on CI and it happens more frequently lately.
Skip for now and filed an issue at https://github.com/pytorch/pytorch/issues/94066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94067
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-06 14:58:54 +00:00
Edward Z. Yang
1d53123f44 Report graph breaks separately from graph count (#94143)
graph break != graph count - 1.  Suppose you have a nested
inline function call f1 to f2 to f3.  A graph break in f3
results in six graphs: f1 before, f2 before, f3 before, f3 after,
f2 after, f1 after.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94143
Approved by: https://github.com/voznesenskym
2023-02-05 04:03:12 +00:00
Edward Z. Yang
c1da35af5e Update dynamic benchmark skips (#94114)
Data from https://github.com/pytorch/pytorch/pull/94134

Signed-off-by: Edward Z. Yang <ezyangmeta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94114
Approved by: https://github.com/SherlockNoMad
2023-02-04 20:36:51 +00:00
Jason Ansel
e071d72f3c Tag dynamo backends as debug/experimental (#93878)
Hides debug/experimental backends by default.

Before:
```
torch._dynamo.list_backends()
['aot_eager', 'aot_eager_decomp_partition', 'aot_torchxla_trace_once', 'aot_torchxla_trivial', 'aot_ts', 'aot_ts_nvfuser', 'cudagraphs', 'dynamo_accuracy_minifier_backend', 'dynamo_minifier_backend', 'eager', 'inductor', 'ipex', 'nvprims_aten', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'torchxla_trace_once', 'torchxla_trivial', 'ts', 'tvm']
```

After:
```
torch._dynamo.list_backends()
['aot_ts_nvfuser', 'cudagraphs', 'inductor', 'ipex', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'tvm']
```

Fixes https://github.com/pytorch/pytorch/issues/93733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93878
Approved by: https://github.com/voznesenskym
2023-02-04 00:50:51 +00:00
Jason Ansel
0a93e6db5a Fix/refactor dynamo ipex backend (#93863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93863
Approved by: https://github.com/desertfire
2023-02-03 21:42:27 +00:00
Jason Ansel
203b2cad3e Remove fx2trt/torch2trt backends (#93822)
These backends have been broken for some time.  I tried to get them
running again, but as far as I can tell they are not maintained.
Installing torch_tensorrt downgrades PyTorch to 1.12.  If I manually
bypass that downgrade, I get import errors from inside fx2trt.  Fixes that
re-add these are welcome, but it might make sense to move these wrappers
to the torch_tensorrt repo once PyTorch 2.0 support is added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93822
Approved by: https://github.com/frank-wei
2023-02-03 21:04:21 +00:00
Jason Ansel
a5ff40032d Fix/refactor dynamo onnxrt backend (#93818)
Fixes https://github.com/pytorch/pytorch/issues/90352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93818
Approved by: https://github.com/voznesenskym
2023-02-03 20:48:02 +00:00
Edward Z. Yang
2481fc0df4 Add count to FakeTensorMode.__torch_dispatch__ (#93936)
Most calls to fake tensor never hit `FakeTensor.__torch_dispatch__`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93936
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-02-03 14:21:11 +00:00
Fabio Rocha
63115b70f0 Fixed issue with --diff-branch arg in dynamo benchmarks (#93989)
As @peterbell10 pointed out, it was giving incorrect results for `compression_ratio`
and `compression_latency` when you used `--diff-branch`.

This fixes this by running a separate subprocess for each branch to make sure you are not being affected by run for other branch.

Also added a couple of more significant figures
to numbers in summary table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93989
Approved by: https://github.com/jansel
2023-02-03 08:36:57 +00:00
Jason Ansel
60e8c766b5 Refactor dynamo training backends (#93409)
This splits training.py into many files and moves them from `dynamo.optimizations.training` to `dynamo.backends.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93409
Approved by: https://github.com/ezyang
2023-02-03 03:07:15 +00:00
atalman
6e285c479d Remove cuda 11.6 from CI replace with 11.7 (#93406)
Remove cuda 11.6 from CI replace with 11.7
Following the Release readme here: https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93406
Approved by: https://github.com/malfet, https://github.com/desertfire
2023-02-02 19:16:05 +00:00
Jason Ansel
d7b39b17ab Remove torch/_dynamo/optimizations/{analysis,log_args}.py (#93279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93279
Approved by: https://github.com/voznesenskym
2023-02-02 02:34:36 +00:00
Edward Z. Yang
03b465a6d0 Add --iterations to benchmark script (#93858)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93858
Approved by: https://github.com/williamwen42
2023-02-01 21:56:49 +00:00
Edward Z. Yang
08041c5264 Configurable repro_tolerance for same_two_models (#93398)
Fixes https://github.com/pytorch/pytorch/issues/93293

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93398
Approved by: https://github.com/SherlockNoMad
2023-02-01 01:41:48 +00:00
Edward Z. Yang
811e95a15e --dynamic-ci-skips now works for all backends (#93369)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93369
Approved by: https://github.com/albanD
2023-01-31 20:07:58 +00:00
Edward Z. Yang
efee879695 Don't suppress warnings in CI. (#93269)
Warnings are an important clue that something bad is going on.
You want to see them in logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93269
Approved by: https://github.com/voznesenskym
2023-01-30 19:21:09 +00:00
Edward Z. Yang
9eb402d18e Update dynamic benchmark skips (#93228)
Data from https://github.com/pytorch/pytorch/pull/93223

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93228
Approved by: https://github.com/desertfire
2023-01-30 14:22:53 +00:00
XiaobingSuper
9a2becf60a inductor: fix inplace op's wrong lowering issue when preop is NopKernel (#92247)
For TIMM ghostnet_100, there has such case, concat+inplace_add:

```
import torch
from torch._inductor import config
config.debug = True
torch._dynamo.config.verbose=True

class MockModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y, z):
        out = torch.cat([x, y], dim=1)
        out+=z
        return out

mod = MockModule().eval()
inputs = (
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 128, 16, 16]),
            )
ref = mod(*inputs)

with torch.no_grad():
    opt_model = torch._dynamo.optimize('inductor')(mod)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
print(torch.equal(ref, out))
```

the inductor always get a wrong result, I find that inductor get a wrong code:

```

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       const float* __restrict__ in_ptr2,
                       const float* __restrict__ in_ptr3,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
            tmp0.store(out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr0[i0];
            out_ptr0[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
            tmp0.store(out_ptr1 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr1[i0];
            out_ptr1[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<2048; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr2 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr3 + 16*i0);
            auto tmp2 = tmp0 + tmp1;
            tmp2.store(out_ptr2 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=32768; i0<32768; i0+=1)
        {
            auto tmp0 = in_ptr2[i0];
            auto tmp1 = in_ptr3[i0];
            auto tmp2 = tmp0 + tmp1;
            out_ptr2[i0] = tmp2;
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    buf3 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    buf0 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1))  # alias
    buf1 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1), 16384)  # alias
    buf2 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(arg1_1.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(arg2_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()), c_void_p(buf3.data_ptr()))
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1, arg1_1, arg2_1]))

```
you can see that the add operation always adds a random value, see the ir code:

1. **ir_pre_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```
2. **ir_post_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store
```

From the ir code, you can see the buf3 always adds an empty buf2 which has never been written. The root cause is that there has a potential issue when doing the mutation for inplace add when its' input is a NopKernel.

After this PR, the ir will be like(**ir_pre_fusion.txt**):

```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf2']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf2']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,)), StarDep(name='buf2')]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
buf3.mutations = ['buf2']
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92247
Approved by: https://github.com/ngimel, https://github.com/desertfire, https://github.com/jansel
2023-01-29 05:35:21 +00:00
Edward Z. Yang
025ef99ddf Get rid of dedicated inductor dynamic_shapes config (#93076)
Instead, use Dynamo dynamic_shapes config

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93076
Approved by: https://github.com/voznesenskym
2023-01-27 02:58:16 +00:00
Edward Z. Yang
5e9fa0a8fc Mark crossvit_9_240 as passing dynamic=True (#92981)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92981
Approved by: https://github.com/Chillee
2023-01-26 13:05:37 +00:00
Michael Voznesensky
d322f82b05 Add @count util to torch, use it to track benchmark stats (#93013)
<img width="1333" alt="image" src="https://user-images.githubusercontent.com/4755252/214687911-f766f072-c162-4298-9aed-c889f1375336.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93013
Approved by: https://github.com/ezyang
2023-01-26 03:09:12 +00:00
Edward Z. Yang
2ee94633a1 Change ciflow/inductor to test inductor inference with dynamic shapes (#92771)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92771
Approved by: https://github.com/voznesenskym
2023-01-25 02:21:02 +00:00
Edward Z. Yang
f724ecbd52 Add dynamic shapes aot_eager to periodic (#92770)
This means it overlaps with ciflow/inductor, but I'm about
to change that soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92770
Approved by: https://github.com/voznesenskym, https://github.com/albanD, https://github.com/desertfire
2023-01-25 02:21:02 +00:00
Edward Z. Yang
fb46d3e138 Run all of the timm models shards in the periodic (#92900)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92900
Approved by: https://github.com/bdhirsh, https://github.com/atalman
2023-01-24 17:56:20 +00:00
Horace He
c0327eb463 Some more inductor fixes for symbolic shapes (#92867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92867
Approved by: https://github.com/ezyang
2023-01-24 15:05:46 +00:00
PyTorch MergeBot
2cf03bbbab Revert "Run all of the timm models shards in the periodic (#92743)"
This reverts commit de69cedf98.

Reverted https://github.com/pytorch/pytorch/pull/92743 on behalf of https://github.com/atalman due to This needs to be landed after https://github.com/pytorch/pytorch/pull/92845 and https://github.com/pytorch/pytorch/pull/92846 are landed
2023-01-23 23:44:09 +00:00
Fabio Rocha
a43b55e135 A few usability improvements for the dynamo benchmarks. (#92713)
--diff_main renamed to --diff-branch BRANCH and now works again
Summary table splits results per branch.
csv output now has column with branch name when run in this mode

Added --progress flag so you can track how many models are going to be
run.

Example output:
```
$ python benchmarks/dynamo/torchbench.py  --quiet --performance --backend inductor --float16 --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)   --filter 'alexnet|vgg16' --progress  --diff viable/strict
Running model 1/2
batch size: 1024
cuda eval  alexnet                             dynamo_bench_diff_branch   1.251x p=0.00
cuda eval  alexnet                             viable/strict              1.251x p=0.00
Running model 2/2
batch size: 128
cuda eval  vgg16                               dynamo_bench_diff_branch   1.344x p=0.00
cuda eval  vgg16                               viable/strict              1.342x p=0.00

Summary for tag=dynamo_bench_diff_branch:
speedup             gmean=1.30x mean=1.30x
abs_latency         gmean=24.09x mean=25.26x
compilation_latency mean=2.0 seconds
compression_ratio   mean=0.9x

Summary for tag=viable/strict:
speedup             gmean=1.30x mean=1.30x
abs_latency         gmean=24.11x mean=25.29x
compilation_latency mean=0.5 seconds
compression_ratio   mean=1.0x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92713
Approved by: https://github.com/jansel
2023-01-23 18:23:35 +00:00
Edward Z. Yang
4a3fb7bcbc Make CI_SKIPS into a consolidated dict (#92769)
This makes it easier to add more configurations without causing a
thicket of if statements selecting the correct variable.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92769
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
2023-01-23 14:57:18 +00:00
Edward Z. Yang
3cfd2fa1c7 Make --inductor imply --backend inductor (#92764)
This is to make some downstream code more uniform (can always ask args.backend for backend)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92764
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
2023-01-23 14:57:18 +00:00
Edward Z. Yang
c52567ec18 Switch CI exclusions to use exact match. (#92761)
Since the CI exclusions are hard-coded in our script, we might as well require them to match exactly. This solved some head scratching where I was like, "this model is not obviously excluded, why is it not showing up in CI."

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92761
Approved by: https://github.com/jansel
2023-01-22 17:10:20 +00:00
Edward Z. Yang
de69cedf98 Run all of the timm models shards in the periodic (#92743)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92743
Approved by: https://github.com/kit1980
2023-01-21 18:39:17 +00:00
Michael Voznesensky
5778c04a15 Add --timing flag, phase timing to @dynamo_timed (#92637)
Ex output:
```
 TIMING:
 entire_frame_compile:8.574629999999999
 backend_compile:5.26806
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92637
Approved by: https://github.com/ezyang
2023-01-21 10:52:13 +00:00
Edward Z. Yang
27bf879b8c Forward fix: restore sebotnet33ts_256 aot_eager skip (#92741)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92741
Approved by: https://github.com/kit1980
2023-01-21 08:10:23 +00:00
Edward Z. Yang
9ad0aca6e5 Update aot_eager CI failures (#92696)
Based on https://hud.pytorch.org/pr/92689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92696
Approved by: https://github.com/desertfire
2023-01-21 02:29:22 +00:00
PyTorch MergeBot
44132cc4b0 Revert "Add --timing flag, phase timing to @dynamo_timed (#92637)"
This reverts commit 773b513435.

Reverted https://github.com/pytorch/pytorch/pull/92637 on behalf of https://github.com/malfet due to Broke lint
2023-01-20 16:23:20 +00:00
Michael Voznesensky
773b513435 Add --timing flag, phase timing to @dynamo_timed (#92637)
Ex output:
```
 TIMING:
 entire_frame_compile:8.574629999999999
 backend_compile:5.26806
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92637
Approved by: https://github.com/ezyang
2023-01-20 05:01:21 +00:00
Edward Z. Yang
44e52ea514 Reenable mobilevit_s in CI, seems to pass (#92585)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92585
Approved by: https://github.com/Chillee
2023-01-19 15:24:45 +00:00
Edward Z. Yang
b92a7afed9 Reclassify some dynamic aot_eager failures as static failures (#92376)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92376
Approved by: https://github.com/Chillee
2023-01-18 19:27:11 +00:00
Wu, Chunyuan
3aa6cec18c [dynamo] exclude reset_rng_state when measure timing (#92237)
Fixes inductor performance regression on CPU: https://github.com/pytorch/torchdynamo/issues/2027, https://github.com/pytorch/torchdynamo/issues/2028 and https://github.com/pytorch/torchdynamo/issues/2029.
The details are explained here: https://github.com/pytorch/torchdynamo/issues/2028#issuecomment-1381496678.

### Performance

- Model: lennard_jones
- Machine: IceLake (32 cores per socket)
- Configuration: single instance, 32 cores per instance
- jemalloc and iomp enabled

```bash
python benchmarks/dynamo/torchbench.py  --inductor-settings --inductor --performance --float32 -dcpu -n5000  --no-skip --dashboard --only=lennard_jones --quiet
```

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 15">
<link id=Main-File rel=Main-File
href="file:///C:/Users/chunyuan/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
<link rel=File-List
href="file:///C:/Users/chunyuan/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">

</head>

<body link="#0563C1" vlink="#954F72">

Time before   regression | Time after regression | Time with this PR
-- | -- | --
0.00020483799744397402 | 0.0002818034990923479 | 0.00020241099991835654

</body>

</html>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92237
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-18 13:17:28 +00:00
Edward Z. Yang
fbbb19599a Update dynamic skips after #92076 (#92103)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92103
Approved by: https://github.com/voznesenskym, https://github.com/Chillee
2023-01-13 04:05:10 +00:00
Edward Z. Yang
74cbf058a5 Support --dynamic-ci-skips (#91893)
This makes it easier for us to run only the skipped benchmarks and
see if that actually started passing.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91893
Approved by: https://github.com/albanD
2023-01-11 20:02:58 +00:00
Edward Z. Yang
d24324bf1d s/INDCUTOR/INDUCTOR/ (#91885)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91885
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
2023-01-11 12:28:19 +00:00
Edward Z. Yang
56ed976edf hrnet_w18, tts_angular works with dynamic shapes (#91891)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91891
Approved by: https://github.com/voznesenskym
2023-01-11 11:40:16 +00:00
blzheng
0c1777acec Dynamo benchmark: add CPU specific changes (#88477)
This pr adds some CPU specific changes:

- Add support for IPEX backend
- https://github.com/pytorch/torchdynamo/issues/1618
- https://github.com/pytorch/torchdynamo/issues/1534
- Enable CPU launcher in runner.py.
- Fix the issue that some environment variables are not support on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88477
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-07 09:26:06 +00:00
Shunting Zhang
a5f32f8978 training support for dynamo+torchxla integration (#88449)
We've already shown some promising perf result by integrating dynamo with torchxla for inference. To provide consistent UX for training and for inference, in this PR we try to enable training for dynamo/torchxla.

Training is trickier than inference and we may not expect much perf gains since
1. in training case, torchxla only generate a single combined graph for fwd/bwd/optimizer while in `torchxla_trace_once` bridge we added in dynamo, due to how AOT_Autograd works, we will generate 3 graphs: one for forward, one for backward and one for the optimizer. XLA favors larger graph to do more optimizations.
2. in training case, tracing overhead can be overlapped with computation. Tracing overhead is not as a big deal for training as for inference. After all training cares more about throughput while inference cares more about latency.
3. in training case, people can increase batch size to 'mitigate' the tracing overhead. Increase batch size does not change tracing overhead, thus it shows like the tracing overhead 'per example' reduces.

But we still want to add training support to dynamo/torchxla to make the work complete.

We added '--iterations-per-run' argument to control how may iterations we do per measure/device sync. This is to understand the impact of item 2 above.

Results:

With '--iterations-per-run' equals to 1, here are the perf numbers:
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |             0.91   |                0.959    |
+-------------------------+--------------------+-------------------------+
| resnet50                |             0.917  |                0.932    |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |             0.912  |                0.905    |
+-------------------------+--------------------+-------------------------+
| alexnet                 |             1.038  |                0.974    |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |             0.881  |                0.835    |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |             0.903  |                0.931    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |             0.914  |                0.967    |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |             1.359  |                0.84     |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |             1.288  |                0.893    |
+-------------------------+--------------------+-------------------------+
| geomean                 |             1.0006 |                0.913794 |
+-------------------------+--------------------+-------------------------+
```

Overall it looks like graph break indeed cause perf loss. But for BERT_pytorch and timm_vision_transformer we still see perf gain. We need do more experiments with larger '--iterations-per-run'

NOTE:
In torchbench.py I added the following code to do a few workaround:
```
from myscripts import workaround # TODO will remove this line before landing
```

Here are the content of workaround.py:
```
import torch
from torch import nn
import os

# override max_pool2d with avg_pool2d
if os.environ.get("REPLACE_MAXPOOL", "0") == "1":
    torch.nn.MaxPool2d = torch.nn.AvgPool2d

```

It work around a few issues we found
1. MaxPool2d does not work for training in dynamo/torchxla: https://github.com/pytorch/torchdynamo/issues/1837 . WIP fix from Brian in https://github.com/pytorch/pytorch/pull/90226 , https://github.com/pytorch/xla/pull/4276/files (WIP)
2. recent change ( this PR https://github.com/pytorch/pytorch/pull/88697 ) in op decomposition cause batch_norm ops to fallback in torchxla. Fix from jack in https://github.com/pytorch/xla/pull/4282#event-7969608134 . (confirmed the fix after adding Deduper to handle duplicated return from fx graph generated by AOTAutograd)
3. we have issue to handle dropout because of random seed out of sync issue. Here is the fix: https://github.com/pytorch/xla/pull/4293 (confirmed the fix)

Example command:
```
REPLACE_MAXPOOL=1 USE_FAKE_TENSOR=0 GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only vgg16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88449
Approved by: https://github.com/wconstab, https://github.com/qihqi, https://github.com/malfet
2023-01-05 19:59:34 +00:00
Bin Bao
6bf0e3b697 [inductor] Check for BackendCompilerFailed on CI (#91634)
Summary: https://github.com/pytorch/pytorch/pull/91283/ skips certain
random triton failure on CI, but we need to check against the
BackendCompilerFailed exception type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91634
Approved by: https://github.com/ngimel
2023-01-03 22:38:29 +00:00
Animesh Jain
a32916190d buck-related minifier work (#91215)
Summary: Extending the minifier to generate buck target

Test Plan: N/A

Differential Revision: D42173893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91215
Approved by: https://github.com/bertmaher, https://github.com/ngimel
2022-12-22 19:33:50 +00:00
Bin Bao
07c61685c8 [inductor] CI improvments (#91283)
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps to
eliminate the eager_variance failures seen on CI
2) Skip Triton failure instead of retry
3) Some minor script cleanup is also included in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91283
Approved by: https://github.com/anijain2305
2022-12-22 15:37:43 +00:00
Michael Lazos
2f5759eaba Disable non-deterministic models for optimizers (#91149)
These two models are non-deterministic even with constant inputs + weights and sometimes fail with variations between the fp64 and fp32 models in CI very rarely as a result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91149
Approved by: https://github.com/desertfire
2022-12-20 20:19:54 +00:00
Bin Bao
84e73e1269 [inductor] small CI improvements (#91140)
Summary: 1) Increase timm_model download retry times; 2) Skip certain
random triton failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91140
Approved by: https://github.com/williamwen42
2022-12-20 17:26:12 +00:00
Michael Lazos
07c340bb2a Remove debug code (#91148)
Removes some debug code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91148
Approved by: https://github.com/desertfire, https://github.com/williamwen42
2022-12-20 15:00:55 +00:00
Bin Bao
2a37ba8e81 [inductor] Add retry after benchmark test fails on CI (#90808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90808
Approved by: https://github.com/malfet
2022-12-19 18:10:55 +00:00
Michael Lazos
1accd915a4 Re-enable optimizers (#90709)
Fixes
https://github.com/pytorch/pytorch/issues/90165
https://github.com/pytorch/torchdynamo/issues/328

Re-enables optimizer capture + compilation now that the dynamo slowdowns have been fixed

and it has speedups, numbers to come soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90709
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/yanboliang
2022-12-19 04:07:41 +00:00
Edward Z. Yang
212873c615 Add dynamic shapes benchmark accuracy to CI (#90444)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90444
Approved by: https://github.com/voznesenskym
2022-12-17 11:17:20 +00:00
PyTorch MergeBot
e2377c8300 Revert "Add dynamic shapes benchmark accuracy to CI (#90444)"
This reverts commit 85db031e60.

Reverted https://github.com/pytorch/pytorch/pull/90444 on behalf of https://github.com/ezyang due to lint failing
2022-12-17 07:18:07 +00:00
Edward Z. Yang
85db031e60 Add dynamic shapes benchmark accuracy to CI (#90444)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90444
Approved by: https://github.com/voznesenskym
2022-12-17 06:39:45 +00:00