476 Commits

Author SHA1 Message Date
Katarzyna Fojcik
edaf9ddeb5 Add basic Gaudi support to benchmarks/dynamo (#145920)
This PR adds basic Gaudi support to benchmarks/dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920
Approved by: https://github.com/eellison
2025-02-26 14:50:22 +00:00
eqy
718cf68aee [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
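
To make the last point concrete, here is a minimal sketch that totals the workspace requested by a `CUBLAS_WORKSPACE_CONFIG`-style value. It assumes the `:SIZE_KiB:COUNT` pair format that cuBLAS documents (for example `:4096:8`); `parse_workspace_config` is a hypothetical helper for illustration, not the parser PyTorch uses.

```python
import re


def parse_workspace_config(value: str) -> int:
    """Return the total bytes requested by ':SIZE_KiB:COUNT[:SIZE_KiB:COUNT...]'.

    Assumption: sizes are given in KiB and each pair means COUNT buffers
    of SIZE_KiB each, as in the cuBLAS documentation.
    """
    pairs = re.findall(r":(\d+):(\d+)", value)
    if not pairs:
        raise ValueError(f"unrecognized workspace config: {value!r}")
    return sum(int(size_kib) * 1024 * int(count) for size_kib, count in pairs)


if __name__ == "__main__":
    # Eight buffers of 4096 KiB each.
    print(parse_workspace_config(":4096:8"))
```

With a single variable, the same value would describe the workspace for both `cuBLAS` and `cuBLASLt` calls.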

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-23 22:01:39 +00:00
Aaron Orenstein
086d146f6f Update ruff linter for PEP585 (#147540)
This turns on PEP585 enforcement in RUFF.

- Updates the target python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-22 04:45:17 +00:00
Animesh Jain
71484a2106 [pt2-benchmarks] Compiler reset on every run (#147313)
Internal benchmarks call `run` in a loop. Compiler reset gives a clean env

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313
Approved by: https://github.com/jansel
2025-02-18 02:09:19 +00:00
PyTorch MergeBot
80a1696679 Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 5f0901e573.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))
2025-02-07 21:04:23 +00:00
eqy
5f0901e573 [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-06 05:57:33 +00:00
Katarzyna Fojcik
9da376daa6 Add retain-output argument (#145921)
This PR adds a `retain-output` argument, which appends to the output file if it already exists instead of deleting it and creating a new one.
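
The behavior can be sketched as follows. This is an illustration only: `open_output` and `write_row` are hypothetical helpers, not the functions in the benchmark runner, and the sketch assumes the default is to delete any existing file before writing.

```python
import os
import tempfile


def open_output(path: str, retain_output: bool):
    """Open the results file, keeping earlier rows only when retain_output is set."""
    if not retain_output and os.path.exists(path):
        os.remove(path)  # assumed default: start from a fresh file
    return open(path, "a", encoding="utf-8")


def write_row(path: str, row: str, retain_output: bool) -> None:
    """Append one row to the results file."""
    with open_output(path, retain_output) as f:
        f.write(row + "\n")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "results.csv")
    write_row(path, "first run", retain_output=False)
    write_row(path, "second run", retain_output=True)  # appended, not overwritten
    with open(path, encoding="utf-8") as f:
        print(f.read())
```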

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145921
Approved by: https://github.com/jansel
2025-02-05 19:45:09 +00:00
Justin Chu
9756c7d788 [benchmark] Remove ONNX (#146325)
The ONNX exporter experiments in the benchmark are obsolete and unmaintained. This PR removes them to unblock https://github.com/pytorch/pytorch/pull/146003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146325
Approved by: https://github.com/titaiwangms
2025-02-04 04:02:47 +00:00
Huy Do
f38d5b4a74 Update TorchBench commit to main (#145455)
I'm adding sam2 to TorchBench https://github.com/pytorch/benchmark/issues/2566, so, as part of that, I'm updating PyTorch CI to use latest TorchBench commit.

The corresponding change from TorchBench is https://github.com/pytorch/benchmark/pull/2584

The main thing to call out is that the newer transformers version added by https://github.com/pytorch/benchmark/pull/2488 regresses several models. This needs to be investigated further, so I pin the version to unblock this change.

* `hf_Roberta_base` is a new model added by https://github.com/pytorch/benchmark/pull/2279; not sure why it fails accuracy on A10G, but it works fine on A100
* `speech_transformer` failures are pre-existing trunk failures, i.e. https://github.com/pytorch/pytorch/actions/runs/13040114684/job/36380989702#step:22:2408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145455
Approved by: https://github.com/kit1980
2025-02-01 06:44:26 +00:00
IvanKobzarev
894ef8c1e3 [torchbench] Inductor freezing bfloat16 conv folding needs high tolerance (#145623)
Issue:
https://github.com/pytorch/pytorch/issues/144888

The TorchBench run of the timm lcnet_050 model fails accuracy with `--freezing` `--inference` `--bfloat16`:
`res_error==0.12`
With inductor convolution constant folding turned off: `res_error==0.016`

`float16 error ~ 0.00669`
`float16 without conv folding ~ 0.0018`

Convolution folding increases the error by almost one order of magnitude.

I think we should revisit this and try to improve the accuracy of conv folding.
For example, by doing conv folding at compilation time with float64?

For now, I am adding counters to detect whether convolution folding happened and, in the case of bfloat16 with conv folding, increasing the multiplier to the max level (10) so that the accuracy test passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145623
Approved by: https://github.com/eellison
2025-01-30 12:46:35 +00:00
angelayi
72699950b0 Copy model before benchmark warmup runs (#145858)
Fixes https://github.com/pytorch/pytorch/issues/144772

The eager warmup runs cause the model to change state, so that when we export it later, the model differs from the one we would export directly out of the box. For some reason, exporting the model with the changed state causes issues, but exporting the initial model is ok. This is why the accuracy checks pass but the performance check fails when exporting.
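
A minimal sketch of the idea, using a toy stand-in rather than a real `torch.nn.Module`: warm up on a deep copy so the original object keeps its initial state. `ToyModel` and `warmup_on_copy` are hypothetical names used only for illustration.

```python
import copy


class ToyModel:
    """Stand-in for a model whose forward pass mutates internal state."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1  # state change caused by running the model
        return x * 2


def warmup_on_copy(model, example_input, iters=3):
    """Run warmup iterations on a deep copy and return the untouched original."""
    warmup_model = copy.deepcopy(model)
    for _ in range(iters):
        warmup_model(example_input)
    return model


if __name__ == "__main__":
    model = ToyModel()
    warmup_on_copy(model, 1)
    print(model.calls)  # the original was never called
```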

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145858
Approved by: https://github.com/desertfire
2025-01-30 00:36:33 +00:00
Simon Fan
e02c038a23 [dynamo][benchmarks] Stop benchmarking compile time of dead code (#145590)
FIXES https://github.com/pytorch/pytorch/issues/144775 frfr

See details on the problem: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2611699385
We fixed some silent incorrectness, but it results in fewer nodes being DCE'd. The benchmark iteration loop had some dead code which could contain side effect ops that aren't safe to DCE. The regression is expected.

This PR removes the compile time benchmarking of the dead code, which should reduce the noise of the benchmark and aligns with the benchmarking used by performance tests

New benchmark results:
```python
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,39.322364  # after https://github.com/pytorch/pytorch/pull/144319
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,38.972257  # before https://github.com/pytorch/pytorch/pull/144319
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145590
Approved by: https://github.com/jansel
ghstack dependencies: #145447
2025-01-29 22:14:47 +00:00
Xu Zhao
991a4b5925 [dynamo] Add --profile-details and --export-perfdoctor option (#144751)
Summary:
Add `--profile-details` option to add shapes and other details to the Kineto profile.

Add `--export-perfdoctor` to directly dump trace to perfdoctor for webview.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench_internal -- --only mrs_video_watch_over --performance --training --amp --export-profiler-trace --backend=inductor --profile-details --export-perfdoctor
```

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/pyper_traces/tree/traces/test/inductor_mrs_video_watch_over_rank_0_20250113_173817_6535183793.json.gz

Differential Revision: D68134547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144751
Approved by: https://github.com/drisspg
2025-01-23 19:09:40 +00:00
Simon Fan
34b8d8b0c0 update compile time benchmarks to dump compile times to stdout and csv (#145447)
```python
# inductor.csv
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186
```

```python
loading model: 0it [01:27, ?it/s]
cuda eval  cait_m36_384
Compilation time (from dynamo_timed): 87.705186276  # <----------------
pass
TIMING: _recursive_pre_grad_passes:0.11023 pad_mm_benchmark:0.50341 _recursive_joint_graph_passes:3.88557 _recursive_post_grad_passes:6.71182 async_compile.wait:4.16914 code_gen:17.57586 inductor_compile:42.55769 backend_compile:72.47122 entire_frame_compile:87.70519 gc:0.00112 total_wall_time:87.70519
STATS: call_* op count: 2510 | FakeTensorMode.__torch_dispatch__:101743 | FakeTensor.__torch_dispatch__:12959 | ProxyTorchDispatchMode.__torch_dispatch__:41079
Dynamo produced 1 graphs covering 2510 ops with 0 graph breaks (0 unique)
```
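
As a usage sketch, the new `compilation_latency` column can be read back with the standard `csv` module. The sample text is the CSV shown above; `compile_times` is a hypothetical helper, not part of the benchmark scripts.

```python
import csv
import io

# Header and row copied from the inductor.csv sample above.
SAMPLE = (
    "dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,"
    "unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,"
    "compilation_latency\n"
    "cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186\n"
)


def compile_times(csv_text: str) -> dict:
    """Map each model name to its compilation latency in seconds."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: float(row["compilation_latency"]) for row in reader}


if __name__ == "__main__":
    print(compile_times(SAMPLE))
```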

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145447
Approved by: https://github.com/ezyang
2025-01-23 18:49:19 +00:00
Aaron Orenstein
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.
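
For readers unfamiliar with PEP585, the sketch below shows the kind of rewrite involved. Both functions are made-up examples, not code from the tree; the built-in generic form needs Python 3.9 or newer, which matches the `target-version` bump above.

```python
from typing import Dict, List  # deprecated aliases, imported only for comparison


def count_old(words: List[str]) -> Dict[str, int]:
    """Pre-PEP585 annotations using typing.List / typing.Dict."""
    counts: Dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_new(words: list[str]) -> dict[str, int]:
    """Same function with built-in generics, as UP006 prefers."""
    counts: dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_new(["a", "b", "a"]))
```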

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
Xuehai Pan
dcc3cf7066 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
The fixes are generated by:

```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```
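
A small made-up example of the kind of change this produces: the expression inside the f-string gains whitespace around the operator, and the resulting string is unchanged.

```python
width, height = 3, 4

# Before: no whitespace around the operator inside the f-string expression.
before = f"area={width*height}"

# After: whitespace added around the operator, as E226 asks for.
after = f"area={width * height}"

if __name__ == "__main__":
    print(before == after)
```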

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
bobrenjc93
fcf9dc3b11 Migrate from Tuple -> tuple in benchmarks (#144259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144259
Approved by: https://github.com/yanboliang
2025-01-07 04:09:52 +00:00
Yanan Cao (PyTorch)
0666347fc4 [Codemod][AddExplicitStrictExportArg] caffe2/benchmarks/dynamo (#143686)
Reviewed By: avikchaudhuri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143686
Approved by: https://github.com/tugsbayasgalan
2024-12-21 19:56:56 +00:00
Huy Do
fe0f20615c [DynamoBench] Handle accuracy results in benchmark records (#143611)
I discovered this issue when trying to search for the accuracy results in the database and couldn't find any.  It turns out that the results are there in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because the benchmark values are a list of strings here while the expectation is that they are a list of numbers.

ClickHouse doesn't support mixed types atm. It has a Variant type https://clickhouse.com/docs/en/sql-reference/data-types/variant, but this isn't recommended by the CH team themselves.  So, the remaining option is to store this in the `extra_info` field.  This field is a dictionary, so it can go there.
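
A rough sketch of the approach, under stated assumptions: `split_metric_values` is a hypothetical helper, and both the placement of `extra_info` inside the metric dictionary and the key used within it are guesses made for illustration, not the layout the uploader actually uses.

```python
def split_metric_values(metric: dict) -> dict:
    """Return a copy of `metric` with non-numeric benchmark values moved aside.

    Assumption: non-numeric values are stored under
    metric["extra_info"]["benchmark_values"]; the real schema may differ.
    """
    numeric, other = [], []
    for value in metric.get("benchmark_values", []):
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        (numeric if is_number else other).append(value)

    result = dict(metric)
    result["benchmark_values"] = numeric
    if other:
        extra = dict(metric.get("extra_info", {}))
        extra["benchmark_values"] = other
        result["extra_info"] = extra
    return result


if __name__ == "__main__":
    record = {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}
    print(split_metric_values(record))
```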

### Testing

https://github.com/pytorch/pytorch/actions/runs/12421747715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
2024-12-20 06:43:38 +00:00
Tom Ritchford
498a7808ff Fix unused Python variables outside torch/ and test/ (#136359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
Edward Z. Yang
c29b4edbb9 Remove no-op aot_compilation_time (#142490)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142490
Approved by: https://github.com/xuzhao9
2024-12-11 10:37:25 +00:00
Huy Do
b5db3cb61c Skip uploading benchmark records when there is no model name (#141145)
A small fix I realized was needed after https://github.com/pytorch/pytorch/pull/141087.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141145
Approved by: https://github.com/malfet
2024-11-20 19:05:47 +00:00
Huy Do
4acd56eb53 Upload MPS benchmark results (#141087)
This uploads the MPS benchmark results to benchmark database.  The data can then be queried, for example:

```
select benchmark, model, metric from oss_ci_benchmark_v3 where head_sha = '99a133116fee15aa1467165f2b209b37da53f189' and metric.name in ['eager_peak_mem', 'dynamo_peak_mem', 'speedup'] and model.name = 'BERT_pytorch'
```

I'm documenting the JSON format at https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database

### Testing

Locally,

```
PYTHONPATH=/Users/huydo/Storage/mine/benchmark python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --training --devices mps --output test/test-reports/torchbench_training.csv
```

Workflow dispatch https://github.com/pytorch/pytorch/actions/runs/11927990520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141087
Approved by: https://github.com/malfet
2024-11-20 18:18:21 +00:00
angelayi
878a849c92 [aoti] Remove example inputs from aoti_compile_and_package (#140991)
Differential Revision: [D66136724](https://our.internmc.facebook.com/intern/diff/D66136724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140991
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #140990
2024-11-20 02:49:47 +00:00
Bin Bao
740054ffe6 [AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland https://github.com/pytorch/pytorch/pull/139154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597
Approved by: https://github.com/angelayi
2024-11-04 18:53:17 +00:00
PyTorch MergeBot
709752e0bb Revert "[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)"
This reverts commit 293fbb42d2.

Reverted https://github.com/pytorch/pytorch/pull/139154 on behalf of https://github.com/desertfire due to cpu_aot_inductor_amp_freezing fails ([comment](https://github.com/pytorch/pytorch/pull/139154#issuecomment-2452983651))
2024-11-02 13:04:00 +00:00
Bin Bao
293fbb42d2 [AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139154
Approved by: https://github.com/angelayi
ghstack dependencies: #139153
2024-11-02 03:10:05 +00:00
Edward Z. Yang
5c3ba6faff Add fbscribelogger to Dynamo benchmark runner (#137867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867
Approved by: https://github.com/bobrenjc93
2024-10-15 04:36:41 +00:00
Xuehai Pan
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
Igor Sugak
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate the PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop.
In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
leslie-fang-intel
8072ebc36c SKIP llama for dynamic size testing (#135960)
Running Torchbench llama with dynamic size failed with
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip this model for marking dynamic dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
2024-09-15 00:06:49 +00:00
Pian Pawakapan
b897ab0540 [export] ignore mark_dynamic() in export (#135536)
Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.

Note: there are 4 decorators relevant to export: `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided not to warn and to just silently ignore is that these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
2024-09-12 21:22:19 +00:00
zengxian
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380, adjust torchbench and huggingface skip models list, then we can remove `--no-skip` when running benchmarks on 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
Bin Bao
387d3fc296 [AOTI] Switch benchmarking to use export non-strict mode (#130977)
Summary: Switch the export part used by AOTInductor benchmarking from strict to non-strict, and switch it from producing torch IR to aten IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130977
Approved by: https://github.com/angelayi
ghstack dependencies: #134639
2024-08-29 16:08:52 +00:00
Nikita Shulga
5f0bd98767 Increase max total number of dynamo partitions to 15 (#134153)
Needed to be able to split some of the aarch64 workflows to 15 shards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134153
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-21 23:10:12 +00:00
Bin Bao
5d5a45dc85 [CI][dashboard] Collect Export pass rate separately (#134076)
Summary: Collect Export pass rate separately when running AOTInductor, so that we can have a better isolated signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134076
Approved by: https://github.com/angelayi
2024-08-21 21:18:55 +00:00
leslie-fang-intel
ac960dced1 Skip Reformer for Dynamic size testing (#132468)
**Summary**

As discussed in https://github.com/pytorch/pytorch/issues/132286, `Reformer` specializes the batch size dim, which makes the `mark_dynamic` API fail: 3a355c1891/torch/_dynamo/decorators.py (L228-L230)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132468
Approved by: https://github.com/ezyang
2024-08-08 08:25:53 +00:00
HDCharles
374747818d Run performance test non-alternately (#131935)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
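
The problem and the fix can be illustrated with a toy simulation. Nothing below is the real benchmark code: the model is a plain dictionary, and `quantize_in_place`, `alternating`, and `non_alternating` are hypothetical stand-ins for a backend that mutates the model and for the two scheduling strategies.

```python
def quantize_in_place(model: dict) -> dict:
    """Stand-in for a backend that modifies the model it is given."""
    model["quantized"] = True
    return model


def run(model: dict) -> str:
    """Report which variant of the model actually executed."""
    return "quantized" if model.get("quantized") else "baseline"


def alternating(model: dict, iters: int) -> list:
    """Interleave baseline and test runs; each entry is (intended, actual)."""
    log = []
    for _ in range(iters):
        log.append(("baseline", run(model)))
        log.append(("test", run(quantize_in_place(model))))
    return log


def non_alternating(model: dict, iters: int) -> list:
    """Finish all baseline runs before the backend touches the model."""
    log = [("baseline", run(model)) for _ in range(iters)]
    test_model = quantize_in_place(model)
    log += [("test", run(test_model)) for _ in range(iters)]
    return log


if __name__ == "__main__":
    print(alternating({}, 2))      # second baseline run is already quantized
    print(non_alternating({}, 2))  # baseline runs stay clean
```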

other changes:

need to add `torch.compiler.cudagraph_mark_step_begin()` to avoid the slowdown from `# Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards`

also updated the torchao APIs to the current versions

X-link: https://github.com/pytorch/benchmark/pull/2394

Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune

(should all be ~1.0)
0.997x
1.006x
0.994x

Reviewed By: xuzhao9

Differential Revision: D60252821

Pulled By: HDCharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
2024-08-08 00:23:20 +00:00
Justin Chu
6966d44eda [ONNX] Rename _internal/exporter to _exporter_legacy (#132429)
The next PR will be creating an `exporter` directory to house logic from `torch-onnx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132429
Approved by: https://github.com/titaiwangms
2024-08-03 04:23:05 +00:00
Sergii Dymchenko
da1a1fa55f Move load_yaml_file to common (#131924)
This is for https://github.com/pytorch/pytorch/pull/131724 and future timm_models.py refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131924
Approved by: https://github.com/shunting314, https://github.com/huydhn
2024-07-26 19:47:52 +00:00
Justin Chu
9db567f17d [ONNX] Set dump_exported_program to True in bench (#131670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131670
Approved by: https://github.com/titaiwangms
2024-07-24 20:02:03 +00:00
Xuehai Pan
c0ed38e644 [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754
Approved by: https://github.com/ezyang
2024-07-17 14:34:42 +00:00
Xu Zhao
1d8baa4df2 [torchbench][servicelab] Fix servicelab test failures (#130781)
Fix servicelab test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130781
Approved by: https://github.com/desertfire
2024-07-16 17:35:13 +00:00
Xu Zhao
213685ba97 [torchao][pt2 benchmark runner] Run performance test non-alternately (#130136)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```

```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```

Differential Revision: D59332736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
2024-07-16 13:38:17 +00:00
titaiwangms
18418a7dbb [ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586)
The ONNX related compilers have another route of accuracy check, and this PR brings torch_onnx compiler to the right measurement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586
Approved by: https://github.com/justinchuby
2024-07-12 15:47:59 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Shunting Zhang
c0735a3dd3 [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches the fixes for a few accuracy failures during training by raising the tolerance. I do that only for models whose failures I think are not due to a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.
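
To see why a higher tolerance helps, the sketch below replays the logged numbers through an RMSE check of the form `res_error <= multiplier * ref_error + tol / 10`. That form is an assumption about how `torch._dynamo.utils.same` compares the two errors and may not match the current source; `passes_rmse_check` and `min_passing_tol` are hypothetical helpers.

```python
def passes_rmse_check(res_error: float, ref_error: float,
                      multiplier: float, tol: float) -> bool:
    """Assumed form of the check: the result error may exceed the reference
    error by a multiplicative factor plus a small tolerance-based slack."""
    return res_error <= multiplier * ref_error + tol / 10.0


def min_passing_tol(res_error: float, ref_error: float, multiplier: float) -> float:
    """Smallest tolerance at which the assumed check would pass."""
    return max(0.0, (res_error - multiplier * ref_error) * 10.0)


if __name__ == "__main__":
    # Values taken from the sebotnet33ts_256 log line above.
    res, ref, mult, tol = 0.09441, 0.02971, 3.0, 0.04
    print(passes_rmse_check(res, ref, mult, tol))
    print(min_passing_tol(res, ref, mult))
```

Under this assumed form, the logged failure is a near miss: the threshold is exceeded by only a small margin, so a modest increase in `tol` is enough.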

## DebertaForQuestionAnswering

This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

0.02 tolerance should suppress this error.

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.

## mobilenetv3_large_100
Fails in max-autotune (MA) mode. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.

## yolov3

Fails on the dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-05 10:26:39 +00:00
PyTorch MergeBot
fa3953a2e1 Revert "[pt2-bench] fix accuracy failure for a few models (#129941)"
This reverts commit dafbd603ee.

Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
titaiwangms
bffb278700 [ONNX] Add artifacts_dir to torch-onnx-patch in benchmark (#130069)
Add `artifacts_dir` to torch-onnx-patch to save error report for debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130069
Approved by: https://github.com/justinchuby
2024-07-04 07:11:02 +00:00
Shunting Zhang
dafbd603ee [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising the tolerance. I do that only for models whose failures I believe are not due to a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I cannot repro locally, but the log on the dashboard shows:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

A tolerance of 0.02 should suppress this error.
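
To see why 0.02 is enough, here is a minimal sketch of the arithmetic. It assumes the check has the form `res_error <= multiplier * ref_error + tol / 10`; that form is an assumption here and is not spelled out in this message, and `passes_rmse_check` is a hypothetical helper, not a function in `torch._dynamo.utils`.

```python
def passes_rmse_check(
    res_error: float, ref_error: float, multiplier: float, tol: float
) -> bool:
    """Hypothetical helper: assumed form of the RMSE accuracy check."""
    # Assumption: the compiled result may be `multiplier` times as far from the
    # fp64 reference as eager is, plus a small slack derived from the tolerance.
    return res_error <= multiplier * ref_error + tol / 10.0


# Values copied from the dashboard log above.
res_error, ref_error, multiplier = 0.01803, 0.00537, 3.0

print(passes_rmse_check(res_error, ref_error, multiplier, tol=0.01))  # False
print(passes_rmse_check(res_error, ref_error, multiplier, tol=0.02))  # True
```

Under that assumption the threshold moves from about 0.01711 to about 0.01811, just above the logged 0.01803.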

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.

## mobilenetv3_large_100
Fails in max-autotune (MA) mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.
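
A quick way to size the multiplier is to solve the check for it. The sketch below assumes the check has the form `res_error <= multiplier * ref_error + tol / 10` (an assumption, not stated in this message); `min_multiplier` is a hypothetical helper for illustration only.

```python
def min_multiplier(res_error: float, ref_error: float, tol: float) -> float:
    """Hypothetical helper: smallest multiplier that satisfies the assumed check."""
    # Rearranged from: res_error <= multiplier * ref_error + tol / 10
    return (res_error - tol / 10.0) / ref_error


# Values copied from the dashboard log above (scalar tensor, tol 0.04).
needed = min_multiplier(res_error=0.29754, ref_error=0.05205, tol=0.04)
print(f"multiplier must be at least {needed:.2f}")  # multiplier must be at least 5.64
```

With the logged multiplier of 3.0 the assumed threshold is roughly 0.160, well below the logged 0.29754, so for a tensor this small the multiplier has to grow rather than the tolerance alone.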

## yolov3

Fails on the dashboard with the error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fixed by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with the error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with the command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.
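
To compare several dashboard failures side by side, the RMSE lines can be parsed mechanically. This is a minimal sketch that assumes the line format shown in the logs above; `RMSE_LINE` and `parse_rmse_line` are hypothetical names, not part of the benchmark scripts.

```python
import math
import re

# Pattern for the RMSE line format seen in the dashboard logs above.
RMSE_LINE = re.compile(
    r"RMSE \(res-fp64\): (?P<res>[\d.]+), \(ref-fp64\): (?P<ref>[\d.]+) "
    r"and shape=torch\.Size\(\[(?P<shape>[^\]]*)\]\)\. "
    r"res\.dtype: (?P<dtype>[\w.]+), multiplier: (?P<mult>[\d.]+), "
    r"tol: (?P<tol>[\d.]+)"
)


def parse_rmse_line(line: str) -> dict:
    """Hypothetical helper: extract the numbers from one RMSE log line."""
    m = RMSE_LINE.search(line)
    if m is None:
        raise ValueError("not an RMSE log line")
    dims = [int(d) for d in m["shape"].split(",") if d.strip()]
    return {
        "res_error": float(m["res"]),
        "ref_error": float(m["ref"]),
        "numel": math.prod(dims),  # an empty shape (scalar) gives 1
        "dtype": m["dtype"],
        "multiplier": float(m["mult"]),
        "tol": float(m["tol"]),
    }


# The timm_efficientdet line from the log above.
line = (
    "E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] "
    "RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). "
    "res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000"
)
info = parse_rmse_line(line)
print(info)
```

The `numel` field makes it easy to spot the small tensors (here 2 elements) where the noise is expected to be high.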

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-04 01:14:29 +00:00