527 Commits

Author SHA1 Message Date
Xuehai Pan
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
Katarzyna Fojcik
edaf9ddeb5 Add basic Gaudi support to benchmarks/dynamo (#145920)
This PR adds basic Gaudi support to benchmarks/dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920
Approved by: https://github.com/eellison
2025-02-26 14:50:22 +00:00
eqy
718cf68aee [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests, but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
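
For a rough sense of what a given `CUBLAS_WORKSPACE_CONFIG` value asks for, here is a minimal sketch. It assumes the `:SIZE:COUNT` pair format from NVIDIA's cuBLAS documentation, with `SIZE` in KiB. The helper `parse_workspace_config` is hypothetical and is not how PyTorch itself reads the variable.

```python
def parse_workspace_config(value: str) -> int:
    """Return the total bytes described by a CUBLAS_WORKSPACE_CONFIG-style string.

    Assumed format: ":SIZE:COUNT[:SIZE:COUNT...]" with SIZE given in KiB.
    """
    fields = [field for field in value.split(":") if field]
    if not fields or len(fields) % 2 != 0:
        raise ValueError(f"expected SIZE:COUNT pairs, got {value!r}")
    total_bytes = 0
    # Walk the fields two at a time: (size in KiB, number of buffers).
    for size_kib, count in zip(fields[0::2], fields[1::2]):
        total_bytes += int(size_kib) * 1024 * int(count)
    return total_bytes


single = parse_workspace_config(":4096:8")       # 8 buffers of 4 MiB
mixed = parse_workspace_config(":4096:2:16:8")   # 2 x 4 MiB plus 8 x 16 KiB
```

With the unified workspace, the same value would govern both `cuBLAS` and `cuBLASLt`, so there is only one number to reason about.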

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-23 22:01:39 +00:00
Aaron Orenstein
086d146f6f Update ruff linter for PEP585 (#147540)
This turns on PEP585 enforcement in RUFF.

- Updates the target python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day
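
As an illustration of what UP006 enforces, here is a small sketch written in the PEP585 style (builtin generics instead of `typing.Dict` / `typing.List`). The function `count_words` is made up for this example and does not come from the tree; it needs Python 3.9+, matching the new target version.

```python
def count_words(words: list[str]) -> dict[str, int]:
    """Count occurrences, annotated with builtin generics (PEP585)."""
    counts: dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Before PEP585 this signature would have been spelled with
# typing.List[str] and typing.Dict[str, int].
word_counts = count_words(["a", "b", "a"])
```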

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-22 04:45:17 +00:00
Animesh Jain
71484a2106 [pt2-benchmarks] Compiler reset on every run (#147313)
Internal benchmarks call `run` in a loop. Compiler reset gives a clean env

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313
Approved by: https://github.com/jansel
2025-02-18 02:09:19 +00:00
PyTorch MergeBot
80a1696679 Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 5f0901e573.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))
2025-02-07 21:04:23 +00:00
eqy
5f0901e573 [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests, but it didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-06 05:57:33 +00:00
Katarzyna Fojcik
9da376daa6 Add retain-output argument (#145921)
This PR adds a retain-output argument, which enables appending to an existing output file instead of deleting it and creating a new one.
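
A minimal sketch of the behavior described above, assuming the runner clears the output file once at startup and then appends rows. The helpers `prepare_output` and `append_row` are hypothetical names, not the functions in the benchmark runner.

```python
import os
import tempfile


def prepare_output(path: str, retain_output: bool) -> None:
    """Delete an existing output file unless the caller asked to keep it."""
    if os.path.exists(path) and not retain_output:
        os.unlink(path)


def append_row(path: str, row: str) -> None:
    """Append one line to the output file, creating it if needed."""
    with open(path, "a") as handle:
        handle.write(row + "\n")


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "results.csv")
    append_row(out, "run1")

    # Second run with retain-output: the earlier row survives.
    prepare_output(out, retain_output=True)
    append_row(out, "run2")
    with open(out) as handle:
        retained = handle.read().splitlines()

    # Third run without it: the file starts from scratch.
    prepare_output(out, retain_output=False)
    append_row(out, "run3")
    with open(out) as handle:
        fresh = handle.read().splitlines()
```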

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145921
Approved by: https://github.com/jansel
2025-02-05 19:45:09 +00:00
Justin Chu
9756c7d788 [benchmark] Remove ONNX (#146325)
The ONNX exporter experiments in benchmark are obsolete and unmaintained. This PR removes them to unblock https://github.com/pytorch/pytorch/pull/146003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146325
Approved by: https://github.com/titaiwangms
2025-02-04 04:02:47 +00:00
Huy Do
f38d5b4a74 Update TorchBench commit to main (#145455)
I'm adding sam2 to TorchBench https://github.com/pytorch/benchmark/issues/2566, so, as part of that, I'm updating PyTorch CI to use latest TorchBench commit.

The corresponding change from TorchBench is https://github.com/pytorch/benchmark/pull/2584

The main thing to call out is that the newer transformers version added by https://github.com/pytorch/benchmark/pull/2488 is regressing several models. This needs to be investigated further, so I pin the version to unblock this change.

* `hf_Roberta_base` is a new model added by https://github.com/pytorch/benchmark/pull/2279; not sure why it fails accuracy on A10G, but it works fine on A100
* `speech_transformer` failures are pre-existing trunk failures, i.e. https://github.com/pytorch/pytorch/actions/runs/13040114684/job/36380989702#step:22:2408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145455
Approved by: https://github.com/kit1980
2025-02-01 06:44:26 +00:00
IvanKobzarev
894ef8c1e3 [torchbench] Inductor freezing bfloat16 conv folding needs high tolerance (#145623)
Issue:
https://github.com/pytorch/pytorch/issues/144888

The TorchBench run of the timm lcnet_050 model fails accuracy with `--freezing` `--inference` `--bfloat16`:
`res_error==0.12`
With inductor convolution constant folding turned off: `res_error==0.016`

`float16 error ~ 0.00669`
`float16 without conv folding ~ 0.0018`

Convolution folding increases the error by almost one order of magnitude.

I think we should revisit and try to do something to improve the accuracy for conv folding.
For example, doing conv folding at compilation time with float64?

At the moment I am adding counters to identify whether convolution folding happened and, in the case of bfloat16 with conv folding, increasing the multiplier to the max level (10) to pass the accuracy test.
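
The rule described above can be sketched roughly as follows. This is only an illustration: `pick_multiplier`, the string dtype argument, and the default of 3.0 are assumptions, and the real change reads inductor counters rather than taking a count as a parameter.

```python
MAX_MULTIPLIER = 10.0  # the "max level (10)" mentioned in the message


def pick_multiplier(dtype: str, conv_folding_count: int, default: float = 3.0) -> float:
    """Loosen the accuracy multiplier only when bfloat16 conv folding happened."""
    if dtype == "bfloat16" and conv_folding_count > 0:
        return MAX_MULTIPLIER
    return default


bf16_folded = pick_multiplier("bfloat16", conv_folding_count=2)
bf16_unfolded = pick_multiplier("bfloat16", conv_folding_count=0)
fp16_folded = pick_multiplier("float16", conv_folding_count=2)
```

Keying on the counter keeps the looser tolerance away from runs where no folding took place.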

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145623
Approved by: https://github.com/eellison
2025-01-30 12:46:35 +00:00
angelayi
72699950b0 Copy model before benchmark warmup runs (#145858)
Fixes https://github.com/pytorch/pytorch/issues/144772

The eager warmup runs cause the model to change state, so when we later export it, the model differs from the one we would export directly out of the box. For some reason exporting the model with the changed state causes issues, but exporting the initial model is OK. This is why the accuracy checks pass but the performance check fails when exporting.
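
The idea can be sketched without torch, assuming the fix amounts to warming up a deep copy so the original keeps its initial state. `CountingModel` and `warmup_on_copy` are hypothetical stand-ins for the real model and runner code.

```python
import copy


class CountingModel:
    """Toy stand-in for a model whose forward pass mutates internal state."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        return x * 2


def warmup_on_copy(model, example_input, iterations: int = 3):
    """Run warmup iterations on a deep copy and return that copy."""
    scratch = copy.deepcopy(model)
    for _ in range(iterations):
        scratch(example_input)
    return scratch


model = CountingModel()
scratch = warmup_on_copy(model, example_input=1)
# `model` is still in its out-of-the-box state and is the one to export.
```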

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145858
Approved by: https://github.com/desertfire
2025-01-30 00:36:33 +00:00
Simon Fan
e02c038a23 [dynamo][benchmarks] Stop benchmarking compile time of dead code (#145590)
FIXES https://github.com/pytorch/pytorch/issues/144775 frfr

See details on the problem: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2611699385
We fixed some silent incorrectness, but it results in fewer nodes being DCE'd. The benchmark iteration loop had some dead code that could contain side-effect ops that aren't safe to DCE. The regression is expected.

This PR removes the compile time benchmarking of the dead code, which should reduce the noise of the benchmark and align with the benchmarking used by performance tests

New benchmark results:
```python
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,39.322364  # after https://github.com/pytorch/pytorch/pull/144319
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,38.972257  # before https://github.com/pytorch/pytorch/pull/144319
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145590
Approved by: https://github.com/jansel
ghstack dependencies: #145447
2025-01-29 22:14:47 +00:00
Xu Zhao
991a4b5925 [dynamo] Add --profile-details and --export-perfdoctor option (#144751)
Summary:
Add `--profile-details` option to add shapes and other details to the Kineto profile.

Add `--export-perfdoctor` to directly dump trace to perfdoctor for webview.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench_internal -- --only mrs_video_watch_over --performance --training --amp --export-profiler-trace --backend=inductor --profile-details --export-perfdoctor
```

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/pyper_traces/tree/traces/test/inductor_mrs_video_watch_over_rank_0_20250113_173817_6535183793.json.gz

Differential Revision: D68134547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144751
Approved by: https://github.com/drisspg
2025-01-23 19:09:40 +00:00
Simon Fan
34b8d8b0c0 update compile time benchmarks to dump compile times to stdout and csv (#145447)
```python
# inductor.csv
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186
```

```python
loading model: 0it [01:27, ?it/s]
cuda eval  cait_m36_384
Compilation time (from dynamo_timed): 87.705186276  # <----------------
pass
TIMING: _recursive_pre_grad_passes:0.11023 pad_mm_benchmark:0.50341 _recursive_joint_graph_passes:3.88557 _recursive_post_grad_passes:6.71182 async_compile.wait:4.16914 code_gen:17.57586 inductor_compile:42.55769 backend_compile:72.47122 entire_frame_compile:87.70519 gc:0.00112 total_wall_time:87.70519
STATS: call_* op count: 2510 | FakeTensorMode.__torch_dispatch__:101743 | FakeTensor.__torch_dispatch__:12959 | ProxyTorchDispatchMode.__torch_dispatch__:41079
Dynamo produced 1 graphs covering 2510 ops with 0 graph breaks (0 unique)
```
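
To pull the new column back out of the CSV, a minimal sketch using only the standard library. The sample text is the row shown above; `compile_latencies` is a hypothetical helper, not part of the benchmark runner.

```python
import csv
import io

SAMPLE = (
    "dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,"
    "unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,"
    "compilation_latency\n"
    "cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186\n"
)


def compile_latencies(csv_text: str) -> dict[str, float]:
    """Map model name to compilation latency in seconds."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: float(row["compilation_latency"]) for row in reader}


latencies = compile_latencies(SAMPLE)
```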

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145447
Approved by: https://github.com/ezyang
2025-01-23 18:49:19 +00:00
Aaron Orenstein
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix, we first need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally, running `lintrunner -a --take RUFF` will fix up the deprecated uses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
Xuehai Pan
dcc3cf7066 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
The fixes are generated by:

```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
bobrenjc93
fcf9dc3b11 Migrate from Tuple -> tuple in benchmarks (#144259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144259
Approved by: https://github.com/yanboliang
2025-01-07 04:09:52 +00:00
Yanan Cao (PyTorch)
0666347fc4 [Codemod][AddExplicitStrictExportArg] caffe2/benchmarks/dynamo (#143686)
Reviewed By: avikchaudhuri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143686
Approved by: https://github.com/tugsbayasgalan
2024-12-21 19:56:56 +00:00
Huy Do
fe0f20615c [DynamoBench] Handle accuracy results in benchmark records (#143611)
I discovered this issue when trying to search for the accuracy results in the database and couldn't find any.  It turns out that the results are there in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because the benchmark values are a list of strings here, while the expectation is a list of numbers.

ClickHouse doesn't support mixed types at the moment. It has a Variant type https://clickhouse.com/docs/en/sql-reference/data-types/variant, but this isn't recommended by the CH team themselves.  So, the remaining option is to store this in the `extra_info` field.  This field is a dictionary, so the value can go there.
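
A rough sketch of that workaround, assuming `extra_info` lives next to `benchmark_values` inside the metric and that the non-numeric values are stored under a `benchmark_values` key there. Both details and the helper `normalize_metric` are assumptions for illustration, not the exact record layout.

```python
def is_number(value) -> bool:
    """True for ints and floats, but not for bools."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def normalize_metric(metric: dict) -> dict:
    """Keep numeric benchmark values and move everything else to extra_info."""
    values = metric.get("benchmark_values", [])
    numbers = [value for value in values if is_number(value)]
    others = [value for value in values if not is_number(value)]

    result = dict(metric)
    result["benchmark_values"] = numbers
    if others:
        extra_info = dict(metric.get("extra_info", {}))
        extra_info["benchmark_values"] = others
        result["extra_info"] = extra_info
    return result


accuracy = normalize_metric(
    {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}
)
speedup = normalize_metric({"name": "speedup", "benchmark_values": [1.25]})
```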

### Testing

https://github.com/pytorch/pytorch/actions/runs/12421747715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
2024-12-20 06:43:38 +00:00
Tom Ritchford
498a7808ff Fix unused Python variables outside torch/ and test/ (#136359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
Edward Z. Yang
c29b4edbb9 Remove no-op aot_compilation_time (#142490)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142490
Approved by: https://github.com/xuzhao9
2024-12-11 10:37:25 +00:00
Huy Do
b5db3cb61c Skip uploading benchmark records when there is no model name (#141145)
A small fix I just realized was needed after https://github.com/pytorch/pytorch/pull/141087.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141145
Approved by: https://github.com/malfet
2024-11-20 19:05:47 +00:00
Huy Do
4acd56eb53 Upload MPS benchmark results (#141087)
This uploads the MPS benchmark results to benchmark database.  The data can then be queried, for example:

```
select benchmark, model, metric from oss_ci_benchmark_v3 where head_sha = '99a133116fee15aa1467165f2b209b37da53f189' and metric.name in ['eager_peak_mem', 'dynamo_peak_mem', 'speedup'] and model.name = 'BERT_pytorch'
```

I'm documenting the JSON format at https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database

### Testing

Locally,

```
PYTHONPATH=/Users/huydo/Storage/mine/benchmark python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --training --devices mps --output test/test-reports/torchbench_training.csv
```

Workflow dispatch https://github.com/pytorch/pytorch/actions/runs/11927990520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141087
Approved by: https://github.com/malfet
2024-11-20 18:18:21 +00:00
angelayi
878a849c92 [aoti] Remove example inputs from aoti_compile_and_package (#140991)
Differential Revision: [D66136724](https://our.internmc.facebook.com/intern/diff/D66136724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140991
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #140990
2024-11-20 02:49:47 +00:00
Bin Bao
740054ffe6 [AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland https://github.com/pytorch/pytorch/pull/139154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597
Approved by: https://github.com/angelayi
2024-11-04 18:53:17 +00:00
PyTorch MergeBot
709752e0bb Revert "[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)"
This reverts commit 293fbb42d2.

Reverted https://github.com/pytorch/pytorch/pull/139154 on behalf of https://github.com/desertfire due to cpu_aot_inductor_amp_freezing fails ([comment](https://github.com/pytorch/pytorch/pull/139154#issuecomment-2452983651))
2024-11-02 13:04:00 +00:00
Bin Bao
293fbb42d2 [AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139154
Approved by: https://github.com/angelayi
ghstack dependencies: #139153
2024-11-02 03:10:05 +00:00
Edward Z. Yang
5c3ba6faff Add fbscribelogger to Dynamo benchmark runner (#137867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867
Approved by: https://github.com/bobrenjc93
2024-10-15 04:36:41 +00:00
Xuehai Pan
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
Igor Sugak
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop.
In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
leslie-fang-intel
8072ebc36c SKIP llama for dynamic size testing (#135960)
Running Torchbench llama with dynamic size failed with
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip this model for marking dynamic dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
2024-09-15 00:06:49 +00:00
Pian Pawakapan
b897ab0540 [export] ignore mark_dynamic() in export (#135536)
Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.

Note: there are 4 decorators relevant to export, `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is that these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
2024-09-12 21:22:19 +00:00
zengxian
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380: adjusts the torchbench and huggingface skip-model lists so that we can remove `--no-skip` when running benchmarks on the 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
Bin Bao
387d3fc296 [AOTI] Switch benchmarking to use export non-strict mode (#130977)
Summary: Switch the export part used by AOTInductor benchmarking from strict to non-strict, and switch it from producing torch IR to aten IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130977
Approved by: https://github.com/angelayi
ghstack dependencies: #134639
2024-08-29 16:08:52 +00:00
Nikita Shulga
5f0bd98767 Increase max total number of dynamo partitions to 15 (#134153)
Needed to be able to split some of the aarch64 workflows to 15 shards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134153
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-21 23:10:12 +00:00
Bin Bao
5d5a45dc85 [CI][dashboard] Collect Export pass rate separately (#134076)
Summary: Collect Export pass rate separately when running AOTInductor, so that we can have a better isolated signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134076
Approved by: https://github.com/angelayi
2024-08-21 21:18:55 +00:00
leslie-fang-intel
ac960dced1 Skip Reformer for Dynamic size testing (#132468)
**Summary**

As discussed in https://github.com/pytorch/pytorch/issues/132286, `Reformer` has specialized the batch size dim, which fails the `mark_dynamic` API: 3a355c1891/torch/_dynamo/decorators.py (L228-L230)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132468
Approved by: https://github.com/ezyang
2024-08-08 08:25:53 +00:00
HDCharles
374747818d Run performance test non-alternately (#131935)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
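
The difference in run order can be sketched as follows. This is a simplified illustration of the non-alternating idea only; the signature of `latency_experiment` here and the helper `time_calls` are assumptions, not the benchmark runner's actual API.

```python
import statistics
import time


def time_calls(fn, iterations: int) -> list[float]:
    """Time `iterations` consecutive calls of fn, in seconds."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def latency_experiment(baseline, candidate, iterations: int = 3):
    """Run all baseline iterations first, then all candidate iterations."""
    baseline_median = statistics.median(time_calls(baseline, iterations))
    candidate_median = statistics.median(time_calls(candidate, iterations))
    return baseline_median, candidate_median


call_order: list[str] = []
medians = latency_experiment(
    lambda: call_order.append("baseline"),
    lambda: call_order.append("candidate"),
    iterations=3,
)
```

Because every baseline call finishes before the candidate backend touches the model, an in-place change such as quantization cannot leak into the baseline numbers.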

Other changes:

Need to add `torch.compiler.cudagraph_mark_step_begin()` to avoid the slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards".

Also updated the torchao APIs to the current versions.

X-link: https://github.com/pytorch/benchmark/pull/2394

Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune

(should all be ~1.0)
0.997x
1.006x
0.994x

Reviewed By: xuzhao9

Differential Revision: D60252821

Pulled By: HDCharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
2024-08-08 00:23:20 +00:00
Justin Chu
6966d44eda [ONNX] Rename _internal/exporter to _exporter_legacy (#132429)
The next PR will be creating an `exporter` directory to house logic from `torch-onnx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132429
Approved by: https://github.com/titaiwangms
2024-08-03 04:23:05 +00:00
Sergii Dymchenko
da1a1fa55f Move load_yaml_file to common (#131924)
This is for https://github.com/pytorch/pytorch/pull/131724 and future timm_models.py refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131924
Approved by: https://github.com/shunting314, https://github.com/huydhn
2024-07-26 19:47:52 +00:00
Justin Chu
9db567f17d [ONNX] Set dump_exported_program to True in bench (#131670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131670
Approved by: https://github.com/titaiwangms
2024-07-24 20:02:03 +00:00
Xuehai Pan
c0ed38e644 [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754
Approved by: https://github.com/ezyang
2024-07-17 14:34:42 +00:00
Xu Zhao
1d8baa4df2 [torchbench][servicelab] Fix servicelab test failures (#130781)
Fix servicelab test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130781
Approved by: https://github.com/desertfire
2024-07-16 17:35:13 +00:00
Xu Zhao
213685ba97 [torchao][pt2 benchmark runner] Run performance test non-alternately (#130136)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```

```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```

Differential Revision: D59332736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
2024-07-16 13:38:17 +00:00
titaiwangms
18418a7dbb [ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586)
The ONNX related compilers have another route of accuracy check, and this PR brings torch_onnx compiler to the right measurement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586
Approved by: https://github.com/justinchuby
2024-07-12 15:47:59 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).
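
The safety point can also be checked directly. In this small sketch the name `dict` is shadowed inside each function; the function names are made up for illustration.

```python
import collections


def build_with_call():
    dict = collections.OrderedDict  # shadows the builtin inside this function
    return dict()  # resolves the shadowed name


def build_with_literal():
    dict = collections.OrderedDict  # same shadowing, but unused by the literal
    return {}  # always builds a builtin dict


from_call = build_with_call()
from_literal = build_with_literal()
```

The factory call follows whatever `dict` is bound to at call time, while the literal does not look the name up at all.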

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Shunting Zhang
c0735a3dd3 [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches the fixes for a few accuracy failures during training by raising the tolerance. I do that only for models that I think fail for reasons other than a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I cannot repro locally, but from the log on the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.
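
To see why a larger tolerance helps, here is a worked sketch using the logged numbers. It assumes the check compares the result error against `multiplier * ref_error + tol / 10`; that form is an assumption here, since the log only shows the inputs. The value 0.08 is purely illustrative and is not necessarily the tolerance this PR picks.

```python
def passes_accuracy(res_error: float, ref_error: float, multiplier: float, tol: float) -> bool:
    """Assumed form of the RMSE comparison behind the logged message."""
    return res_error <= multiplier * ref_error + tol / 10.0


# Values from the dashboard log for sebotnet33ts_256.
logged = passes_accuracy(res_error=0.09441, ref_error=0.02971, multiplier=3.0, tol=0.04)

# Same errors with an illustrative, larger tolerance.
relaxed = passes_accuracy(res_error=0.09441, ref_error=0.02971, multiplier=3.0, tol=0.08)
```

Under that assumption the logged run misses the threshold only narrowly (0.09441 against roughly 0.0931), which fits the view that the failure is noise rather than a real regression.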

## DebertaForQuestionAnswering

This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

A tolerance of 0.02 should suppress this error.

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.

## mobilenetv3_large_100
Fails in MA (max-autotune) mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use larger multiplier for smaller tensor in torch._dynamo.utils.same.

## yolov3

Fails on the dashboard with the error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with the error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot reproduce it locally with the command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-05 10:26:39 +00:00
PyTorch MergeBot
fa3953a2e1 Revert "[pt2-bench] fix accuracy failure for a few models (#129941)"
This reverts commit dafbd603ee.

Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
titaiwangms
bffb278700 [ONNX] Add artifacts_dir to torch-onnx-patch in benchmark (#130069)
Add `artifacts_dir` to torch-onnx-patch to save error report for debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130069
Approved by: https://github.com/justinchuby
2024-07-04 07:11:02 +00:00
Shunting Zhang
dafbd603ee [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising the tolerance. I do that only for models whose failures I believe are not caused by a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I cannot reproduce it locally, but according to the dashboard log:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot reproduce it locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

0.02 tolerance should suppress this error.
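
The logged numbers are consistent with a check of the form `res_error <= multiplier * ref_error + tol / 10`. The sketch below assumes that form; the helper name `passes_rmse_check` is hypothetical and is not taken from `torch._dynamo.utils`. Plugging in the values from the log line shows why a tolerance of 0.02 would be enough here.

```python
def passes_rmse_check(res_error: float, ref_error: float,
                      multiplier: float, tol: float) -> bool:
    """Assumed form of the check: the compiled result may be at most
    `multiplier` times as far from the fp64 baseline as eager is,
    plus a small slack derived from the tolerance."""
    return res_error <= multiplier * ref_error + tol / 10.0


# Values from the DebertaForQuestionAnswering log line.
res_error, ref_error, multiplier = 0.01803, 0.00537, 3.0

# tol=0.01 gives a bound of 0.01711, tol=0.02 gives a bound of 0.01811.
print(passes_rmse_check(res_error, ref_error, multiplier, tol=0.01))
print(passes_rmse_check(res_error, ref_error, multiplier, tol=0.02))
```

Under this assumption the first call reports a failure and the second a pass, which matches the dashboard behaviour described here.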

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot reproduce it locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.

## mobilenetv3_large_100

Fails in max-autotune (MA) mode. I cannot reproduce it locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.
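
A rough illustration of the idea, not the code in `torch._dynamo.utils.same`: choose the multiplier from the number of elements so that tiny tensors get more headroom. The thresholds, the multiplier values, and the bound `multiplier * ref_error + tol / 10` are all assumptions made for this sketch.

```python
def pick_multiplier(numel: int) -> float:
    """Hypothetical schedule: fewer elements means a noisier RMSE, so allow more slack."""
    if numel <= 10:
        return 10.0
    if numel <= 500:
        return 5.0
    return 3.0


def within_bound(res_error: float, ref_error: float, numel: int, tol: float) -> bool:
    # Assumed bound; the real check may differ in detail.
    return res_error <= pick_multiplier(numel) * ref_error + tol / 10.0


# mobilenetv3_large_100: shape torch.Size([]) is a scalar, i.e. one element.
print(pick_multiplier(1))
print(within_bound(0.29754, 0.05205, numel=1, tol=0.04))
```

With the fixed multiplier of 3.0 from the log, the same numbers give a bound of about 0.160, which the measured 0.29754 exceeds.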

## yolov3

Fails on the dashboard with the error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with the error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot reproduce it locally with the command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-04 01:14:29 +00:00
Xuehai Pan
4ee1cb9b95 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-30 01:36:07 +00:00
PyTorch MergeBot
2effbcfcd8 Revert "[BE][Easy] replace import pathlib with from pathlib import Path (#129426)"
This reverts commit 6d75604ef1.

Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))
2024-06-29 23:24:06 +00:00
Xuehai Pan
6d75604ef1 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-29 15:42:09 +00:00
Xu Zhao
474d743dba [torchao][benchmark] Skip all accuracy tests by returning pass_due_to_skip (#129545)
Summary: As the title says.

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --quantization noquant --inference --bfloat16 --accuracy
```

Differential Revision: D59040593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129545
Approved by: https://github.com/HDCharles
2024-06-26 14:21:53 +00:00
Weizhuo Zhang
53f462c506 Write dynamo benchmarks performance result to csv when throw exceptions (#126764)
**Performance mode issue**: When the performance warm-up of a dynamo benchmark fails, the result is not written to the csv file. In accuracy mode, however, the result is written as `fail_to_run` even when the dynamo pass fails. As a result, the number of models in the accuracy csv file does not match the number in the performance csv file.
![image](https://github.com/pytorch/pytorch/assets/84730719/9043d215-130b-46b4-a835-f148c225947c)

- **Fix**: Models whose warm-up failed are now recorded in the csv file, as shown in the following image:
![image](https://github.com/pytorch/pytorch/assets/84730719/7907a3c2-c942-42bb-b31c-55424a0e8117)

**Accuracy mode issue**: `detectron2_fasterrcnn_r` models fail in accuracy mode but pass in performance mode. The accuracy failure is the same as the one addressed in PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
  File "benchmarks/dynamo/torchbench.py", line 449, in <module>
    torchbench_main()
  File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
    process_entry(0, runner, original_dir, args)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
    return run(runner, args, original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
![image](https://github.com/pytorch/pytorch/assets/84730719/f25392f0-f982-46c8-8e2c-a8a25d85a21a)

- **Fix**: As in PR ee557d8f61, setting `batch_size` to 4 is skipped when testing dynamic shapes.

The dynamic shapes pass rate improved from 89% to **95%**:
| Comp Item | Compiler | Suite      | Before fix | After fix  |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |
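
The performance-mode fix described in this message can be pictured as follows. This is a minimal sketch, not the code in `benchmarks/dynamo/common.py`: `run_and_record`, the column names, and the `warmup_failed` status string are hypothetical. The only point is that a row is written whether or not warm-up raises, so the performance csv lists the same models as the accuracy csv.

```python
import csv
import io


def run_and_record(writer, model_name, warmup):
    """Run `warmup` and always emit exactly one csv row for `model_name`."""
    try:
        speedup = warmup()
    except Exception as exc:
        # Record the failure instead of silently dropping the model.
        writer.writerow([model_name, "warmup_failed", type(exc).__name__])
        return False
    writer.writerow([model_name, "pass", f"{speedup:.3f}"])
    return True


def failing_warmup():
    raise RuntimeError("compile failed")


buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "status", "detail"])
run_and_record(writer, "model_a", lambda: 1.25)
run_and_record(writer, "model_b", failing_warmup)
print(buf.getvalue())
```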

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
2024-06-25 17:49:04 +00:00
titaiwangms
0e1e289033 [ONNX] Benchmark refactored ONNX export (#129427)
Reuse torch.onnx.export with torch_onnx patch to test ExportedProgram -> ONNX IR exporter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129427
Approved by: https://github.com/justinchuby
2024-06-25 04:47:53 +00:00
Jason Ansel
bdc39eef3b [inductor] Add --inductor-config benchmark flag (#129034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129034
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129024, #129033
2024-06-21 16:53:42 +00:00
Simon Fan
123812790b [compiled autograd] update benchmarks to use cli flags for fullgraph/dynamic (#127960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127960
Approved by: https://github.com/jansel
2024-06-21 08:16:33 +00:00
Deng Weishi
b542825066 Enable deterministic support for oneDNN (#127277)
This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848.
As requested for the Torchbenchmark models, this PR enables the deterministic attribute for oneDNN operators on XPU backends, such as convolution, deconvolution, and matmul.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui
2024-06-21 05:21:24 +00:00
Animesh Jain
e4d8aa4d24 [torchbench] Enable some models with inline_inbuilt_nn_modules (#128315)
For all models, graph breaks/recompiles are reduced.
For drq, they increase, and the increase is legitimate.

Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128315
Approved by: https://github.com/jansel
2024-06-16 08:37:23 +00:00
Wu, Chunyuan
5ef70faaa7 Revert "Make torch_geometric models compatible with export (#123403)" (#128377)
This reverts commit d78991a738.

This PR reverts https://github.com/pytorch/pytorch/pull/123403 to fix the performance regression as discussed in https://github.com/pytorch/pytorch/issues/127513#issuecomment-2158835653.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128377
Approved by: https://github.com/jgong5, https://github.com/angelayi, https://github.com/desertfire
2024-06-12 14:53:01 +00:00
Xu Zhao
82d7a36a27 Added torchao nightly workflow (#128152)
Summary:
Add torchao benchmark workflow, upload the artifacts to GHA.

X-link: https://github.com/pytorch/benchmark/pull/2273

Test Plan:
```
python run_benchmark.py torchao --ci
```

Differential Revision: D58140479

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128152
Approved by: https://github.com/jerryzh168
2024-06-07 17:52:15 +00:00
Sun, Jiayi
2ff312359c skip hf_T5_generate in dynamic shape test (#121129)
As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` fails in dynamic shape testing. This PR skips the dynamic batch size testing of this model.

* Error msg is
```
  File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV
    guards = output_graph.shape_env.produce_guards(
  File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4).
```

* Root Cause is
This error happens while creating a guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked`
I ran it with `TORCH_LOGS="+dynamic"` and got this key line: `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)`
The reason for this error is that the batch dimension of `inputs_tensor` in the dynamic batch size test is marked as the dynamic shape `s0`, so the batch dimension of `scores`, which is produced by a series of operations on `inputs_tensor`, is also `s0`. However, the function that creates `attention_mask` runs in Python rather than in Dynamo, so the batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked`, which is produced by a series of operations on `attention_mask`, is also the real shape `4` rather than the dynamic shape `s0`. The line `scores += position_bias_masked` requires a guard that checks whether the batch dimension of `scores` always equals the batch dimension of `position_bias_masked`, i.e. `Eq(s0, 4)`, and that is where the error occurs.
So the root cause of this error is that the function that creates `attention_mask` runs in Python rather than in Dynamo. It is not traced by Dynamo because Dynamo has a graph break in this function (at this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error:
`torch._dynamo.exc.Unsupported: Tensor.item`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
2024-06-07 06:28:29 +00:00
eellison
93bfe57144 cudagraphs: fix backward hooks & fsdp interaction (#126914)
Fixes

> ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE

Error that would occur when composing pt2 fsdp and cudagraphs. Cudagraphs caches output tensor impls in the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations.

from code comment:
```
# this output represents a fresh allocated tensor.
# We return the same TensorImpl from run to run to avoid overhead.
# autograd.Function will reset the Autograd meta of output tensors
# as part of aot_autograd, but _backward_hooks are stored on tensors separately,
# so we need to manually reset hooks.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914
Approved by: https://github.com/awgu, https://github.com/xmfan
2024-05-28 22:07:41 +00:00
Xuehai Pan
26f4f10ac8 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
2024-05-27 14:49:57 +00:00
PyTorch MergeBot
55c0ab2887 Revert "[5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)"
This reverts commit 7763c83af6.

Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))
2024-05-27 09:22:08 +00:00
Xuehai Pan
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
Xuehai Pan
ba3b05fdf3 [1/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort stdlib (#127122)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
2024-05-25 08:25:50 +00:00
Yu, Guangye
c09205a057 Deprecate device-specific GradScaler autocast API (#126527)
# Motivation

## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.

So we intend to deprecate them and **strongly recommend** that developers use `torch.amp.GradScaler`.

## for `custom_fwd` and `custom_bwd`,
these are a good solution for making a custom function run with or without effect even in an autocast-enabled region, and they can be shared by other backends, such as CPU and XPU.
So we generalize them to be device-agnostic, put them into `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.

# Additional Context
Add a UT to cover the deprecation warning.
No more UTs are needed to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered `torch.cuda.amp.custom_f/bwd` already cover it.
To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
2024-05-25 06:41:34 +00:00
Xu Zhao
1e818db547 [torchbench] Fix torchao benchmarking script (#126736)
As the title says.

Test Plan:

```
python benchmarks/dynamo/torchbench.py --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory

cuda eval  BERT_pytorch
[XZ Debug] Torch grad status: False
memory: eager: 0.82 GB, dynamo: 0.92 GB, ratio: 0.89
running benchmark: 100%
1.001x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126736
Approved by: https://github.com/jerryzh168, https://github.com/huydhn
2024-05-21 23:15:12 +00:00
Xu Zhao
2068dadbe8 [torchbench] Add torchao to PT2 Benchmark Runner (#126469)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2268

Support torchao performance and accuracy tests in PT2 Benchmark Runner, using the inductor backend as the baseline.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory

loading model: 0it [00:50, ?it/s]
cuda eval  BERT_pytorch
memory: eager: 0.75 GB, dynamo: 0.75 GB, ratio: 1.00
running benchmark: 100%
1.003x
```

Reviewed By: jerryzh168

Differential Revision: D57463273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126469
Approved by: https://github.com/huydhn
2024-05-20 17:53:44 +00:00
Matthew Hoffman
81277baa0c Remove removed ruff rule TRY200 (#126256)
My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema.

From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/

> This rule has been removed and its documentation is only available for historical reasons.
>
> This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead.

and we are currently explicitly ignoring B904.
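
For context, B904 flags a `raise` inside an `except` block that does not chain the original exception. A minimal illustration follows; the `parse_port` function is made up for this example and is not from the PyTorch codebase.

```python
def parse_port(value: str) -> int:
    try:
        return int(value)
    except ValueError as err:
        # `from err` keeps the original exception as __cause__;
        # leaving it out is what B904 (and the removed TRY200) reports.
        raise RuntimeError(f"invalid port: {value!r}") from err


try:
    parse_port("http")
except RuntimeError as exc:
    # The chained cause is still available to handlers and tracebacks.
    print(type(exc.__cause__).__name__)
```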

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126256
Approved by: https://github.com/Skylion007
2024-05-17 16:31:05 +00:00
Stonepia
5756b53dd8 [XPU] call empty_cache for dynamo tests (#126377)
When running a batch of models, not calling `empty_cache()` can result in OOM for subsequent models.

This PR unifies the `empty_cache` call for both CUDA and XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126377
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-05-17 06:05:51 +00:00
Sam Larsen
c87c39d935 [benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (#125953)
Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953
Approved by: https://github.com/desertfire
ghstack dependencies: #125917
2024-05-15 05:32:06 +00:00
Sam Larsen
9f0d3f71c9 Adjust number of repeats when using --warm-start-latency benchmark flag (#125917)
Summary: In --warm-start-latency mode, we can just perform the cache-warmup run once instead of whatever was provided with --repeat

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125917
Approved by: https://github.com/desertfire
2024-05-15 05:32:06 +00:00
Sam Larsen
966ebd2e24 Add --warm-start-latency to benchmark harness (#125353)
Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e., a "warm start"). We can later add this mode to CI runs to collect compile times for warm start.
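
The cold/warm sequence can be sketched with the standard library alone. This illustrates the idea only and is not the harness code: the child script, the `DEMO_CACHE_DIR` variable, and `run_in_fresh_process` are hypothetical stand-ins for the benchmark command and the inductor cache directory.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for one benchmark run: report whether a cache entry already
# existed, then populate the cache.
CHILD = r"""
import os, pathlib
entry = pathlib.Path(os.environ["DEMO_CACHE_DIR"]) / "artifact"
print("warm" if entry.exists() else "cold")
entry.write_text("compiled")
"""


def run_in_fresh_process(cache_dir: str) -> str:
    """Launch the stand-in run in a new interpreter that shares `cache_dir`."""
    env = dict(os.environ, DEMO_CACHE_DIR=cache_dir)
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


with tempfile.TemporaryDirectory() as cache_dir:
    first = run_in_fresh_process(cache_dir)   # fresh cache: cold start
    second = run_in_fresh_process(cache_dir)  # new process, same cache: warm start
print(first, second)
```

Running each phase in its own process matters because in-memory state from the first run would otherwise hide the cost of loading from the on-disk cache.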

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353
Approved by: https://github.com/eellison, https://github.com/desertfire
2024-05-09 21:12:15 +00:00
Yu, Guangye
d17be10df1 make torch.amp.autocast more generic (#125103)
# Motivation
As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` cannot be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast`, since `torch.amp.autocast` does not have the default `dtype` for CPU (`torch.bfloat16` by default) and CUDA (`torch.float16` by default), respectively. We would like `torch.amp.autocast` to be more generic to help developers/customers write device-agnostic code, because there is not enough reason to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend.

# Solution
When `None` is passed to `dtype`, we should use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` needs to be supported in the JIT path for BC.

# Additional Context
With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`.
Add two new UTs to cover this change in the eager and JIT paths, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui
2024-05-08 12:13:26 +00:00
BowenBao
a3d97f6ce4 [ONNX] Benchmark onnx export w/ ort fusions (#125700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125700
Approved by: https://github.com/thiagocrepaldi
2024-05-08 01:10:05 +00:00
Animesh Jain
f04c8471a4 [dynamo][prepare for nn module guards] Guard nn modules for a few benchmarks (#125324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125324
Approved by: https://github.com/jansel
ghstack dependencies: #125439, #125421, #124522
2024-05-04 22:08:56 +00:00
Edward Z. Yang
e93b57a570 Add propagate_real_tensors mode for unbacked (#125115)
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or work around it to see the next one.

This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.

I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this:

```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```

Potential later follow ups:

* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, equal to as if your GC was broken and every intermediate tensor was kept live

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
2024-05-02 15:28:26 +00:00
Aaron Gokaslan
e3b9b71684 [BE]: Ruff - TRY401 - Avoid verbose exception logging (#125126)
Don't bother logging the exception object explicitly with the logger; it is captured anyway, and logging it again generates verbose output.
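
A small standard-library illustration of the pattern TRY401 targets (the logger name and message are made up for this example): `logger.exception` already appends the traceback, which names the exception, so passing the exception object as an argument only repeats it.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("try401_demo")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.ERROR)
logger.propagate = False

try:
    1 / 0
except ZeroDivisionError:
    # TRY401 flags the redundant form:
    #     logger.exception("division failed: %s", exc)
    # The traceback appended by logger.exception already shows the exception.
    logger.exception("division failed")

print(stream.getvalue())
```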

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125126
Approved by: https://github.com/ezyang
2024-04-28 21:44:33 +00:00
Stonepia
3d8585e501 [XPU] Add manual_seed and synchronize method (#124709)
This PR sets the following device-specific settings for XPU (Intel GPU):
1. Set the manual seed for xpu
2. Set the synchronization method for xpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124709
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-04-26 12:32:12 +00:00
Simon Fan
14430564ce [cudagraphs] add cudagraph_skips counter (#124804)
used in tests and benchmark csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804
Approved by: https://github.com/eellison
ghstack dependencies: #119729, #124700
2024-04-26 03:22:29 +00:00
PyTorch MergeBot
154157416c Revert "[cudagraphs] add cudagraph_skips counter (#124804)"
This reverts commit fdad16b851.

Reverted https://github.com/pytorch/pytorch/pull/124804 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))
2024-04-25 09:26:25 +00:00
Simon Fan
fdad16b851 [cudagraphs] add cudagraph_skips counter (#124804)
used in tests and benchmark csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804
Approved by: https://github.com/eellison
ghstack dependencies: #119729, #124700
2024-04-25 03:38:09 +00:00
eellison
000d55870a Enable in oss (#124031)
Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
2024-04-19 20:28:55 +00:00
Sam Larsen
290e3e7abb Add ability to save TORCH_COMPILE_DEBUG logs for CI failures (#124408)
Summary: The intent is that we can whitelist certain benchmarks to a) enable TORCH_COMPILE_DEBUG=1, and b) save the generated artifacts in test/debug in case of a failure. Via the rules in action.yml, we can then upload test/debug/ to S3 whenever it exists. I chose to introduce a new directory (test/debug/) rather than using an existing one (e.g., test/test-reports/), because these don't seem like test reports and we can later add other debug-related artifacts if we find it useful. For example, we might want to later explore including the inductor cache artifacts.

Test Plan:
See artifacts generated when I force a failure: https://hud.pytorch.org/pr/124234
Specifically: https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/8729891826/1/artifact/debug-test-inductor_torchbench-2-2-linux.g5.4xlarge.nvidia.gpu_23953679574.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124408
Approved by: https://github.com/desertfire
2024-04-19 02:46:00 +00:00
Simon Fan
7c94652d7d [benchmarks] Add --use-warm-peak-memory (#124326)
Measuring peak memory on the first run can capture cases where compiled artifacts leak into runtime, but it also introduces a lot of noise from cudnn/triton autotuning, which generally uses as much memory as it can. Making this flag the default will need some discussion, so for now I am only adding it to unblock compiled backward benchmarking (where all autotuning memory use is exposed).

```
e.g. resnet50
# without --warm-peak-memory
memory: eager: 1.95 GB, dynamo: 6.68 GB, ratio: 0.29

# with --warm-peak-memory
memory: eager: 1.96 GB, dynamo: 2.06 GB, ratio: 0.95
```
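
The reported numbers are consistent with the ratio being eager peak memory divided by compiled peak memory. The arithmetic can be checked with a short sketch; `memory_ratio` is a hypothetical helper written for this illustration, not a function from the benchmark harness.

```python
def memory_ratio(eager_gb: float, compiled_gb: float) -> float:
    """Ratio of eager to compiled peak memory; values below 1 mean compiled used more."""
    return eager_gb / compiled_gb


# Figures copied from the resnet50 example above.
cold = memory_ratio(1.95, 6.68)  # peak measured on the first (cold) run
warm = memory_ratio(1.96, 2.06)  # peak measured after warmup

print(f"cold: {cold:.2f}, warm: {warm:.2f}")  # prints "cold: 0.29, warm: 0.95"
```

The eager figure barely moves between the two runs, so almost all of the difference comes from memory the compiled run uses during its first iteration.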

![image](https://github.com/pytorch/pytorch/assets/9547562/36cd8687-a7f7-4ec6-b989-7e1263aa7d37)

This issue may also affect large models. Here's an example case of cudnn_convolution_backward autotuning allocating 30GB to tune a model otherwise using 5GB memory:
![image](https://github.com/pytorch/pytorch/assets/9547562/4e544b11-3579-4c69-811a-91d896f1ba66)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124326
Approved by: https://github.com/jansel
ghstack dependencies: #119411
2024-04-18 02:57:01 +00:00
Simon Fan
0ddd17bdc6 [benchmarks] Add --snapshot-memory to get memory pickles for eager vs compiled (#119411)
Creates memory snapshot pickles, e.g.:
```
inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_compiled_pytorch_stargan.pickle
inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_eager_pytorch_stargan.pickle
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119411
Approved by: https://github.com/jansel
2024-04-18 02:57:01 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
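
As a small illustration of the pattern this rule targets (the exception class and function names below are hypothetical, not taken from the PR): raising an exception class without arguments behaves the same with or without the empty parentheses, because Python instantiates the class itself.

```python
class BenchmarkSkipped(Exception):
    """Hypothetical exception used only for this illustration."""


def before():
    raise BenchmarkSkipped()  # empty parentheses: the form the rule flags


def after():
    raise BenchmarkSkipped  # equivalent: Python instantiates the class with no arguments


def with_message():
    raise BenchmarkSkipped("needs CUDA")  # parentheses are required when passing arguments


def caught_args(fn):
    """Call fn and return the args of the BenchmarkSkipped it raises."""
    try:
        fn()
    except BenchmarkSkipped as exc:
        return exc.args
    raise AssertionError("expected BenchmarkSkipped")
```

Here `caught_args(before)` and `caught_args(after)` both return an empty tuple, so the rewrite does not change behavior.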

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
chunyuan
ec00daf4f1 [aotinductor] Fix benchmarks with self.autocast for run_performance_test (#123699)
## Pitch
Similar to https://github.com/pytorch/pytorch/pull/110490 which fixes the `self.autocast` in the `check_accuracy` function, this PR fixes the `self.autocast` context in the `run_performance_test` function.

## Description
The code inside `check_accuracy` after the fix in https://github.com/pytorch/pytorch/pull/110490:
a4a49f77b8/benchmarks/dynamo/common.py (L2490-L2500)

The code in `run_performance_test` on the main branch before this PR does not have the `self.autocast` context:
a4a49f77b8/benchmarks/dynamo/common.py (L2685-L2692)

For eager mode, the `model_iter_fn` (which is actually [forward_pass](e8ad5460c0/benchmarks/dynamo/huggingface.py (L556-L558))) is used in [warmup](e8ad5460c0/benchmarks/dynamo/common.py (L2690)) and [speedup_experiment](e8ad5460c0/benchmarks/dynamo/common.py (L648)). `forward_pass` has the `self.autocast` context, so it can run in BF16 when AMP is on. For AOTInductor, however, we call `export_aot_inductor` in both [warmup](e8ad5460c0/benchmarks/dynamo/common.py (L2695)) and [speedup_experiment](e8ad5460c0/benchmarks/dynamo/common.py (L644-L646)), which doesn't have the `autocast` context and therefore always runs in FP32.
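
The mismatch can be illustrated with a toy sketch that uses a stand-in context manager rather than the real `torch.autocast`. Every name here (`fake_autocast`, `model_call`, and the three path functions) is hypothetical; the sketch only shows why a call path that never enters the context keeps the default precision.

```python
import contextlib

# Stand-in for the ambient autocast state; the real mechanism lives inside PyTorch.
_state = {"dtype": "fp32"}


@contextlib.contextmanager
def fake_autocast(dtype="bf16"):
    """Temporarily switch the ambient dtype, restoring it on exit."""
    previous = _state["dtype"]
    _state["dtype"] = dtype
    try:
        yield
    finally:
        _state["dtype"] = previous


def model_call():
    """Pretend model: reports the dtype it would run in."""
    return _state["dtype"]


def eager_forward_pass(autocast):
    # The eager path enters the context, so AMP takes effect.
    with autocast():
        return model_call()


def aot_path_before_fix(autocast):
    # The context is never entered, so the call stays in the default dtype.
    return model_call()


def aot_path_after_fix(autocast):
    # Wrapping the call in the same context makes both paths agree.
    with autocast():
        return model_call()
```

With `fake_autocast`, the eager and fixed paths report `bf16` while the unfixed path reports `fp32`; with `contextlib.nullcontext` (AMP off) all three agree on `fp32`.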

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123699
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-11 01:40:44 +00:00
angelayi
298171df5c [benchmark] Add namedtuple pytree serialization (#123648)
Fixes https://github.com/pytorch/pytorch/pull/123388#issuecomment-2045289729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123648
Approved by: https://github.com/desertfire
2024-04-09 22:25:36 +00:00
Tugsbayasgalan Manlaibaatar
d78991a738 Make torch_geometric models compatible with export (#123403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403
Approved by: https://github.com/angelayi
2024-04-05 20:58:16 +00:00
PyTorch MergeBot
8c7d8f0ff2 Revert "Make torch_geometric models compatible with export (#123403)"
This reverts commit 2ffab6e663.

Reverted https://github.com/pytorch/pytorch/pull/123403 on behalf of https://github.com/atalman due to Related issue basic_gnn_gin ([comment](https://github.com/pytorch/pytorch/pull/123403#issuecomment-2039817292))
2024-04-05 13:34:41 +00:00
Tugsbayasgalan Manlaibaatar
2ffab6e663 Make torch_geometric models compatible with export (#123403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403
Approved by: https://github.com/angelayi
2024-04-05 05:26:01 +00:00
Angela Yi
482d8bf1ea [aoti] Change aot_compile callsites (#122225)
Summary:
Replacing `torch._export.aot_compile` callsites with
```
ep = torch.export._trace._export(.., predispatch=True)   # Traces the given program into predispatch IR
so_path = torch._inductor.aot_compile_ep(ep, ...)  # Takes an exported program and compiles it into a .so
```

This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better.

Test Plan: CI

Differential Revision: D54808612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225
Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov
2024-03-29 21:34:20 +00:00
eellison
ba69dc6675 [Easy] add option to print compilation time (#121996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121996
Approved by: https://github.com/davidberard98
2024-03-18 22:42:41 +00:00
Animesh Jain
cd1751b14f [dynamo] Measure Dynamo cache latency lookup (#121604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604
Approved by: https://github.com/jansel
ghstack dependencies: #121614, #121622
2024-03-12 17:09:11 +00:00
James Wu
ae22bdaefe Update torchbench commit pin, add sam_fast benchmark (#121420)
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```

sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
2024-03-11 19:48:53 +00:00