121 Commits

Author SHA1 Message Date
James Wu
a7ca6a9113 Enable autograd cache on inductor tests (#140890)
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.

I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.

I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I changed a try/except to catch Exception instead of just one specific exception.
- Add extra structured logging for metadata on cache hits
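
The "never crash, only bypass" rule from the first bugfix can be sketched as follows. This is a minimal illustration and not the AOTAutogradCache code: `BypassCache`, `load_or_compile`, the `metadata` attribute, and this version of `check_can_cache` are hypothetical stand-ins.

```python
class BypassCache(Exception):
    """Hypothetical signal meaning 'skip the cache for this graph'."""


def check_can_cache(graph):
    # Hypothetical check: in standalone mode some attributes may be missing,
    # which surfaces as AttributeError rather than a dedicated bypass error.
    return graph.metadata["cacheable"]


def load_or_compile(graph, compile_fn, cache):
    try:
        if not check_can_cache(graph):
            raise BypassCache("graph is not cacheable")
        key = graph.name
    except Exception:
        # Catch every Exception, not one specific type: a failed check
        # must degrade to a cache bypass and never crash compilation.
        return compile_fn(graph)
    if key not in cache:
        cache[key] = compile_fn(graph)
    return cache[key]
```

With the broad `except`, a graph object that lacks `metadata` is simply compiled without touching the cache.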

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
2024-11-27 20:41:43 +00:00
zengxian
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380. Adjusts the torchbench and huggingface skip-model lists so that we can remove `--no-skip` when running benchmarks on the 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
Yiming Zhou
050ad925f3 [benchmark] Add to torchbench relative path search (#134871)
Add a relative path to the torchbench search in the benchmark script. This enables users to run `torchbench.py` from inside the `pytorch/benchmarks/dynamo` folder when the `torchbench` repo is cloned at the same level as `pytorch`.
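
A relative path search of this kind can be sketched as below. The function name `find_torchbench` and the candidate list are assumptions made for illustration; they are not taken from `torchbench.py`.

```python
import os

# Hypothetical candidate locations, relative to the directory the script runs in.
CANDIDATES = (
    "./torchbench",
    "../torchbench",
    "../../torchbench",
    "../../../torchbench",
)


def find_torchbench(start_dir, candidates=CANDIDATES):
    """Return the first candidate directory that exists, or None."""
    for rel in candidates:
        path = os.path.normpath(os.path.join(start_dir, rel))
        if os.path.isdir(path):
            return path
    return None
```

Starting from `pytorch/benchmarks/dynamo`, the `../../../torchbench` entry is the one that resolves to a `torchbench` checkout sitting next to `pytorch`.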

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871
Approved by: https://github.com/FindHao
2024-08-31 00:28:22 +00:00
Sergii Dymchenko
da1a1fa55f Move load_yaml_file to common (#131924)
This is for https://github.com/pytorch/pytorch/pull/131724 and future timm_models.py refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131924
Approved by: https://github.com/shunting314, https://github.com/huydhn
2024-07-26 19:47:52 +00:00
Animesh Jain
246e32055a [benchmark] Add hf_T5_generate to inline_inbuilt_nn_modules (#131804)
Fixes https://github.com/pytorch/pytorch/issues/121989

We are turning on the flag by default in another PR. But that PR can go
through reverts. So, forcibly adding the benchmark to prevent dashboard
fluctuation in case of reverts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131804
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #131795, #131801
2024-07-26 00:20:42 +00:00
Xu Zhao
e3eaa22126 [torchbench][multisect] Run accuracy check at Diff time (#131266)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388

We can enable accuracy checks at Diff time since it is not a performance metric.

* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.

Test Plan:
Sandcastle CI

Or buck test command:

```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```

Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429

Reviewed By: oulgen

Differential Revision: D59825601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
2024-07-22 20:14:28 +00:00
Xuehai Pan
c0ed38e644 [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754
Approved by: https://github.com/ezyang
2024-07-17 14:34:42 +00:00
Shunting Zhang
0fcbca9adb [pt2-bench] use eval mode for vision_maskrcnn (#130163)
Try to fix https://github.com/pytorch/pytorch/issues/130161

`--accuracy` works because we use eval mode, while `--training` does not work because we use training mode but TorchBench does not return targets tensors. In training mode, vision_maskrcnn requires targets tensors.

I fix that by always using eval mode for vision_maskrcnn training.

With the fix, I start to see a segfault: https://gist.github.com/shunting314/5a70df3463b2a4421b2c34aa88e78d1f

I'm not sure if that's due to my local setup, but I think the fix in this PR is something we need anyway. We can check the dashboard after the PR is in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130163
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005
2024-07-06 00:49:15 +00:00
Shunting Zhang
c0735a3dd3 [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models whose failures I believe are not due to a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I cannot repro locally, but judging from the dashboard log:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

A tolerance of 0.02 should suppress this error.
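
To see why 0.02 is enough, the logged numbers can be plugged into a small check. The sketch assumes the comparison has the form `res_error <= multiplier * ref_error + tol / 10`; that formula is an assumption here (it is not quoted in the message), although it reproduces the failure at `tol: 0.010000` reported above. The function name `passes_accuracy` is hypothetical.

```python
def passes_accuracy(res_error, ref_error, multiplier, tol):
    # Assumed form of the RMSE check: the compiled result may be at most
    # `multiplier` times as far from the fp64 baseline as eager, plus slack.
    return res_error <= multiplier * ref_error + tol / 10.0


# DebertaForQuestionAnswering numbers from the dashboard log above.
with_old_tol = passes_accuracy(0.01803, 0.00537, multiplier=3.0, tol=0.01)
with_new_tol = passes_accuracy(0.01803, 0.00537, multiplier=3.0, tol=0.02)
print(with_old_tol, with_new_tol)  # False True
```

Under that assumption the threshold moves from 0.01711 to 0.01811, just above the observed 0.01803.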

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.

## mobilenetv3_large_100

Fails in max-autotune (MA) mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is:
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.
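
As a rough worked example of why a larger multiplier is needed here, the sketch below solves for the smallest multiplier that would let the logged values pass. It assumes the check has the form `res_error <= multiplier * ref_error + tol / 10`; both that formula and the helper name `min_multiplier` are assumptions for illustration.

```python
def min_multiplier(res_error, ref_error, tol):
    # Rearranged from the assumed check:
    #   res_error <= multiplier * ref_error + tol / 10
    return (res_error - tol / 10.0) / ref_error


# mobilenetv3_large_100 numbers from the dashboard log above.
needed = min_multiplier(0.29754, 0.05205, tol=0.04)
print(round(needed, 2))  # about 5.64, well above the logged multiplier of 3.0
```

For a scalar tensor like this one (`shape=torch.Size([])`), a single noisy element determines the whole RMSE, which is consistent with the message's point about small tensors.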

## yolov3

Fails on the dashboard with the error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with the error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with the command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-05 10:26:39 +00:00
PyTorch MergeBot
fa3953a2e1 Revert "[pt2-bench] fix accuracy failure for a few models (#129941)"
This reverts commit dafbd603ee.

Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
Shunting Zhang
dafbd603ee [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models whose failures I believe are not due to a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I cannot repro locally, but judging from the dashboard log:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

A tolerance of 0.02 should suppress this error.

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.

## mobilenetv3_large_100

Fails in max-autotune (MA) mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is:
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.

## yolov3

Fails on the dashboard with the error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with the error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with the command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-04 01:14:29 +00:00
Aaron Gokaslan
6c2a8b6b38 [Ez][BE]: Enable new stable ruff rules (#129825)
Applies a bunch of new ruff lint rules that are now stable. Some of these improve efficiency or readability. Since I already did passes on the codebase for these when they were in preview, there should be relatively few changes to the codebase. This is just more for future hardening of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129825
Approved by: https://github.com/XuehaiPan, https://github.com/jansel, https://github.com/malfet
2024-07-02 14:47:10 +00:00
Animesh Jain
5d1763d159 Add lcnet to the inline_inbuilt_nn_module list (#129775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129775
Approved by: https://github.com/mlazos
ghstack dependencies: #129583, #129610
2024-06-29 05:47:28 +00:00
Animesh Jain
e4d8aa4d24 [torchbench] Enable some models with inline_inbuilt_nn_modules (#128315)
For all models except drq, the number of graph breaks/recompiles goes down.
For drq, it goes up, and this increase is legitimate.

Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128315
Approved by: https://github.com/jansel
2024-06-16 08:37:23 +00:00
Sam Larsen
55a6b38f52 [inductor] enable fx graph cache on torchbench (#128239)
Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239
Approved by: https://github.com/oulgen
2024-06-12 22:15:02 +00:00
PyTorch MergeBot
fa88f390a0 Revert "[inductor] enable fx graph cache on torchbench (#128239)"
This reverts commit 734e8f6ad7.

Reverted https://github.com/pytorch/pytorch/pull/128239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to surface a bunch of inductor failures in trunk 734e8f6ad7 ([comment](https://github.com/pytorch/pytorch/pull/128239#issuecomment-2159789242))
2024-06-11 04:53:38 +00:00
Sam Larsen
734e8f6ad7 [inductor] enable fx graph cache on torchbench (#128239)
Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239
Approved by: https://github.com/oulgen
2024-06-11 00:40:31 +00:00
Xuehai Pan
26f4f10ac8 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
2024-05-27 14:49:57 +00:00
PyTorch MergeBot
55c0ab2887 Revert "[5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)"
This reverts commit 7763c83af6.

Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))
2024-05-27 09:22:08 +00:00
Xuehai Pan
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
Yueming Hao
93ba5e7291 Fix typo for input (#126981)
The variable name should be `cloned_inputs` rather than `clone_inputs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126981
Approved by: https://github.com/xuzhao9
2024-05-23 22:08:14 +00:00
Yueming Hao
2813f0672a fix huggingface models input issue in torchbench (#126579)
Fixes https://github.com/pytorch/benchmark/issues/2263.

According to https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L509, example_inputs are formatted as dictionaries for HuggingFace models. However, this forward_pass function unpacks all inputs into mod with `*`, so a HuggingFace model may receive only the input_ids key rather than the full set of example inputs.

To reproduce, run the following command.
```bash
python pytorch/benchmarks/dynamo/torchbench.py --performance --inference -dcuda --only=hf_Bert --output=torchbench_inference.csv
```
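
The underlying Python behaviour can be shown without any model. In the sketch below, `forward` is a stand-in for a HuggingFace model's forward method and `call_model` is a hypothetical helper; neither is the code from `torchbench.py`.

```python
def forward(input_ids=None, attention_mask=None):
    # Stand-in for a HuggingFace model's forward(); returns what it received.
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def call_model(mod, inputs):
    # Dict inputs must be expanded as keyword arguments; sequences stay positional.
    if isinstance(inputs, dict):
        return mod(**inputs)
    return mod(*inputs)


example_inputs = {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]}

# `*` iterates over the dict's keys, so the model receives strings, not data.
wrong = forward(*example_inputs)
right = call_model(forward, example_inputs)
```

Here `wrong["input_ids"]` is the string `"input_ids"`, while `right["input_ids"]` is the actual list of ids.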

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126579
Approved by: https://github.com/xuzhao9
2024-05-20 19:10:46 +00:00
Animesh Jain
f04c8471a4 [dynamo][prepare for nn module guards] Guard nn modules for a few benchmarks (#125324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125324
Approved by: https://github.com/jansel
ghstack dependencies: #125439, #125421, #124522
2024-05-04 22:08:56 +00:00
Deng Weishi
c8d2a55273 Intel GPU: specify the tolerance for torchbench models (#125213)
We encountered some model accuracy failures because the results are sensitive to the tolerance. In general, we align with CUDA practice. This PR adjusts the tolerance for TorchBench models in training mode on Intel GPU devices to match CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125213
Approved by: https://github.com/desertfire
2024-05-01 17:45:15 +00:00
eellison
000d55870a Enable in oss (#124031)
Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
2024-04-19 20:28:55 +00:00
Tugsbayasgalan Manlaibaatar
d78991a738 Make torch_geometric models compatible with export (#123403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403
Approved by: https://github.com/angelayi
2024-04-05 20:58:16 +00:00
PyTorch MergeBot
8c7d8f0ff2 Revert "Make torch_geometric models compatible with export (#123403)"
This reverts commit 2ffab6e663.

Reverted https://github.com/pytorch/pytorch/pull/123403 on behalf of https://github.com/atalman due to Related issue basic_gnn_gin ([comment](https://github.com/pytorch/pytorch/pull/123403#issuecomment-2039817292))
2024-04-05 13:34:41 +00:00
Tugsbayasgalan Manlaibaatar
2ffab6e663 Make torch_geometric models compatible with export (#123403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403
Approved by: https://github.com/angelayi
2024-04-05 05:26:01 +00:00
Shunting Zhang
1c4887d52b fix dlrm accuracy test in max-autotune (#122012)
torchrec_dlrm training fails the accuracy check when max-autotune is enabled.

I found there is no real issue in PT2. We fail to get fp64 reference results for the accuracy check. In max-autotune mode the numerics may change a bit and cause the cosine similarity check to fail. Using an fp64 baseline is more reliable and makes the test pass.

The reason we were not using an fp64 baseline earlier is that torchrec uses a dataclass [Batch](99e6e669b5/torchrec/datasets/utils.py (L28)) to represent the input. We use pytree to cast the model and inputs to fp64, and pytree cannot look into a dataclass. My fix is to convert the dataclass to a namedtuple, which is more pytree friendly.
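
The difference between the two containers can be illustrated with a toy tree map. The `tree_map` below is a simplified stand-in written for this example (it is not `torch.utils._pytree`), and `BatchDC`, `BatchNT`, and `to_float` are hypothetical names; integers play the role of fp32 tensors and floats the role of fp64 tensors.

```python
from collections import namedtuple
from dataclasses import dataclass


def tree_map(fn, obj):
    # Recurse into containers the traversal knows about; everything else is a leaf.
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):  # namedtuple
        return type(obj)(*(tree_map(fn, x) for x in obj))
    if isinstance(obj, (list, tuple)):
        return type(obj)(tree_map(fn, x) for x in obj)
    if isinstance(obj, dict):
        return {k: tree_map(fn, v) for k, v in obj.items()}
    return fn(obj)


def to_float(x):
    # Stand-in for "cast tensors to fp64": only plain ints are converted.
    return float(x) if isinstance(x, int) else x


@dataclass
class BatchDC:
    dense: int
    labels: int


BatchNT = namedtuple("BatchNT", ["dense", "labels"])

dc_out = tree_map(to_float, BatchDC(1, 2))  # opaque leaf: fields stay ints
nt_out = tree_map(to_float, BatchNT(1, 2))  # traversed: fields become floats
```

The dataclass instance is handed to `to_float` as a single opaque leaf and comes back unchanged, whereas the namedtuple is traversed field by field.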

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122012
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-19 22:23:42 +00:00
James Wu
ae22bdaefe Update torchbench commit pin, add sam_fast benchmark (#121420)
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```

sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
2024-03-11 19:48:53 +00:00
Shunting Zhang
c4ed456fc3 [inductor] fix accuracy failure for a few models under freezing (#121054)
Fix https://github.com/pytorch/pytorch/issues/120545 . These models fail the accuracy test with freezing because of the conv-batchnorm fusion, which causes relatively large numerical churn.

For the failed TIMM models, raising the tolerance to `8 * 1e-2` makes the test pass.

For the failed TB models, the numerical difference is too large. After a discussion with @eellison , we decided to skip them with freezing for now.

On the other hand, we should probably dig further into why the conv-bn fusion causes such a large numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
2024-03-02 04:53:59 +00:00
Yukio Siraichi
cef9f70f4b Move torchbench model configuration into a YAML file. (#120299)
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file inside the `skip` key.

This is an effort so that external consumers are able to easily replicate the performance
results and coverage results from the PyTorch HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
2024-02-23 14:00:14 +00:00
Yukio Siraichi
2f6fc33c20 Move skip sets into a new file. (#118032)
This PR moves the skip sets that lived in benchmarks/dynamo/torchbench.py into a more
readable YAML file, so that it is consumable from other projects (e.g. XLA).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118032
Approved by: https://github.com/lezcano, https://github.com/ezyang
2024-01-24 19:22:01 +00:00
haozhe.zhu
6500ccebd7 enable fp16 autocast for dynamo benchmark (#114088)
`--amp` enables the amp path for `CUDA` (default amp_dtype is float16) and `CPU` (default amp_dtype is bfloat16).

If users set `--amp_dtype`, the user-provided amp_dtype takes the highest priority.
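
The priority order described above can be sketched as a small resolver. The function name `resolve_amp_dtype`, the use of plain strings for dtypes, and the error for other devices are assumptions for illustration; only the two defaults and the override rule come from the message.

```python
# Per-device defaults stated in the message.
DEFAULT_AMP_DTYPE = {"cuda": "float16", "cpu": "bfloat16"}


def resolve_amp_dtype(device, amp_dtype=None):
    # An explicit --amp_dtype always wins over the per-device default.
    if amp_dtype is not None:
        return amp_dtype
    if device not in DEFAULT_AMP_DTYPE:
        # Hypothetical handling: the message only covers CUDA and CPU.
        raise ValueError(f"no default amp dtype known for device {device!r}")
    return DEFAULT_AMP_DTYPE[device]
```

For example, `resolve_amp_dtype("cpu")` gives `"bfloat16"`, while `resolve_amp_dtype("cpu", "float16")` gives `"float16"`.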

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-12-14 12:38:44 +00:00
Jason Ansel
de89a53df8 [benchmarking] Reduce box_detections_per_img for vision_maskrcnn (#115487)
This fixes a failure on the [perf dashboard](https://hud.pytorch.org/benchmark/compilers) with `--amp` mode.  I believe boxes 5 and 6 were getting swapped.  The existing comment explains the issue.

Before
```
$ ./benchmarks/dynamo/torchbench.py --training  --accuracy --no-translation-validatio --amp --backend=inductor --disable-cudagraphs --only vision_maskrcnn
...
[2023-12-09 13:21:27,292] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.00171, (ref-fp64): 0.00054 and shape=torch.Size([256, 256, 3, 3])
[2023-12-09 13:21:27,292] torch._dynamo.utils: [ERROR] Accuracy failed for key name backbone.fpn.layer_blocks.2.0.weight.grad
fail_accuracy
```

After
```
$ ./benchmarks/dynamo/torchbench.py --training  --accuracy --no-translation-validatio --amp --backend=inductor --disable-cudagraphs --only vision_maskrcnn
...
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115487
Approved by: https://github.com/yanboliang
2023-12-11 08:42:25 +00:00
Jason Ansel
7bbc19adc4 [dynamo] Unskip DALLE2_pytorch (#114960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114960
Approved by: https://github.com/eellison
ghstack dependencies: #114959
2023-12-02 00:40:25 +00:00
Jason Ansel
67562c8cf8 Add DALLE2_pytorch to skips (#114924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114924
Approved by: https://github.com/huydhn
2023-12-01 07:15:59 +00:00
Jason Ansel
b35ca2cb94 Better error message for misconfigured torchbench model (#114827)
```
  File "/home/jansel/pytorch/./benchmarks/dynamo/torchbench.py", line 381, in load_model
    benchmark_cls.name = model_name
AttributeError: 'NoneType' object has no attribute 'name'
```
becomes
```
  File "/home/jansel/pytorch/./benchmarks/dynamo/torchbench.py", line 381, in load_model
    raise NotImplementedError(f"{model_name}.Model is None")
NotImplementedError: torchrec_dlrm.Model is None
```
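
The pattern behind the new message is an explicit `None` check before the attribute assignment. The sketch below reuses the `raise` line shown in the traceback; the helper name `get_benchmark_cls` and the way the module is represented are hypothetical.

```python
from types import SimpleNamespace


def get_benchmark_cls(module, model_name):
    benchmark_cls = getattr(module, "Model", None)
    if benchmark_cls is None:
        # Fail early with the model name instead of an opaque AttributeError.
        raise NotImplementedError(f"{model_name}.Model is None")
    benchmark_cls.name = model_name
    return benchmark_cls


# A misconfigured model module whose `Model` attribute is None.
broken = SimpleNamespace(Model=None)
try:
    get_benchmark_cls(broken, "torchrec_dlrm")
except NotImplementedError as exc:
    message = str(exc)  # "torchrec_dlrm.Model is None"
```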

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114827
Approved by: https://github.com/xuzhao9, https://github.com/yanboliang
2023-11-30 19:11:01 +00:00
eellison
605236af06 Force fp16 for vision_maskrcnn inference (#113110)
Force fp16 for maskrcnn inference (it doesn't support bf16). Also skip phi_1_5 in training, since it OOMs even with batch size 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113110
Approved by: https://github.com/xmfan
2023-11-10 02:25:11 +00:00
Elias Ellison
f6fb9fd681 use smaller batch size for timm_efficientdet in inference (#113095)
Previously had OOMs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113095
Approved by: https://github.com/xmfan
ghstack dependencies: #112650
2023-11-07 07:08:16 +00:00
Elias Ellison
5c1ea30ca3 bump torchbench commit (#112650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112650
Approved by: https://github.com/msaroufim, https://github.com/xuzhao9
2023-11-07 03:56:16 +00:00
Simon Fan
28ebe5df7a yolov3: reduce batch size due to OOM (#111959)
yolov3 w/ cudagraphs (known to use more memory) is failing perf test due to OOM (https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Mon,%2016%20Oct%202023%2020:19:47%20GMT&stopTime=Mon,%2023%20Oct%202023%2020:19:47%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=main&lCommit=0b424ee0b7bfe09e0a438a63e8336e95eea85901&rBranch=main&rCommit=29048be41ca3aa8974795d93b9ea9fd6dee415fc)

I'm reducing the batch size from 16 to 8 to keep the same batch size for all yolov3 HUD benchmarks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111959
Approved by: https://github.com/xuzhao9
2023-10-25 06:18:53 +00:00
Simon Fan
88ef126a93 rename nanogpt_generate to nanogpt to also support train (#109746)
Differential Revision: [D49522940](https://our.internmc.facebook.com/intern/diff/D49522940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109746
Approved by: https://github.com/msaroufim, https://github.com/malfet, https://github.com/xuzhao9
2023-09-29 17:36:48 +00:00
angelayi
a565f1bee6 [aotinductor] Skip benchmarks with control flow (#109661)
Since AOTInductor doesn't support control flow yet, we will skip over tests that are currently failing due to containing control flow in the code. Logs taken from https://hud.pytorch.org/benchmark/compilers?startTime=Tue%2C%2012%20Sep%202023%2022%3A56%3A40%20GMT&stopTime=Tue%2C%2019%20Sep%202023%2022%3A56%3A40%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=main&lCommit=2c1554a0323107d821be3ff13df7833b9f0b960d&rBranch=main&rCommit=47be61e12bd51df27182343d312dc3df485d5559

Errors documented in https://github.com/pytorch/pytorch/issues/105217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109661
Approved by: https://github.com/desertfire
2023-09-25 18:49:06 +00:00
Mark Saroufim
e2cfbca5ab Add clip to dynamo runners (#109840)
CLIP was moved to canary models because we use the multimodal version which depends on torchtext which torchbench deprecated https://github.com/pytorch/benchmark/pull/1837

This issue didn't show up before because we hadn't updated the torchbench pin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109840
Approved by: https://github.com/cpuhrsch
2023-09-22 20:50:57 +00:00
eellison
d24ba7a634 Add 3d Attn Pattern to match HF Whisper (#109156)
Adds a 3d pattern that improves perf of HF Whisper from 1.3 -> 4.1. We could be matching more generally on 3d, but I'll leave that for another PR.

Thanks to @drisspg for helping me write the pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109156
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917, #109142
2023-09-20 16:39:31 +00:00
Simon Fan
54c5f474a7 Forward rank and world size info to Torchbench models when using dynamo runner (#108438)
Adding support to pass rank and world_size to torchbench model, via its extra_args parameter: https://github.com/pytorch/benchmark/blob/main/torchbenchmark/util/model.py#L83C80-L83C90

This is used for models which distribute over multiple GPUs e.g. simple_gpt https://github.com/pytorch/benchmark/pull/1867

Also add an option to skip multiprocess only gpu models

Testing via `python benchmarks/dynamo/torchbench.py -d cuda --output=benchmark_logs/performance.csv --inference --performance --timing --print-memory --multiprocess --only simple_gpt`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108438
Approved by: https://github.com/Chillee
2023-09-14 21:01:20 +00:00
drisspg
ad90ab31f2 Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it: a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFLOPs reported here are from an underclocked A100.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep, and for large compute-bound sizes we do see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-09-13 13:59:05 +00:00
Huy Do
a9c663c269 Revert "Flash Attention v2 (#105602)" (#108827)
This reverts commit add45aea1c.

There are some conflicts on some benchmark csv file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951 so I need to revert this manually.

The diff has been reverted internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827
Approved by: https://github.com/kit1980
2023-09-08 07:43:04 +00:00
PyTorch MergeBot
e45b290127 Revert "Revert "Flash Attention v2 (#105602)" (#108827)"
This reverts commit 24e9bbe22a.

Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))
2023-09-08 03:25:45 +00:00