Commit Graph

432 Commits

Author SHA1 Message Date
titaiwangms
18418a7dbb [ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586)
The ONNX-related compilers take a separate route for the accuracy check, and this PR routes the torch_onnx compiler to the correct measurement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586
Approved by: https://github.com/justinchuby
2024-07-12 15:47:59 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes empty collection factory calls to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` needs only one bytecode, `BUILD_MAP`, while the factory call `dict()` needs three: `PUSH_NULL`, `LOAD_NAME`, and `CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example above that replaces it with `OrderedDict`).
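Both effects can be reproduced with a small standalone snippet (illustrative only, not part of the PR):

```python
import collections
import timeit

# Performance: the literal skips the name lookup and the call.
print("{}:     ", timeit.timeit("{}", number=1_000_000))
print("dict():", timeit.timeit("dict()", number=1_000_000))

# Safety: a factory call silently picks up a shadowed name; a literal cannot.
dict = collections.OrderedDict  # shadow the builtin, as in the dis example above
d1 = {}
d2 = dict()
print(type(d1))  # <class 'dict'>
print(type(d2))  # <class 'collections.OrderedDict'>
```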

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
Shunting Zhang
c0735a3dd3 [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models that I believe fail for reasons other than a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I cannot repro locally, but from the dashboard log:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

A tolerance of 0.02 should suppress this error.

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.

## mobilenetv3_large_100
Fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only mobilenetv3_large_100
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.
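A minimal sketch of the idea (not the actual `torch._dynamo.utils.same` logic; the real function combines the multiplier and tolerance differently): compare the RMSE of the result against the RMSE of the reference, and give very small tensors a larger multiplier since their RMSE is noisier.

```python
import torch

def rmse(a, b):
    return torch.sqrt(torch.mean((a.double() - b.double()) ** 2))

def accuracy_ok(res, ref, ref_fp64, tol):
    # hypothetical size-dependent multiplier: tiny tensors get more slack
    multiplier = 10.0 if res.numel() <= 10 else 3.0
    return rmse(res, ref_fp64) <= multiplier * rmse(ref, ref_fp64) + tol
```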

## yolov3

Fails on the dashboard with this error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with this error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with the command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-05 10:26:39 +00:00
PyTorch MergeBot
fa3953a2e1 Revert "[pt2-bench] fix accuracy failure for a few models (#129941)"
This reverts commit dafbd603ee.

Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
titaiwangms
bffb278700 [ONNX] Add artifacts_dir to torch-onnx-patch in benchmark (#130069)
Add `artifacts_dir` to torch-onnx-patch to save error report for debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130069
Approved by: https://github.com/justinchuby
2024-07-04 07:11:02 +00:00
Shunting Zhang
dafbd603ee [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models that I believe fail for reasons other than a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I cannot repro locally, but from the dashboard log:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

A tolerance of 0.02 should suppress this error.

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.

## mobilenetv3_large_100
Fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only mobilenetv3_large_100
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in `torch._dynamo.utils.same`.

## yolov3

Fails on the dashboard with this error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with this error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with the command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-04 01:14:29 +00:00
Xuehai Pan
4ee1cb9b95 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-30 01:36:07 +00:00
PyTorch MergeBot
2effbcfcd8 Revert "[BE][Easy] replace import pathlib with from pathlib import Path (#129426)"
This reverts commit 6d75604ef1.

Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))
2024-06-29 23:24:06 +00:00
Xuehai Pan
6d75604ef1 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-29 15:42:09 +00:00
Xu Zhao
474d743dba [torchao][benchmark] Skip all accuracy tests by returning pass_due_to_skip (#129545)
Summary: As the title says.

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --quantization noquant --inference --bfloat16 --accuracy
```

Differential Revision: D59040593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129545
Approved by: https://github.com/HDCharles
2024-06-26 14:21:53 +00:00
Weizhuo Zhang
53f462c506 Write dynamo benchmarks performance result to csv when throw exceptions (#126764)
**Performance mode issue**: When the dynamo benchmark's performance warm-up fails, the result is not written into the csv file, but the accuracy result is still written as `fail_to_run` even when the dynamo pass fails. So the number of models in the accuracy csv does not match the number in the performance csv.
![image](https://github.com/pytorch/pytorch/assets/84730719/9043d215-130b-46b4-a835-f148c225947c)

- **Fix**: Models whose warm-up fails are now recorded into the csv file, as shown below:
![image](https://github.com/pytorch/pytorch/assets/84730719/7907a3c2-c942-42bb-b31c-55424a0e8117)
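A rough sketch of the recording logic (illustrative names and columns, not the actual benchmark code):

```python
import csv

def record_warmup_failure(csv_path, device, model_name, reason="fail_to_run"):
    # append a placeholder row so the performance csv stays aligned with the
    # accuracy csv even when a model never reaches the measurement phase
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([device, model_name, reason])
```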

**Accuracy mode issue**: `detectron2_fasterrcnn_r` models fail in accuracy mode but run successfully in performance mode. The accuracy failure is the same as in PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
  File "benchmarks/dynamo/torchbench.py", line 449, in <module>
    torchbench_main()
  File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
    process_entry(0, runner, original_dir, args)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
    return run(runner, args, original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
![image](https://github.com/pytorch/pytorch/assets/84730719/f25392f0-f982-46c8-8e2c-a8a25d85a21a)

- **Fix**: Same as PR ee557d8f61, setting the batch_size to 4 is skipped when testing dynamic shapes.

The dynamic shapes pass rate improved from 89% to **95%**:
| Comp Item | Compiler | Suite      | Before     | After fix  |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
2024-06-25 17:49:04 +00:00
titaiwangms
0e1e289033 [ONNX] Benchmark refactored ONNX export (#129427)
Reuse torch.onnx.export with the torch_onnx patch to test the ExportedProgram -> ONNX IR exporter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129427
Approved by: https://github.com/justinchuby
2024-06-25 04:47:53 +00:00
Jason Ansel
bdc39eef3b [inductor] Add --inductor-config benchmark flag (#129034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129034
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129024, #129033
2024-06-21 16:53:42 +00:00
Simon Fan
123812790b [compiled autograd] update benchmarks to use cli flags for fullgraph/dynamic (#127960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127960
Approved by: https://github.com/jansel
2024-06-21 08:16:33 +00:00
Deng Weishi
b542825066 Enable deterministic support for oneDNN (#127277)
This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848.
To support the request for Torchbench models, this PR enables the deterministic attribute for oneDNN operators on the XPU backend, such as convolution, deconvolution, and matmul.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui
2024-06-21 05:21:24 +00:00
Animesh Jain
e4d8aa4d24 [torchbench] Enable some models with inline_inbuilt_nn_modules (#128315)
For all models, graph breaks/recompiles decrease.
For drq, they increase, and that increase is legitimate.

Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128315
Approved by: https://github.com/jansel
2024-06-16 08:37:23 +00:00
Wu, Chunyuan
5ef70faaa7 Revert "Make torch_geometric models compatible with export (#123403)" (#128377)
This reverts commit d78991a738.

This PR reverts https://github.com/pytorch/pytorch/pull/123403 to fix the performance regression as discussed in https://github.com/pytorch/pytorch/issues/127513#issuecomment-2158835653.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128377
Approved by: https://github.com/jgong5, https://github.com/angelayi, https://github.com/desertfire
2024-06-12 14:53:01 +00:00
Xu Zhao
82d7a36a27 Added torchao nightly workflow (#128152)
Summary:
Add torchao benchmark workflow, upload the artifacts to GHA.

X-link: https://github.com/pytorch/benchmark/pull/2273

Test Plan:
```
python run_benchmark.py torchao --ci
```

Differential Revision: D58140479

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128152
Approved by: https://github.com/jerryzh168
2024-06-07 17:52:15 +00:00
Sun, Jiayi
2ff312359c skip hf_T5_generate in dynamic shape test (#121129)
As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` fails dynamic shape testing; we propose to skip dynamic batch size testing for this model in this PR.

* Error message:
```
  File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV
    guards = output_graph.shape_env.produce_guards(
  File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4).
```

* Root cause:
This error happens while creating a guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked`
I ran it with TORCH_LOGS="+dynamic" and got the key line: `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)`
The batch dimension of `inputs_tensor` in the dynamic batch size test is marked as the dynamic shape `s0`, so the batch dimension of `scores`, generated by a series of operations on `inputs_tensor`, is also `s0`. However, the function that creates `attention_mask` runs not in Dynamo but in plain Python, so the batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked`, generated by a series of operations on `attention_mask`, is also the real shape `4`, not the dynamic shape `s0`. The line `scores += position_bias_masked` requires creating a guard to check whether the batch dimension of `scores` always equals the batch dimension of `position_bias_masked`, i.e. Eq(s0, 4), and the error happens.
So the root cause is that the function creating `attention_mask` runs outside Dynamo, in plain Python. It runs outside Dynamo because Dynamo has a graph break on that function (at this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error:
`torch._dynamo.exc.Unsupported: Tensor.item`
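A minimal, hypothetical repro of the same pattern (not the actual hf_T5_generate code): one operand's batch dim is marked dynamic (`s0`), the other is built outside the compiled region with a concrete batch of 4, and the addition forces a guard like `Eq(s0, 4)`.

```python
import torch

def forward(scores, position_bias_masked):
    # broadcasting requires the two batch dims to match, which adds the guard
    scores += position_bias_masked
    return scores

scores = torch.randn(4, 8)
mask = torch.randn(4, 8)                 # concrete batch of 4, like attention_mask
torch._dynamo.mark_dynamic(scores, 0)    # batch dim of scores becomes s0
compiled = torch.compile(forward, fullgraph=True)
compiled(scores, mask)                   # may raise ConstraintViolationError: Eq(s0, 4)
```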

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
2024-06-07 06:28:29 +00:00
eellison
93bfe57144 cudagraphs: fix backward hooks & fsdp interaction (#126914)
Fixes

> ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE

This error would occur when composing PT2 FSDP and cudagraphs. Cudagraphs caches output tensor impls in the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations.

from code comment:
```
# this output represents a fresh allocated tensor.
# We return the same TensorImpl from run to run to avoid overhead.
# autograd.Function will reset the Autograd meta of output tensors
# as part of aot_autograd, but _backward_hooks are stored on tensors separately,
# so we need to manually reset hooks.
```
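A hedged illustration of why the reset matters (standalone toy code, not the cudagraphs implementation): hooks registered on a tensor live in its `_backward_hooks`, so re-registering on the same cached TensorImpl every run piles them up.

```python
import torch

cached_out = torch.randn(3, requires_grad=True)  # stands in for a cudagraph-cached output

for run in range(3):
    # e.g. FSDP re-registers its post-backward hook on what it assumes is a fresh tensor
    cached_out.register_hook(lambda grad: grad)

print(len(cached_out._backward_hooks))  # 3 hooks accumulated instead of 1
```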

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914
Approved by: https://github.com/awgu, https://github.com/xmfan
2024-05-28 22:07:41 +00:00
Xuehai Pan
26f4f10ac8 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
2024-05-27 14:49:57 +00:00
PyTorch MergeBot
55c0ab2887 Revert "[5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)"
This reverts commit 7763c83af6.

Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))
2024-05-27 09:22:08 +00:00
Xuehai Pan
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
Xuehai Pan
ba3b05fdf3 [1/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort stdlib (#127122)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
2024-05-25 08:25:50 +00:00
Yu, Guangye
c09205a057 Deprecate device-specific GradScaler autocast API (#126527)
# Motivation

## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.

So, we intend to deprecate them and **strongly recommend** developers use `torch.amp.GradScaler`.
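For example, the migration is a one-line change (sketch; any other constructor arguments stay the same):

```python
import torch

# deprecated, device-specific form:
scaler_old = torch.cuda.amp.GradScaler(init_scale=2.0**16)

# recommended, device-generic equivalent:
scaler_new = torch.amp.GradScaler("cuda", init_scale=2.0**16)
```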

## for `custom_fwd` and `custom_bwd`,
this is a good solution to make custom functions run with or without effect even in an autocast-enabled region, and it can be shared by other backends, like CPU and XPU.
So we generalize them to be device-agnostic, put them in `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.

# Additional Context
Add UT to cover the deprecated warning.
No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered `torch.cuda.amp.custom_f/bwd` cover them.
To facilitate review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`; the follow-up covers `custom_fwd` and `custom_bwd`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
2024-05-25 06:41:34 +00:00
Xu Zhao
1e818db547 [torchbench] Fix torchao benchmarking script (#126736)
As the title says.

Test Plan:

```
python benchmarks/dynamo/torchbench.py --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory

cuda eval  BERT_pytorch
[XZ Debug] Torch grad status: False
memory: eager: 0.82 GB, dynamo: 0.92 GB, ratio: 0.89
running benchmark: 100%
1.001x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126736
Approved by: https://github.com/jerryzh168, https://github.com/huydhn
2024-05-21 23:15:12 +00:00
Xu Zhao
2068dadbe8 [torchbench] Add torchao to PT2 Benchmark Runner (#126469)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2268

Support torchao performance and accuracy tests in PT2 Benchmark Runner, using the inductor backend as the baseline.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory

loading model: 0it [00:50, ?it/s]
cuda eval  BERT_pytorch
memory: eager: 0.75 GB, dynamo: 0.75 GB, ratio: 1.00
running benchmark: 100%
1.003x
```

Reviewed By: jerryzh168

Differential Revision: D57463273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126469
Approved by: https://github.com/huydhn
2024-05-20 17:53:44 +00:00
Matthew Hoffman
81277baa0c Remove removed ruff rule TRY200 (#126256)
My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema.

From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/

> This rule has been removed and its documentation is only available for historical reasons.
>
> This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead.

and we are currently explicitly ignoring B904.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126256
Approved by: https://github.com/Skylion007
2024-05-17 16:31:05 +00:00
Stonepia
5756b53dd8 [XPU] call empty_cache for dynamo tests (#126377)
When running a batch of models, the lack of an `empty_cache()` call would result in OOM for subsequent models.

This PR unifies the `empty_cache` call for both CUDA and XPU.
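A minimal sketch of what a unified call can look like (illustrative helper; the benchmark code may structure it differently):

```python
import torch

def empty_gpu_cache(device: str) -> None:
    # release cached allocator blocks between models so the next model
    # doesn't OOM on memory the previous one is still holding
    if device == "cuda":
        torch.cuda.empty_cache()
    elif device == "xpu":
        torch.xpu.empty_cache()
```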

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126377
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-05-17 06:05:51 +00:00
Sam Larsen
c87c39d935 [benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (#125953)
Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. As the --warm-start-latency flag is currently implemented, we see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953
Approved by: https://github.com/desertfire
ghstack dependencies: #125917
2024-05-15 05:32:06 +00:00
Sam Larsen
9f0d3f71c9 Adjust number of repeats when using --warm-start-latency benchmark flag (#125917)
Summary: In --warm-start-latency mode, we can just perform the cache-warmup run once instead of whatever was provided with --repeat

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125917
Approved by: https://github.com/desertfire
2024-05-15 05:32:06 +00:00
Sam Larsen
966ebd2e24 Add --warm-start-latency to benchmark harness (#125353)
Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e., a "warm start"). We can later add this mode to CI runs to collect compile times for warm starts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353
Approved by: https://github.com/eellison, https://github.com/desertfire
2024-05-09 21:12:15 +00:00
Yu, Guangye
d17be10df1 make torch.amp.autocast more generic (#125103)
# Motivation
As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` can NOT be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast`, since `torch.amp.autocast` does NOT pick up the default `dtype` for CPU (`torch.bfloat16`) and CUDA (`torch.float16`) respectively. We would like `torch.amp.autocast` to be more generic to help developers/customers write device-agnostic code, because there are not enough reasons to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend.

# Solution
When `None` is passed as the `dtype`, we use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` needs to be supported in the JIT path for BC.

# Additional Context
With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`.
Add two new UTs to cover this change in eager and jit path respectively.
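
For example, based on the equivalence stated above (sketch):

```python
import torch

# device-specific form, with its implicit default dtype:
with torch.cuda.amp.autocast():               # torch.float16 on CUDA
    pass

# generic form after this change picks the same default when dtype is None:
with torch.amp.autocast(device_type="cuda"):  # also torch.float16
    pass
```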

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui
2024-05-08 12:13:26 +00:00
BowenBao
a3d97f6ce4 [ONNX] Benchmark onnx export w/ ort fusions (#125700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125700
Approved by: https://github.com/thiagocrepaldi
2024-05-08 01:10:05 +00:00
Animesh Jain
f04c8471a4 [dynamo][prepare for nn module guards] Guard nn modules for a few benchmarks (#125324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125324
Approved by: https://github.com/jansel
ghstack dependencies: #125439, #125421, #124522
2024-05-04 22:08:56 +00:00
Edward Z. Yang
e93b57a570 Add propagate_real_tensors mode for unbacked (#125115)
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or work around it to see the next one.

This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.
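
A minimal sketch of flipping the mode on before compiling (the config name comes from this PR; how the benchmark harness wires it up may differ, and the model here is a trivial placeholder):

```python
import torch
import torch._functorch.config as functorch_config

# Propagate real tensors alongside fake tensors so data-dependent guards can be
# evaluated and logged as warnings instead of raising GuardOnDataDependentSymNode.
functorch_config.fake_tensor_propagate_real_tensors = True

model = torch.nn.Linear(8, 8)
compiled = torch.compile(model)
out = compiled(torch.randn(2, 8))
```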

I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensors and some other missing Dynamo features). You get a list of things that would have errored, like this:

```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```

Potential later follow ups:

* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, as if your GC were broken and every intermediate tensor were kept live.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
2024-05-02 15:28:26 +00:00
Aaron Gokaslan
e3b9b71684 [BE]: Ruff - TRY401 - Avoid verbose exception logging (#125126)
Don't bother logging the exception object explicitly with the logger; it's captured anyway and would generate verbose output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125126
Approved by: https://github.com/ezyang
2024-04-28 21:44:33 +00:00
Stonepia
3d8585e501 [XPU] Add manual_seed and synchronize method (#124709)
This PR sets the following device-specific settings for XPU (Intel GPU):
1. Set the manual seed for xpu
2. Set the synchronization method for xpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124709
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-04-26 12:32:12 +00:00
Simon Fan
14430564ce [cudagraphs] add cudagraph_skips counter (#124804)
used in tests and benchmark csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804
Approved by: https://github.com/eellison
ghstack dependencies: #119729, #124700
2024-04-26 03:22:29 +00:00
PyTorch MergeBot
154157416c Revert "[cudagraphs] add cudagraph_skips counter (#124804)"
This reverts commit fdad16b851.

Reverted https://github.com/pytorch/pytorch/pull/124804 on behalf of https://github.com/jeanschmidt due to one PR in this stack seems to have broken linux pull cuda12 tests ([comment](https://github.com/pytorch/pytorch/pull/119729#issuecomment-2076750595))
2024-04-25 09:26:25 +00:00
Simon Fan
fdad16b851 [cudagraphs] add cudagraph_skips counter (#124804)
used in tests and benchmark csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124804
Approved by: https://github.com/eellison
ghstack dependencies: #119729, #124700
2024-04-25 03:38:09 +00:00
eellison
000d55870a Enable in oss (#124031)
The biggest movement is 4% on HF inference and 9% on TIMM inference. Note: this is max-autotune mode, so we are more tolerant of compile-time increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
2024-04-19 20:28:55 +00:00
Sam Larsen
290e3e7abb Add ability to save TORCH_COMPILE_DEBUG logs for CI failures (#124408)
Summary: The intent is that we can whitelist certain benchmarks to a) enable TORCH_COMPILE_DEBUG=1, and b) save the generated artifacts in test/debug in case of a failure. Via the rules in action.yml, we can then upload test/debug/ to S3 whenever it exists. I chose to introduce a new directory (test/debug/) rather than using an existing one (e.g., test/test-reports/), because these don't seem like test reports and we can later add other debug-related artifacts if we find it useful. For example, we might want to later explore including the inductor cache artifacts.

Test Plan:
See artifacts generated when I force a failure: https://hud.pytorch.org/pr/124234
Specifically: https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/8729891826/1/artifact/debug-test-inductor_torchbench-2-2-linux.g5.4xlarge.nvidia.gpu_23953679574.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124408
Approved by: https://github.com/desertfire
2024-04-19 02:46:00 +00:00
Simon Fan
7c94652d7d [benchmarks] Add --use-warm-peak-memory (#124326)
Measuring peak memory on the first run can capture cases where compiled artifacts leak into runtime, but it also introduces a lot of noise from cudnn/triton autotuning, which generally uses as much memory as it can. Setting this flag as a default will need some discussion, so I will only add it to unblock compiled backward benchmarking (where all autotuning memory use is exposed).

```
e.g. resnet50
# without --warm-peak-memory
memory: eager: 1.95 GB, dynamo: 6.68 GB, ratio: 0.29

# with --warm-peak-memory
memory: eager: 1.96 GB, dynamo: 2.06 GB, ratio: 0.95
```

![image](https://github.com/pytorch/pytorch/assets/9547562/36cd8687-a7f7-4ec6-b989-7e1263aa7d37)

This issue may also affect large models. Here's an example case of cudnn_convolution_backward autotuning allocating 30GB to tune a model otherwise using 5GB memory:
![image](https://github.com/pytorch/pytorch/assets/9547562/4e544b11-3579-4c69-811a-91d896f1ba66)
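
For reference, a hedged sketch of the kind of peak-memory measurement involved (not the benchmark harness code; the callable and its inputs are placeholders):

```python
import torch

def peak_gb(fn, *args):
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30

# cold call: autotuning can allocate aggressively and inflate the peak
# warm call: caches are populated, so the peak reflects steady-state usage
```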

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124326
Approved by: https://github.com/jansel
ghstack dependencies: #119411
2024-04-18 02:57:01 +00:00
Simon Fan
0ddd17bdc6 [benchmarks] Add --snapshot-memory to get memory pickles for eager vs compiled (#119411)
Creates memory snapshot pickles, e.g.:
```
inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_compiled_pytorch_stargan.pickle
inductor_no_cudagraphs_torchbench_amp_training_cuda_performance_eager_pytorch_stargan.pickle
```
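
These pickles come from PyTorch's memory-snapshot facility; a hedged sketch of capturing one by hand (the benchmark flag presumably wraps something similar, and the file name here is illustrative):

```python
import torch

torch.cuda.memory._record_memory_history()               # start recording allocations
# ... run the eager or compiled model here ...
torch.cuda.memory._dump_snapshot("stargan_eager.pickle")  # write the snapshot pickle
torch.cuda.memory._record_memory_history(enabled=None)    # stop recording
```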

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119411
Approved by: https://github.com/jansel
2024-04-18 02:57:01 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
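
For example, the autofix amounts to:

```python
# before: flagged by the RSE rule (parentheses on a raised exception with no arguments)
def check_before(x):
    if x < 0:
        raise ValueError()

# after the autofix:
def check_after(x):
    if x < 0:
        raise ValueError
```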

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
chunyuan
ec00daf4f1 [aotinductor] Fix benchmarks with self.autocast for run_performance_test (#123699)
## Pitch
Similar to https://github.com/pytorch/pytorch/pull/110490, which fixed `self.autocast` in the `check_accuracy` function, this PR fixes the `self.autocast` context in the `run_performance_test` function.

## Description
The code inside `check_accuracy` after the fix on https://github.com/pytorch/pytorch/pull/110490:
a4a49f77b8/benchmarks/dynamo/common.py (L2490-L2500)

The current code in `run_performance_test` on the main branch (before this PR) does not have the `self.autocast` context:
a4a49f77b8/benchmarks/dynamo/common.py (L2685-L2692)

For eager mode, the `model_iter_fn` (which is actually [forward_pass](e8ad5460c0/benchmarks/dynamo/huggingface.py (L556-L558))) is used in [warmup](e8ad5460c0/benchmarks/dynamo/common.py (L2690)) and [speedup_experiment](e8ad5460c0/benchmarks/dynamo/common.py (L648)). The `forward_pass` has the `self.autocast` context, so it can run in BF16 when AMP is on. For AOTInductor, however, we call `export_aot_inductor` in both [warmup](e8ad5460c0/benchmarks/dynamo/common.py (L2695)) and [speedup_experiment](e8ad5460c0/benchmarks/dynamo/common.py (L644-L646)), which does not have the `autocast` context and thus always runs in FP32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123699
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-11 01:40:44 +00:00
angelayi
298171df5c [benchmark] Add namedtuple pytree serialization (#123648)
Fixes https://github.com/pytorch/pytorch/pull/123388#issuecomment-2045289729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123648
Approved by: https://github.com/desertfire
2024-04-09 22:25:36 +00:00
Tugsbayasgalan Manlaibaatar
d78991a738 Make torch_geometric models compatible with export (#123403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123403
Approved by: https://github.com/angelayi
2024-04-05 20:58:16 +00:00
PyTorch MergeBot
8c7d8f0ff2 Revert "Make torch_geometric models compatible with export (#123403)"
This reverts commit 2ffab6e663.

Reverted https://github.com/pytorch/pytorch/pull/123403 on behalf of https://github.com/atalman due to Related issue basic_gnn_gin ([comment](https://github.com/pytorch/pytorch/pull/123403#issuecomment-2039817292))
2024-04-05 13:34:41 +00:00