Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.
However, this does not work for the torchao backend, which changes the model in place; the baseline run would then also use the torchao backend because the model has already been quantized.
Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```
Differential Revision: D59332736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models whose failures I believe are not due to a real issue.
## sebotnet33ts_256
The accuracy test for this model started failing around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).
I cannot repro locally, but from the dashboard log:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
Raising the tolerance should fix it.
## DebertaForQuestionAnswering
This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```
From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```
A tolerance of 0.02 should suppress this error.
## gluon_inception_v3
This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```
From the error message on the dashboard:
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
Raising the tolerance should suppress this error.
## mobilenetv3_large_100
Fails on the dashboard in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only mobilenetv3_large_100
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same.
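For illustration, a minimal sketch of the idea (the thresholds and scaling factors below are assumptions for illustration, not the actual `torch._dynamo.utils.same` logic):
```python
def size_aware_multiplier(numel, base_multiplier=3.0):
    # Smaller tensors have fewer elements to average over, so their RMSE is
    # dominated by noise; compare them with a proportionally looser bound.
    if numel <= 10:
        return base_multiplier * 10.0
    if numel <= 500:
        return base_multiplier * 3.0
    return base_multiplier
```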
## yolov3
Fails on the dashboard with this error:
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
Fix it by using a larger multiplier for smaller tensors and raising the tolerance.
## timm_efficientdet
Fails on the dashboard with this error:
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with the command:
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training
```
Raising the tolerance should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
**Performance mode issue**: When the dynamo benchmark's performance warm-up fails, the result is not written into the csv file, but the accuracy result is written as `fail_to_run` even when the dynamo pass fails. So the number of models in the accuracy csv does not match the number in the performance csv.

- **Fix**: Models that fail warm-up are now recorded in the csv file as well.

**Accuracy mode issue**: `detectron2_fasterrcnn_r` models failed in accuracy mode but passed in performance mode. The accuracy failure is the same as in PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
File "benchmarks/dynamo/torchbench.py", line 449, in <module>
torchbench_main()
File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
main(TorchBenchmarkRunner(), original_dir)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
process_entry(0, runner, original_dir, args)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
return run(runner, args, original_dir)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

- **Fix**: Same as PR ee557d8f61, skip setting the batch_size to 4 when testing dynamic shapes.
The dynamic shapes pass rate improved from 89% to **95%**:
| Comp Item | Compiler | Suite | Before fix | After fix |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` failed with dynamic shape testing; we propose to skip dynamic batch size testing for this model in this PR.
* Error msg is
```
File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV
guards = output_graph.shape_env.produce_guards(
File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4).
```
* Root Cause is
This error happens while creating guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked`
I run it with TORCH_LOGS="+dynamic" and got the key line : `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)`
The reason for this error is that the batch dimension of `inputs_tensor` in the dynamic batch size test is marked as the dynamic shape `s0`, so the batch dimension of `scores`, which is produced by a series of operations on `inputs_tensor`, is also `s0`. However, the function that creates `attention_mask` runs not in Dynamo but in plain Python, so the batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked`, produced by a series of operations on `attention_mask`, is also the real shape `4` rather than the dynamic shape `s0`. The line `scores += position_bias_masked` then requires creating a guard to check whether the batch dimension of `scores` always equals the batch dimension of `position_bias_masked`, i.e. Eq(s0, 4), and the error happens.
So the root cause of this error is that the function creating `attention_mask` runs outside Dynamo. That happens because Dynamo has a graph break on this function (at the [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error:
`torch._dynamo.exc.Unsupported: Tensor.item`
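A minimal sketch that reproduces the same class of failure outside the model (the toy function and shapes are illustrative, not the transformers code):
```python
import torch
import torch._dynamo

x = torch.randn(4, 8)
torch._dynamo.mark_dynamic(x, 0)   # batch dim becomes a dynamic symbol s0
mask = torch.ones(4, 8)            # built in plain Python: concrete batch size 4

@torch.compile(backend="eager")
def add_bias(scores, position_bias_masked):
    return scores + position_bias_masked   # broadcasting adds a guard Eq(s0, 4)

try:
    add_bias(x, mask)
except Exception as e:
    print(type(e).__name__, e)   # expect a ConstraintViolationError like the one above
```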
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
Fixes
> ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE
Error that would occur when composing pt2 fsdp and cudagraphs. Cudagraphs caches output tensor impls in the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations.
From the code comment:
```
# this output represents a fresh allocated tensor.
# We return the same TensorImpl from run to run to avoid overhead.
# autograd.Function will reset the Autograd meta of output tensors
# as part of aot_autograd, but _backward_hooks are stored on tensors separately,
# so we need to manually reset hooks.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914
Approved by: https://github.com/awgu, https://github.com/xmfan
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
# Motivation
## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.
So we intend to deprecate them and **strongly recommend** developers to use `torch.amp.GradScaler`.
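A minimal usage sketch of the recommended spelling (the commented training-step names below are illustrative):
```python
import torch

# Deprecated, device-specific spellings:
#   scaler = torch.cuda.amp.GradScaler()
#   scaler = torch.cpu.amp.GradScaler()

# Recommended, device-generic spelling (equivalent per the mapping above):
scaler = torch.amp.GradScaler("cuda")

# Typical training-step usage:
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()
```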
## for `custom_fwd` and `custom_bwd`,
these decorators let a custom autograd function run correctly whether or not it is in an autocast-enabled region, and the mechanism can be shared by other backends, like CPU and XPU.
So we generalize them to be device-agnostic, put them into `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.
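A minimal sketch of the device-agnostic decorators on a custom autograd function (the function itself is illustrative; `device_type` is the new required argument, and `cast_inputs` mirrors the original `torch.cuda.amp.custom_fwd` option):
```python
import torch

class MyMatMul(torch.autograd.Function):
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda", cast_inputs=torch.float32)
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a @ b

    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        return grad_out @ b.t(), a.t() @ grad_out
```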
# Additional Context
Add UT to cover the deprecated warning.
No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`, the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them.
To facilitate review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`; the follow-up covers `custom_fwd` and `custom_bwd`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953
Approved by: https://github.com/desertfire
ghstack dependencies: #125917
Summary: This change introduces a new flag to perform a "warm start" test from the benchmark harness. The idea is to test a model twice: first with a fresh inductor cache (i.e., a "cold start"), and then a second run in a fresh process with the cache available (i.e., a "warm start"). We can later add this mode to CI runs to collect compile times for warm start.
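For illustration, a run along these lines would exercise the new mode (the exact flag combination is an assumption, mirroring commands used elsewhere in these notes):
```
python benchmarks/dynamo/huggingface.py --performance --inference --backend inductor --only AlbertForMaskedLM --warm-start-latency
```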
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125353
Approved by: https://github.com/eellison, https://github.com/desertfire
# Motivation
As discussed in [#124479](https://github.com/pytorch/pytorch/pull/124479), `torch.amp.autocast` can NOT be completely equivalent to `torch.cuda.amp.autocast` and `torch.cpu.amp.autocast`, since `torch.amp.autocast` does not have a default `dtype` for CPU (`torch.bfloat16` by default) and CUDA (`torch.float16` by default) respectively. We would like `torch.amp.autocast` to be more generic to help developers/customers write device-agnostic code, because there are not enough reasons to add a device-specific autocast `torch.xxx.amp.autocast` for each device backend.
# Solution
When `None` is passed to `dtype`, we should use `torch.get_autocast_dtype` to get the related dtype for each backend. Meanwhile, `torch.get_autocast_dtype` needs to be supported in the JIT path for BC.
# Additional Context
With this PR, `torch.amp.autocast(device_type='cuda')` is equivalent to `torch.cuda.amp.autocast`.
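A quick sketch of the resulting behavior (using `torch.get_autocast_dtype` as referenced in this PR; the printed defaults assume the autocast dtypes have not been changed):
```python
import torch

# With dtype left as None, autocast resolves to the backend's default dtype:
# torch.float16 on CUDA, torch.bfloat16 on CPU.
with torch.amp.autocast(device_type="cuda"):
    pass

print(torch.get_autocast_dtype("cuda"))  # torch.float16
print(torch.get_autocast_dtype("cpu"))   # torch.bfloat16
```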
Add two new UTs to cover this change in eager and jit path respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125103
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui
A common complaint when working with data-dependent code in PyTorch is that it's hard to tell how far you are from the finish line: every time a GuardOnDataDependentSymNode error is hit, you have to somehow fix or workaround it to see the next one.
This PR adds a new mode `torch._functorch.config.fake_tensor_propagate_real_tensors` which modifies fake tensors to also propagate real tensors. This means that when we try to guard on a data-dependent SymNode, we can actually produce a real result. We also produce a warning which you should consult to figure out what the crux points are.
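A sketch of opting in (the config name comes from this PR; the dynamo flag and the toy function are illustrative):
```python
import torch
import torch._dynamo.config as dynamo_config
import torch._functorch.config as functorch_config

functorch_config.fake_tensor_propagate_real_tensors = True
dynamo_config.capture_dynamic_output_shape_ops = True  # trace nonzero() instead of graph-breaking

def f(x):
    n = x.nonzero().shape[0]   # data-dependent (unbacked) size
    return torch.arange(n)

# With the mode on, data-dependent guards hit during tracing can consult the
# real tensor, emitting "propagate_real_tensors" warnings like the ones below.
print(torch.compile(f)(torch.tensor([1, 0, 2, 0, 3])))
```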
I ran this on vision_maskrcnn. In the baseline (without this mode), the model has 27 graph breaks, resulting in 40 graphs. With this mode on, the model has only 11 graph breaks, resulting in 15 graphs (the remaining graph breaks are due to missing functionality for item() on float tensor and some other Dynamo missing features.) You get a list of things that would have errored like this:
```
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u0), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u1) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u1), 1)) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Ne(Max(1, u1), 1)) -> True
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Max(1, u0) < 2) -> False
WARNING:torch.fx.experimental.symbolic_shapes:propagate_real_tensors evaluate_expr(Eq(Max(1, u0), 1)) -> False
```
Potential later follow ups:
* Improve the warning messages (in particular, should provide user frames)
* GC real tensors when they are no longer needed by tracing. Right now, this will use A LOT of memory, equal to as if your GC was broken and every intermediate tensor was kept live
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125115
Approved by: https://github.com/IvanKobzarev
Biggest movement is 4% on HF inference and 9% on TIMM inference. Note that this is max-autotune mode, so we are more tolerant of compilation time increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```
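For example, something along these lines could cap the search (assuming the knob quoted above is exposed on the inductor config module):
```python
import torch._inductor.config as inductor_config

# Assumption: the knob quoted above lives on torch._inductor.config.
inductor_config.max_epilogue_benchmarked_choices = 1
```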
There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.
Inference:
<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">
Training:
<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
Measuring peak memory on the first run can capture cases where compiled artifacts leak into runtime, but it also introduces a lot of noise from cudnn/triton autotuning, which generally uses as much memory as it can. Setting this flag as a default will need some discussion, so for now I am only adding it to unblock compiled backward benchmarking (where all autotuning memory use is exposed).
```
e.g. resnet50
# without --warm-peak-memory
memory: eager: 1.95 GB, dynamo: 6.68 GB, ratio: 0.29
# with --warm-peak-memory
memory: eager: 1.96 GB, dynamo: 2.06 GB, ratio: 0.95
```
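For example, a run along these lines enables the new behavior (flags other than `--warm-peak-memory` mirror commands used elsewhere in these notes):
```
python benchmarks/dynamo/torchbench.py --performance --training --backend inductor --only resnet50 --warm-peak-memory
```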

This issue may also affect large models. Here is an example where cudnn_convolution_backward autotuning allocated 30 GB to tune a model that otherwise uses 5 GB of memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124326
Approved by: https://github.com/jansel
ghstack dependencies: #119411
Summary:
Replacing `torch._export.aot_compile` callsites with
```
ep = torch.export._trace._export(.., predispatch=True) # Traces the given program into predispatch IR
so_path = torch._inductor.aot_compile_ep(ep, ...) # Takes an exported program and compiles it into a .so
```
This allows us to explicitly split up the export step from AOTInductor. We can later modify tests to do `export + serialize + deserialize + inductor` to mimic internal production use cases better.
Test Plan: CI
Differential Revision: D54808612
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122225
Approved by: https://github.com/SherlockNoMad, https://github.com/khabinov
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 python benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```
sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` failed with dynamic shape testing; we propose to skip dynamic batch size testing for this model in this PR.
* Error msg is
```
File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
* Root Cause is
The benchmark code only annotates an input dim as dynamic when its size equals the batch size (c617e7b407/benchmarks/dynamo/common.py (L3867-L3871)). If it fails to find any dim equal to the batch size, the above error is thrown.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:
```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124., 82., ..., 3., 4., 5.],
[125., 104., 65., ..., 3., 3., 4.],
[ 87., 68., 34., ..., 2., 2., 2.],
...,
[ 47., 45., 41., ..., 45., 45., 45.],
[ 46., 44., 40., ..., 44., 45., 46.],
[ 46., 44., 40., ..., 43., 45., 46.]],
[[154., 129., 84., ..., 3., 4., 5.],
[133., 110., 69., ..., 3., 3., 4.],
[ 95., 76., 43., ..., 2., 2., 2.],
...,
[ 44., 42., 38., ..., 34., 37., 39.],
[ 43., 41., 37., ..., 35., 39., 41.],
[ 43., 41., 37., ..., 35., 40., 43.]],
[[171., 140., 85., ..., 3., 4., 5.],
[147., 120., 71., ..., 3., 3., 4.],
[103., 83., 47., ..., 2., 2., 2.],
...,
[ 46., 44., 40., ..., 16., 20., 22.],
[ 45., 43., 39., ..., 17., 22., 26.],
[ 45., 43., 39., ..., 18., 24., 28.]]])}, ... ],)
```
None of the input dims equal the input batch size, so I think we need to skip dynamic batch size testing for this model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
Fix https://github.com/pytorch/pytorch/issues/120545. These models fail the accuracy test with freezing because of conv-batchnorm fusion, which causes relatively large numerical differences.
For the failed TIMM models, raising the tolerance to `8 * 1e-2` can make the test pass.
For the failed TB models, the numerical difference is too large. After a discussion with @eellison, we decided to skip them with freezing for now.
On the other hand, we should probably dig into why the conv-bn fusion causes such a large numerical difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` failed with dynamic shape testing; we propose to skip dynamic batch size testing for these 3 models in this PR.
* Error msg is
```
File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```
* Root Cause is
* The benchmark code only annotates an input dim as dynamic when its size equals the batch size (c617e7b407/benchmarks/dynamo/common.py (L3867-L3871)). If it fails to find any dim equal to the batch size, the above error is thrown.
* However, for these 3 models, none of the input dims equal the input batch size because of the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16)) shown below (a quick numeric check follows this list):
```
shape = (
math.ceil(2 * size ** (1/3)),
math.ceil(2 * size ** (1/3)),
math.ceil(0.25 * size ** (1/3)),
)
```
* Another thing: `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing because the batch size is set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.
* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`. Per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, this does not look expressible, so I think we need to skip dynamic batch size testing for these 3 models.
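A quick numeric check of the shape formula quoted above, for the performance batch size 1048576 and the accuracy batch size 4:
```python
import math

def pyhpc_shape(size):
    # mirrors the torchbench shape formula quoted above
    return (
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(0.25 * size ** (1 / 3)),
    )

print(pyhpc_shape(1048576))  # (204, 204, 26) -- no dim equals the batch size 1048576
print(pyhpc_shape(4))        # (4, 4, 1)      -- dim 0 happens to equal the accuracy batch size 4
```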
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
Prior to onnx export, the model is deepcopied to avoid modifications that may affect later performance profiling. However this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
- Removes an outdated assert that prevented perf tests from running with DDP; we now have single-node --multiprocess, and perf tests already wrap the model using `deepcopy_and_maybe_ddp`
- Append rank name to traces to avoid all ranks trying to create the same file
- Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332
Approved by: https://github.com/H-Huang, https://github.com/wconstab
Adds a `--compile-autograd` flag to the benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to the dynamo stats.
e.g. accuracy_inductor.csv
```
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
...
```
e.g. speedup_inductor.csv
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1
cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196
Approved by: https://github.com/jansel
Sometimes the first statement that sets this variable in the try block fails due to an out-of-memory error, and the finally block then tries to delete a variable that was never assigned.
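A minimal sketch of the failure pattern (the names below are illustrative, not the benchmark code):
```python
def run_once():
    raise RuntimeError("CUDA out of memory")  # simulate the failing first statement

def measure():
    try:
        latency = run_once()   # may raise before `latency` is ever bound
    finally:
        del latency            # UnboundLocalError when the assignment never happened

try:
    measure()
except UnboundLocalError as e:
    # the cleanup error masks the original OOM failure
    print("masked the real failure:", e)
```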
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260
Approved by: https://github.com/lezcano
Summary: Right now, when load_model fails (either because of a loading error or a validation eager-run failure), the result won't be logged in the generated csv files. Let's log them in the csv so that they are monitored by the expected results checking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114784
Approved by: https://github.com/malfet
Due to some upstream updates, the previous hack no longer picks up model names; update it to use the other, more appropriate variable.
Also fix a bug with an unused argument that was supposed to be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115108
Approved by: https://github.com/thiagocrepaldi
Previously both the `optimize_ctx` call and the `experiment` call would do export and session creation, doubling the resource cost. This PR makes the `experiment` call reuse the onnx model created by `optimize_ctx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114907
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #110178
Attempt number 2 at https://github.com/pytorch/pytorch/issues/108950.
Improves debugging for guard failures/recompilations by:
- only running guard fail reason generation during recompilation, instead of when a guard fails during dynamo cache lookup (so generating guard failure reasons is not on the critical path)
- ~~always reporting all guard failures~~ Reports the first-failing guard failure for each cache entry.
We don't expect a performance hit since the guard fail reasons are only generated at recompile time rather than runtime. Perf benchmark to check this (https://hud.pytorch.org/benchmark/torchbench/inductor_with_cudagraphs?startTime=Fri,%2027%20Oct%202023%2017:42:43%20GMT&stopTime=Fri,%2003%20Nov%202023%2017:42:43%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/williamwen42/62/head&lCommit=f4724f5ffc6d17ceae513a42fc18627be7b85482&rBranch=main&rCommit=29f3d392bf230072e3bffae37b078e770cae1956). We may also need to verify this on benchmarks where guard fails are common.
Sample script:
```python
import torch
def generate_data(b):
return (
torch.randn(b, 3, 32, 32).to(torch.float32).cuda(),
torch.randint(1000, (b,)).cuda(),
)
from torchvision.models import resnet18
def init_model():
return resnet18().to(torch.float32).cuda()
model = init_model()
model_opt = torch.compile(model, dynamic=False)
for b in range(16, 32):
data = generate_data(b)
model_opt(data[0])
```
Sample logs:
```bash
(/data/users/williamwen/py310-env) [williamwen@devgpu020.odn1 /data/users/williamwen/pytorch (wwen/log-all-guards)]$ python playground5.py
/data/users/williamwen/pytorch/torch/_inductor/compile_fx.py:141: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] function: 'forward' (/data/users/williamwen/torchvision/torchvision/models/resnet.py:284)
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] last reason: tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2023-11-06 14:50:47,605] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
(/data/users/williamwen/py310-env) [williamwen@devgpu020.odn1 /data/users/williamwen/pytorch (wwen/log-all-guards)]$ TORCH_LOGS="recompiles" python playground5.py
/data/users/williamwen/pytorch/torch/_inductor/compile_fx.py:141: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:53:31,591] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 17
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 18
[2023-11-06 14:53:41,333] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 18
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 19
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 19
[2023-11-06 14:53:50,463] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 19
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 20
[2023-11-06 14:53:59,848] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 20
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 20, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 21
[2023-11-06 14:54:08,549] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 21
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 21, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 20, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 22
[2023-11-06 14:54:17,795] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 22
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 22, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 21, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 20, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 23
[2023-11-06 14:54:27,430] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 23
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function forward in /data/users/williamwen/torchvision/torchvision/models/resnet.py:284
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 23, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 22, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 21, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 20, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 19, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 18, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 17, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] function: 'forward' (/data/users/williamwen/torchvision/torchvision/models/resnet.py:284)
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] last reason: tensor 'L['x']' size mismatch at index 0. expected 16, actual 24
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[2023-11-06 14:54:36,744] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:45,922] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 25
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 26
[2023-11-06 14:54:54,691] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 26
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 27
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 27
[2023-11-06 14:55:03,591] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 27
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 27, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 28
[2023-11-06 14:55:12,384] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 28
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 28, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 27, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 29
[2023-11-06 14:55:21,442] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 29
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 29, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 28, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 27, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 30
[2023-11-06 14:55:30,315] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 30
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] Recompiling function _forward_impl in /data/users/williamwen/torchvision/torchvision/models/resnet.py:266
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] triggered by the following guard failure(s):
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 30, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 29, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 28, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 27, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 26, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 25, actual 31
[2023-11-06 14:55:39,839] torch._dynamo.guards.__recompiles: [DEBUG] - tensor 'L['x']' size mismatch at index 0. expected 24, actual 31
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110325
Approved by: https://github.com/ezyang, https://github.com/jon-chuang
In PyTorch 2.1, the torch.export API was introduced and the term "export" got overloaded due to the already existing torch.onnx.export API.
The torch.onnx.dynamo_export API was introduced in PyTorch 2.0 and exposed a torch.onnx.ExportOutput, which can now be confused with torch.export.export output.
To prevent such ambiguity and standardize names around the new torch.export.ExportedProgram, this PR renames torch.onnx.ExportOutput to torch.onnx.ONNXProgram.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112263
Approved by: https://github.com/BowenBao
ghstack dependencies: #112444
Custom classes that are serialized with pytree are serialized by default with `f”{class.__module__}.{class.__name__}”`. This is a dependency from our serialized program directly into the outer Python environment. If a user moves the class to a different directory, the serialized program will be unable to be loaded. So, we will require users to pass in an FQN if they want to serialize their custom treespec type.
Differential Revision: [D50886366](https://our.internmc.facebook.com/intern/diff/D50886366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112428
Approved by: https://github.com/suo
In dynamo/inductor, sometimes it helps to gather metrics/statistics for each model at different levels: model level, graph level, kernel level, or pair-of-fusion-nodes level. This kind of thing would be very easy to do with Scuba, but we only have Scuba in fbcode. This PR builds metric tables to solve part of the problem.
Q: why not log to stdout/err directly
A: sometimes we need more structured data. E.g., it would be helpful to gather all the stats in a CSV and then do post-processing (like calculating a geomean etc.). Also, the metric table tags each row with the model name, which is helpful.
Q: what's the difference with speedup_inductor.csv
A: speedup_inductor.csv is a special case that gathers statistics at the model level, i.e., we have one row for each model. But recording statistics at a finer-grained level, like per graph, is also helpful.
Example use cases:
- As a follow-up on the benchmark fusion PR, I want to gather all the 'slow' fusions and analyze them. With the metric table, I can easily log slow fusions for each model into a csv file. Here is the log gathered for huggingface:
https://gist.github.com/shunting314/964e73cc98368b301414ec7b7ad4c702 .
- To help understand the effect of the 'loop ordering after fusion' PR, it would be helpful to gather stats like how many fusions happen for each graph. Previously we logged these metrics to stderr directly, but logging them in a structured way is more useful.
- Gather the number of registers, register spills, and shared memory usage for each kernel in each model, with runnable kernel code logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109245
Approved by: https://github.com/jansel, https://github.com/mlazos
Updates `_export.aot_compile` to pass a torch IR graph to inductor, allowing inductor to now run the pre_grad_passes, and reuse more of inductor's code.
Also updates the API to only return the `so_path`, rather than also returning the exported program. The pytree call spec is now serialized and placed inside the generated model code. When calling the model, because there is no C++ pytree implementation linked yet, we can access the call specs through `get_call_spec()` and call pytree flatten/unflatten in Python.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110020
Approved by: https://github.com/desertfire
The default size-based auto wrap policy may not be representative of actual usage of the models. We add support for a few handpicked models and fall back to the size-based policy.
sample command:
`PYTHONPATH=~/benchmark/ python benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --multiprocess --performance --only nanogpt --fsdp`
1.257x
1.256x
1.257x
1.252x
1.257x
1.262x
1.258x
1.272x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111505
Approved by: https://github.com/H-Huang, https://github.com/xuzhao9
Did some easy fixes from enabling TRY200. Most of these seem like oversights rather than intentional. The proper way to silence intentional errors is with `from None` to note that you thought about whether it should contain the cause and decided against it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
Summary:
- Remove onnx bench related scripts and `_onnx` folder.
- Update `common.py` to include onnx related patches previously under `_onnx` folder.
- Update `merge_rules.json` to include bench files.
- Added quick sanity onnx bench test to onnx CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103983
Approved by: https://github.com/kit1980
Fix https://github.com/pytorch/pytorch/issues/109736 .
The HF pin move caused a regression in the accuracy check for HF models on the dashboard. Manually reverting the HF PR ( https://github.com/huggingface/transformers/pull/24696/files ) could recover it, but that may hide some real issue. I happened to find that using a warm matmul max-autotune cache can work around the issue. Putting it another way:
- making all calls to check_cache miss the cache reproduces the issue
- making all calls to check_cache hit the cache works around the issue
I did a sort of 'bisect', halving the number of cache misses each time while still making sure we could repro. Luckily, reducing to a single cache miss still reproduced the issue. With more debugging, it turns out that it's the call to `torch.randn` on the cuda device causing the problem.
The fix is to make sure we restore the rng state when we generate random inputs for max-autotune benchmarking.
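A sketch of the idea behind the fix (illustrative only, not the actual inductor change):
```python
import torch

def randn_preserving_rng(shape, device="cuda"):
    # Generate autotuning inputs without perturbing the RNG state that the
    # model run will observe afterwards.
    state = torch.cuda.get_rng_state(device)
    try:
        return torch.randn(shape, device=device)
    finally:
        torch.cuda.set_rng_state(state, device)

if torch.cuda.is_available():
    before = torch.cuda.get_rng_state()
    _ = randn_preserving_rng((128, 128))
    assert torch.equal(before, torch.cuda.get_rng_state())
```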
TBH, I cannot fully explain the root cause, although I know it's caused by the rng state change. AOTAutograd already has some logic to preserve the rng state, and I cannot repro the issue in unit tests. I have a few guesses why the RNG state is not restored in the first place after we generate random inputs for max-autotune:
- maybe AOTAutograd misses some corner case to preserve the rng state
- maybe for the failed models, there are some eager fallbacks that are not handled by inductor. If those fallbacks call random-number-related APIs, we will see the issue. But again, I haven't found a good way to simulate this.
Repro:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 CUDA_VISIBLE_DEVICES=3 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only PLBartForCausalLM --training --cold-start-latency
```
We always repro the issue without the PR but pass the accuracy check with the PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109828
Approved by: https://github.com/eellison
Summary:
Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits:
* It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases where we still have to codegen extra copy_ ops to fill the pre-allocated output tensors, which doesn't make sense for performance.
* With the coming enhanced memory planning, this again will make sure the memory planning logic is the same between AOTInductor and Inductor, which will greatly simplify the problem and improve the reliability.
This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.
Differential Revision: D49502318
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
`python benchmarks/dynamo/torchbench.py --multiprocess` currently fails due to initializing distributed multiple times:
```
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:6789 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:6789
(errno: 98 - Address already in use).
```
Because torchbench calls itself via mp.spawn, there is the parent run (with `--multiprocess`) and child runs (with `--multiprocess --only <model>`).
This PR addresses this by fixing two issues:
1) distributed was initialized once in the parent run and once in the child runs; it should be initialized only in the child runs, where we have accurate rank and world size info
2) torchbench sometimes overrides CUDA_VISIBLE_DEVICES/world_size, but it shouldn't for distributed use cases where we want to use all available GPUs
I am also adding a CI test to cover this type of issue in #109311
### Test plan
parent run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess`
child run test: `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --output /home/xmfan/local/pytorch/test/test-reports/inference_torchbench.csv --multiprocess --only simple_gpt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109657
Approved by: https://github.com/H-Huang
Previously, the code for passing inputs to exported program was:
```
if kwargs:
return (args, kwargs)
else:
return args
```
However, this causes some inconsistency: if the original input contains both args and kwargs, the treespec is a tuple containing a tuple of arguments and a dictionary of keyword arguments, but if the original input only contained args, the treespec is just a tuple of arguments. This inconsistency causes some inconvenience at runtime.
So I updated the code to just always keep the kwargs around.
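A sketch of the resulting behavior (illustrative, not the exact diff):
```python
def combine_args_kwargs(args, kwargs):
    # always keep kwargs so the input treespec has a consistent shape
    return (args, kwargs or {})

print(combine_args_kwargs((1, 2), {}))          # ((1, 2), {})
print(combine_args_kwargs((1, 2), {"k": 3}))    # ((1, 2), {'k': 3})
```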
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109160
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify the usage of the runtime. The interface scrutiny will come in later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
**This PR is a 99% copy-paste of Sam Gross's** (@colesbury) work at https://github.com/pytorch/pytorch/pull/100642. Copied from there
--------
The NN_MODULE guard now subsumes guards on Module attributes. The check_fn will fail if module attributes are changed (such as Module.training), if parameters, submodules, or buffers are added or removed, or if fields are changed on the type itself.
This gives up specificity in the guard check -- if any field is changed the check_fn fails -- for faster overall checks.
-----
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108528
Approved by: https://github.com/ezyang