# torch.compile() Benchmarking
This directory contains benchmarking code for TorchDynamo and many backends including TorchInductor. It includes three main benchmark suites:
- TorchBenchmark: A diverse set of models, initially seeded from highly cited research models as ranked by Papers With Code. See the torchbench installation instructions and `torchbench.py` for the low-level runner. The `Makefile` also contains the commands needed to set up TorchBenchmark to match the versions used in PyTorch CI.
- Models from HuggingFace: Primarily transformer models, with representative models chosen for each category available. The low-level runner (`huggingface.py`) automatically downloads and installs the needed dependencies on first run.
- Models from TIMM: Primarily vision models, with representative models chosen for each category available. The low-level runner (`timm_models.py`) automatically downloads and installs the needed dependencies on first run.
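All three runners are driven the same way from the command line; the flags described under "Running Locally" below apply to each. As a minimal sketch (the model name and output filename here are illustrative, not taken from any dashboard configuration), a quick single-model inference measurement with the HuggingFace runner looks like:

```bash
# The first run downloads and installs the suite's dependencies automatically.
./benchmarks/dynamo/huggingface.py --performance --inference --bfloat16 \
    --backend=inductor --only=BertForMaskedLM --output=hf_single_model.csv
```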
## GPU Performance Dashboard
Daily results from the benchmarks here are available in the TorchInductor Performance Dashboard, currently run on an NVIDIA A100 GPU.
The `inductor-perf-test-nightly.yml` workflow generates the data in the performance dashboard. If you have the needed permissions, you can benchmark your own branch on the PyTorch GitHub repo as follows:
- Select "Run workflow" in the top right of the workflow page
- Select the branch you want to benchmark
- Choose the options (such as training vs inference)
- Click "Run workflow"
- Wait for the job to complete (4 to 12 hours depending on backlog)
- Go to the dashboard
- Select your branch and commit at the top of the dashboard
The dashboard compares two commits: a "Base Commit" and a "New Commit".
An entry such as 2.38x → 2.41x means that the performance improved
from 2.38x in the base to 2.41x in the new commit. All performance
results are normalized to eager mode PyTorch (1x), and higher is better.
## CPU Performance Dashboard
The TorchInductor CPU Performance Dashboard is tracked on a GitHub issue and updated periodically.
## Running Locally
Raw commands used to generate the data for the performance dashboards can be found here.
To summarize, there are three scripts, one to run each set of benchmarks:
```bash
./benchmarks/dynamo/torchbench.py ...
./benchmarks/dynamo/huggingface.py ...
./benchmarks/dynamo/timm_models.py ...
```
Each of these scripts takes the same set of arguments. The ones used by dashboards are:
- `--accuracy` or `--performance`: selects between checking correctness and measuring speedup (both are run for the dashboard).
- `--training` or `--inference`: selects between measuring training or inference (both are run for the dashboard).
- `--device=cuda` or `--device=cpu`: selects the device to measure.
- `--amp`, `--bfloat16`, `--float16`, `--float32`: selects the precision to use; `--amp` is used for training and `--bfloat16` for inference.
- `--cold-start-latency`: disables caching to accurately measure compile times.
- `--backend=inductor`: selects TorchInductor as the compiler backend to measure. Many more are available, see `--help`.
- `--output=<filename>.csv`: where to write results to.
- `--dynamic-shapes --dynamic-batch-only`: used when the `dynamic` config is enabled.
- `--disable-cudagraphs`: used by configurations without cudagraphs enabled (default).
- `--freezing`: enables additional inference-only optimizations.
- `--cpp-wrapper`: enables C++ wrapper code to lower overheads.
- `TORCHINDUCTOR_MAX_AUTOTUNE=1` (environment variable): used to measure max-autotune mode, which is run weekly due to longer compile times (see the sketch after this list).
- `--export-aot-inductor`: benchmarks ahead-of-time compilation mode.
- `--total-partitions` and `--partition-id`: used to parallelize benchmarking across different machines.
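For instance, a max-autotune inference measurement combines the environment variable with the flags above. This is a sketch under those assumptions, not the exact dashboard invocation; the output filename is arbitrary:

```bash
# Max-autotune mode: longer compile times, measured with caching disabled.
TORCHINDUCTOR_MAX_AUTOTUNE=1 ./benchmarks/dynamo/torchbench.py \
    --performance --inference --bfloat16 --cold-start-latency \
    --backend=inductor --output=torchbench_inference_max_autotune.csv
```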
For debugging, you can run just a single benchmark by adding the `--only=<NAME>` flag.
A complete list of options can be seen by running each of the runners with the `--help` flag.
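A minimal sketch of such a single-model run, here checking accuracy for one TorchBench model (the model name and output filename are illustrative):

```bash
# Compare the compiled model's numerics against eager mode for one model only.
./benchmarks/dynamo/torchbench.py --accuracy --inference --bfloat16 \
    --backend=inductor --only=resnet50 --output=resnet50_accuracy.csv
```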
As an example, the commands to run the first line of the dashboard (performance only) would be:
```bash
./benchmarks/dynamo/torchbench.py --performance --training --amp --backend=inductor --output=torchbench_training.csv
./benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend=inductor --output=torchbench_inference.csv
./benchmarks/dynamo/huggingface.py --performance --training --amp --backend=inductor --output=huggingface_training.csv
./benchmarks/dynamo/huggingface.py --performance --inference --bfloat16 --backend=inductor --output=huggingface_inference.csv
./benchmarks/dynamo/timm_models.py --performance --training --amp --backend=inductor --output=timm_models_training.csv
./benchmarks/dynamo/timm_models.py --performance --inference --bfloat16 --backend=inductor --output=timm_models_inference.csv
```
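The CPU dashboard runs use the same runners with `--device=cpu`. A sketch of one such command, assuming the CPU-oriented flags described above (the exact dashboard configuration may differ):

```bash
# CPU inference measurement with freezing and the C++ wrapper enabled.
./benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --device=cpu \
    --freezing --cpp-wrapper --backend=inductor --output=torchbench_inference_cpu.csv
```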