pytorch/benchmarks
eellison 40a8770154 Incorporate coalesce analysis in codegen (#153751)
This PR uses the coalescing information when generating a tiling. Previously, the tiling heuristic had each dependency generate a tiling; we then summed the scores for each generated tiling, preferring any 2D tiling over the default. The new tiling heuristic scores each tiling by the amount of global memory it coalesces. This gives both a potentially better tiling (especially for more complicated 3D patterns) and information we can use when generating block sizes.
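As a rough illustration of scoring a tiling by coalesced memory, here is a minimal sketch. It is not Inductor's implementation: the function names, the `(shape, strides)` access description, and the rule "an access counts as coalesced only when its stride along the fastest-varying tiled dimension is 1" are all simplifying assumptions made for this example.

```python
from math import prod


def coalesced_elements(shape, strides, inner_dim):
    """Elements of one access that are read contiguously when `inner_dim`
    varies fastest across threads (hypothetical rule: stride must be 1)."""
    return prod(shape) if strides[inner_dim] == 1 else 0


def score_tiling(accesses, inner_dim):
    """Total coalesced elements over all accesses for one candidate tiling."""
    return sum(coalesced_elements(s, st, inner_dim) for s, st in accesses)


def pick_tiling(accesses, candidate_inner_dims):
    """Choose the candidate whose accesses coalesce the most memory."""
    return max(candidate_inner_dims, key=lambda d: score_tiling(accesses, d))


# Illustrative data only: one contiguous (1024, 32) tensor and two
# transposed views with the same logical shape.
example_accesses = [
    ((1024, 32), (32, 1)),    # contiguous: stride 1 along dim 1
    ((1024, 32), (1, 1024)),  # transposed: stride 1 along dim 0
    ((1024, 32), (1, 1024)),  # transposed: stride 1 along dim 0
]

if __name__ == "__main__":
    for dim in (0, 1):
        print(f"inner dim {dim}: score {score_tiling(example_accesses, dim)}")
    print("chosen inner dim:", pick_tiling(example_accesses, (0, 1)))
```

In this toy setup the two transposed accesses outweigh the single contiguous one, so the tiling that varies dim 0 fastest scores higher.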

In the Triton heuristics, when generating 3D tiled reductions, we take the same total block size that the 2D reduction would use, then distribute the block according to whichever block coalesces the most memory.
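One possible reading of "distribute the block" is sketched below. It assumes a power-of-two total block size and splits its log2 between two tiled dimensions in proportion to their coalescing scores. The function `split_block` and the proportional rule are hypothetical; the actual Triton heuristics may distribute the block differently.

```python
def split_block(total_block, score_x, score_y):
    """Split a power-of-two block into (x_block, y_block) with
    x_block * y_block == total_block, favoring the higher score.

    Hypothetical rule: the exponent of the total block size is shared
    between the two dimensions in proportion to their scores.
    """
    if total_block < 1 or total_block & (total_block - 1):
        raise ValueError("total_block must be a power of two")
    log2_total = total_block.bit_length() - 1
    total_score = score_x + score_y
    if total_score == 0:
        # No coalescing information: split as evenly as possible.
        log2_x = log2_total // 2
    else:
        # Integer round-half-up of log2_total * score_x / total_score.
        log2_x = (log2_total * score_x + total_score // 2) // total_score
    return 1 << log2_x, 1 << (log2_total - log2_x)


if __name__ == "__main__":
    for scores in [(1, 0), (3, 1), (1, 3), (0, 0)]:
        print(scores, "->", split_block(64, *scores))
```

The total number of elements per block stays fixed, which matches the description of reusing the 2D reduction's total block size; only the shape of the block changes.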

The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982, which is a 32-element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward pass per linear layer on a contiguous tensor, and once in the backward pass on a transposed tensor.

While the contiguous kernel has coalesced accesses and is performant on main, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See the [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. With this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
2025-06-04 00:22:57 +00:00
distributed/ddp [BE] Remove outdated RPC benchmark (#146716) 2025-03-29 04:44:36 +00:00
dynamo Incorporate coalesce analysis in codegen (#153751) 2025-06-04 00:22:57 +00:00
fastrnns [BE]: Enable ruff rule SIM113 (#147290) 2025-02-16 22:41:16 +00:00
framework_overhead_benchmark Fix unused Python variables outside torch/ and test/ (#136359) 2024-12-11 17:10:23 +00:00
functional_autograd_benchmark Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
fuser Fix unused Python variables outside torch/ and test/ (#136359) 2024-12-11 17:10:23 +00:00
gpt_fast [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546) 2025-02-27 20:46:16 +00:00
inductor_backends [Inductor]Cleanup autotune_fallback_to_aten post-deprecation (#154331) 2025-05-29 20:29:58 +00:00
inference [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
instruction_counts [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546) 2025-02-27 20:46:16 +00:00
nested Fix unused Python variables outside torch/ and test/ (#136359) 2024-12-11 17:10:23 +00:00
operator_benchmark add JSON output support for operator benchmark (#154410) 2025-06-03 21:29:24 +00:00
overrides_benchmark [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
profiler_benchmark Apply TorchFix TOR203 fixes (#143691) 2024-12-23 18:21:03 +00:00
record_function_benchmark [Caffe2]Remove Caffe2 scripts and benchmarks (#126747) 2024-06-05 23:46:31 +00:00
serialization Fix unused Python variables outside torch/ and test/ (#136359) 2024-12-11 17:10:23 +00:00
sparse Clean up conda usage in benchmark scripts (#152552) 2025-04-30 21:27:29 +00:00
static_runtime [3/N] Use internal linkage in C++ files (#151297) 2025-05-05 17:48:39 +00:00
tensorexpr Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00
transformer Add sparsity (#148513) 2025-03-07 01:47:52 +00:00
compare-fastrnn-results.py [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
compare.sh
README.md Removing conda references from PyTorch Docs (#152702) 2025-05-20 20:33:28 +00:00
upload_scribe.py Fix broken URLs (#152237) 2025-04-27 09:56:42 +00:00

PyTorch Benchmarks

This folder contains scripts that produce reproducible timings of various PyTorch features.

It also provides mechanisms to compare PyTorch with other frameworks.

Setup environment

Make sure you're on a machine with CUDA, torchvision, and PyTorch installed. Install in the following order:

# Install torchvision. It comes with the pytorch stable release binary
pip3 install torch torchvision

# Install the latest pytorch master from source.
# It should supersede the installation from the release binary.
cd $PYTORCH_HOME
python setup.py build develop

# Check the pytorch installation version
python -c "import torch; print(torch.__version__)"
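
To also confirm that torchvision is importable and that PyTorch can see a CUDA device, a small script along these lines may help. It is only a sketch: the helper name `describe_environment` is made up for this example, and it reports missing packages as `None` rather than failing.

```python
import importlib
import importlib.util


def describe_environment():
    """Report installed versions of torch/torchvision and CUDA visibility."""
    info = {}
    for name in ("torch", "torchvision"):
        if importlib.util.find_spec(name) is None:
            info[name] = None  # package is not installed
        else:
            info[name] = importlib.import_module(name).__version__
    if info["torch"] is not None:
        import torch

        info["cuda_available"] = torch.cuda.is_available()
    else:
        info["cuda_available"] = False
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `torch` reports the release version rather than your source build, the source install did not supersede the binary.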

Benchmark List

Please refer to each subfolder to discover each benchmark suite. Links are provided where descriptions exist: