PyTorch Benchmarks
This folder contains scripts that produce reproducible timings of various PyTorch features.
It also provides mechanisms to compare PyTorch with other frameworks.
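For a sense of what reproducible timing looks like in practice, below is a minimal sketch using torch.utils.benchmark, a timing utility shipped with PyTorch. It is illustrative only and not part of any suite in this folder; the operation and shapes are arbitrary.

import torch
from torch.utils import benchmark

# Time a small matmul. Timer performs warmup and synchronizes the device
# when the statement launches CUDA work, so runs are comparable.
x = torch.randn(1024, 1024)
timer = benchmark.Timer(
    stmt="x @ x",
    globals={"x": x},
    label="matmul 1024x1024",
)
print(timer.blocked_autorange(min_run_time=1.0))

blocked_autorange repeats the statement until at least min_run_time seconds of measurements have been collected, which keeps the reported numbers stable from run to run.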
Setup environment
Make sure you're on a machine with CUDA, torchvision, and PyTorch installed. Install them in the following order:
# Install torchvision. It comes with the pytorch stable release binary
pip3 install torch torchvision
# Install the latest pytorch master from source.
# It should supersede the installation from the release binary.
cd $PYTORCH_HOME
python setup.py build develop
# Check the installed PyTorch version
python -c "import torch; print(torch.__version__)"
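If the version check looks right, an optional sanity check (illustrative, not one of this folder's scripts) can also confirm that the source build is the one being imported and that CUDA is visible to it:

import torch
print(torch.__version__)          # should show a dev-style version string from the source build
print(torch.version.cuda)         # CUDA toolkit version PyTorch was compiled against (None for CPU-only builds)
print(torch.cuda.is_available())  # True if a usable GPU is visible to this build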
Benchmark List
Please refer to each subfolder to discover each benchmark suite. Links are provided where descriptions exist: