pytorch/benchmarks
Jerry Zhang a962ae511d Extend gpt-fast LLM dashboard to support torchao autoquant (#140627)
Summary:
We want to test autoquant on relevant LLM models.

Right now this covers only Llama 2 and Mixtral, but we want to extend it to more models, such as those in https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models.
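
For reference, applying torchao autoquant generally looks like the following. This is a minimal sketch based on torchao's documented usage; the benchmark script wires this into its own model loading, and the stand-in module here is illustrative only:
```
import torch
import torchao

# Stand-in module; in the dashboard this would be Llama-2 or Mixtral.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# autoquant wraps the compiled model; the first forward pass benchmarks
# candidate quantization kernels per layer and keeps the fastest one.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

x = torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16)
model(x)  # warm-up run triggers kernel selection
```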

Test Plan:

```
(avg tokens/sec)       Llama-2-7b-chat-hf   Mixtral-8x7B-v0.1
gpt-fast int8                      112.98              147.92
torchao autoquant                   87.41               85.90
torchao autoquantv2                131.12               79.59
```

Results are published on the PyTorch LLM dashboard: https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch

In pytorch/benchmarks/gpt_fast, run:
```
python benchmark.py
```

Output:
```
Loading model Llama-2-7b-chat-hf
Using int8 weight-only quantization!
Time to load model: 2.80 seconds
Compilation time: 170.24 seconds
Average tokens/sec: 112.98 tokens/sec
Average bandwidth achieved: 746.86 GB/s
Memory used: 7.95 GB

Loading model Mixtral-8x7B-v0.1
Using int8 weight-only quantization!
Time to load model: 0.24 seconds
Compilation time: 181.81 seconds
Average tokens/sec: 147.92 tokens/sec
Average bandwidth achieved: 953.06 GB/s
Memory used: 32.45 GB

Loading model Llama-2-7b-chat-hf
Time to load model: 0.11 seconds
Using autoquant
Compilation time: 109.31 seconds
Average tokens/sec: 87.17 tokens/sec
Average bandwidth achieved: 1151.86 GB/s
Memory used: 32.45 GB

Loading model Llama-2-7b-chat-hf
Time to load model: 0.11 seconds
Compilation time: 48.08 seconds
Average tokens/sec: 87.41 tokens/sec
Average bandwidth achieved: 1155.05 GB/s
Memory used: 36.86 GB

Loading model Mixtral-8x7B-v0.1
Time to load model: 0.20 seconds
Using autoquant
Compilation time: 47.32 seconds
Average tokens/sec: 85.90 tokens/sec
Average bandwidth achieved: 1106.37 GB/s
Memory used: 66.81 GB

local test (autoquant v2):
Loading model Mixtral-8x7B-v0.1
Compilation time: 124.40 seconds
Average tokens/sec: 90.41 tokens/sec
Average bandwidth achieved: 1164.47 GB/s
Memory used: 53.91 GB

Loading model Llama-2-7b-chat-hf
TODO

```
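
As an aside, the "Average bandwidth achieved" figure in this style of benchmark is usually derived from decode throughput rather than measured directly: generating one token reads every weight once, so bandwidth is roughly model bytes times tokens/sec. A sketch of that calculation, assuming that formulation (the exact accounting in benchmark.py may differ):
```
import torch

def achieved_bandwidth_gbs(model: torch.nn.Module, tokens_per_sec: float) -> float:
    # Decode is memory-bound: each generated token touches all weights once,
    # so achieved bandwidth ~= total parameter bytes * tokens per second.
    model_bytes = sum(p.numel() * p.dtype.itemsize for p in model.parameters())
    return model_bytes * tokens_per_sec / 1e9
```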

gpt_fast_benchmark.csv:
```
name,metric,target,actual,dtype,device,arch,is_model
Llama-2-7b-chat-hf,token_per_sec,144,112.98,int8,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,746.86,int8,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),136,170.24,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,token_per_sec,175,147.92,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,953.06,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,compilation_time(s),133,181.81,int8,cuda,NVIDIA PG509-210,True
gemv,memory_bandwidth(GB/s),870,867.06,int8,cuda,NVIDIA PG509-210,False
gemv,memory_bandwidth(GB/s),990,1092.43,bfloat16,cuda,NVIDIA PG509-210,False
layer_norm,memory_bandwidth(GB/s),950,573.57,bfloat16,cuda,NVIDIA PG509-210,False
Llama-2-7b-chat-hf,token_per_sec,144,87.17,autoquant,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,1151.86,autoquant,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),136,109.31,autoquant,cuda,NVIDIA PG509-210,True
gather_gemv,memory_bandwidth(GB/s),990,945.38,int8,cuda,NVIDIA PG509-210,False
gather_gemv,memory_bandwidth(GB/s),1060,1188.29,bfloat16,cuda,NVIDIA PG509-210,False
mlp_layer_norm_gelu,flops_utilization,0.8,0.82,bfloat16,cuda,NVIDIA PG509-210,False
Llama-2-7b-chat-hf,token_per_sec,94,87.41,bfloat16,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,1155.05,bfloat16,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),133,48.08,bfloat16,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,token_per_sec,175,85.90,autoquant,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,1106.37,autoquant,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,compilation_time(s),133,47.32,autoquant,cuda,NVIDIA PG509-210,True
```
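
Each CSV row pairs a dashboard target with the measured value, so regressions can be flagged mechanically. A hypothetical sketch of such a check (the real dashboard ingestion lives elsewhere; lower-is-better metrics such as compilation_time(s) would need the comparison inverted):
```
import csv

with open("gpt_fast_benchmark.csv") as f:
    for row in csv.DictReader(f):
        target, actual = float(row["target"]), float(row["actual"])
        # Flag higher-is-better metrics that fall short of their target.
        if actual < target:
            print(f"{row['name']} {row['metric']} ({row['dtype']}): {actual} < {target}")
```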
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140627
Approved by: https://github.com/huydhn
2024-11-27 21:57:48 +00:00
distributed [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577) 2024-10-11 18:30:26 +00:00
dynamo Enable autograd cache on inductor tests (#140890) 2024-11-27 20:41:43 +00:00
fastrnns [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577) 2024-10-11 18:30:26 +00:00
framework_overhead_benchmark [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
functional_autograd_benchmark [BE]: Add better optional typing (#138426) 2024-10-27 14:19:00 +00:00
fuser [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
gpt_fast Extend gpt-fast LLM dashboard to support torchao autoquant (#140627) 2024-11-27 21:57:48 +00:00
inference [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
instruction_counts [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577) 2024-10-11 18:30:26 +00:00
nested Apply UFMT to all files in benchmarks/ (#105928) 2023-07-26 01:18:48 +00:00
operator_benchmark [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577) 2024-10-11 18:30:26 +00:00
overrides_benchmark [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
profiler_benchmark [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
record_function_benchmark [Caffe2]Remove Caffe2 scripts and benchmarks (#126747) 2024-06-05 23:46:31 +00:00
serialization [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
sparse [sparse] add extra options to _cslt_spare_mm (#137427) 2024-11-27 05:32:45 +00:00
static_runtime [9/N] Replace c10::optional with std::optional (#130674) 2024-07-15 00:48:43 +00:00
tensorexpr [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
transformer [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577) 2024-10-11 18:30:26 +00:00
compare-fastrnn-results.py [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754) 2024-07-17 14:34:42 +00:00
compare.sh
README.md Add more child links to benchmark readme (#104627) 2023-07-06 12:11:00 +00:00
upload_scribe.py Apply UFMT to all files in benchmarks/ (#105928) 2023-07-26 01:18:48 +00:00

PyTorch Benchmarks

This folder contains scripts that produce reproducible timings of various PyTorch features.

It also provides mechanisms to compare PyTorch with other frameworks.

Setup environment

Make sure you're on a machine with CUDA available. Install PyTorch and torchvision in the following order:

# Install torchvision along with the pytorch stable release binary
conda install pytorch torchvision -c pytorch

# Install the latest pytorch from source.
# It should supersede the pytorch installation from the release binary.
cd $PYTORCH_HOME
python setup.py build develop

# Check the installed pytorch version
python -c "import torch; print(torch.__version__)"

Benchmark List

Refer to each subfolder to discover its benchmark suite. Links are provided where descriptions exist: