Commit Graph

6 Commits

Author SHA1 Message Date
Alnis Murtovi
add0f0085c AutoHeuristic: Support ranking/pruning choices (#131705)
This PR adds support in train_decision if one wants to learn a heuristic for ranking. The main idea is that the user has to provide a number of choices the heuristic should return. I added a way to prune the learned decision tree such that it always returns the number of choices provided by the user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705
Approved by: https://github.com/eellison
2024-08-16 01:20:52 +00:00
Alnis Murtovi
48929184e9 AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistenly performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achived by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely bad. In one case one config, that usually performs very well, was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
2024-08-02 13:54:37 +00:00
Oguz Ulgen
a6985c09cb Add None return type to init -- functorch and torchgen (#132351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132351
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335
2024-08-01 15:26:45 +00:00
PyTorch MergeBot
a28cda11ef Revert "AutoHeuristic: mixed_mm heuristic for A100 (#131613)"
This reverts commit 344c15a0bb.

Reverted https://github.com/pytorch/pytorch/pull/131613 on behalf of https://github.com/AlnisM due to lintrunner issues ([comment](https://github.com/pytorch/pytorch/pull/131613#issuecomment-2261884149))
2024-08-01 03:22:11 +00:00
Alnis Murtovi
344c15a0bb AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistenly performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achived by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely bad. In one case one config, that usually performs very well, was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
2024-08-01 02:25:54 +00:00
Alnis Murtovi
d3cefc9e3a AutoHeuristic: Collect data for mixed_mm (#131611)
This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things:

Move pad_mm related AutoHeuristic files into subdirectory
Introduce an interface benchmark_runner.py that can be subclassed to introduce new scripts to run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py).
The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
2024-07-31 20:45:45 +00:00