Commit Graph

534 Commits

Author SHA1 Message Date
PyTorch MergeBot
dbb55b448b Revert "[7/N] Fix Wextra-semi warning (#140225)"
This reverts commit ffb979032d.

Reverted https://github.com/pytorch/pytorch/pull/140225 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/140225#issuecomment-2469312229))
2024-11-12 00:02:06 +00:00
cyy
ffb979032d [7/N] Fix Wextra-semi warning (#140225)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140225
Approved by: https://github.com/ezyang
2024-11-10 14:28:10 +00:00
xinan.lin
191971e01d [AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c_shim for XPU. (#136742)
[AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c shim for XPU.

### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. For the same backend, some out-of-tree ops are registered externally instead, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In that case, the existing codegen cannot extend the already-produced in-tree c shims with shims for those out-of-tree ops.

### Design
To extend the c shim with additional out-of-tree ops for a backend, this PR adds a boolean option `--aoti-extend` to indicate that the codegen is extending the c shim with out-of-tree ops.
The generated c shim is stored in an `extend` subdirectory, for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
Example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim`
- `--xpu`: generate the c shim for XPU
- `--aoti-extend`: extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`) with out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`)
- `--update-aoti-c-shim`: always generate `c_shim_xpu.h` for the extended c shim
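As a rough illustration of the intent (a minimal sketch with a hypothetical helper, not the actual codegen code), the extend option simply routes generated files into the `extend/` subdirectory so they never overwrite the in-tree shims:

```python
import os

def c_shim_output_dir(base_dir: str, aoti_extend: bool) -> str:
    # Hypothetical helper: extended (out-of-tree) shims go into an extend/
    # subdirectory so they do not clash with the in-tree c_shim files.
    return os.path.join(base_dir, "extend") if aoti_extend else base_dir

print(c_shim_output_dir("torch/csrc/inductor/aoti_torch/generated", aoti_extend=True))
# -> torch/csrc/inductor/aoti_torch/generated/extend
```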

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
2024-11-09 13:19:52 +00:00
xinan.lin
929a647363 [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025)
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.

Motivation: ATen ops for XPU come in two parts: in-tree ops, such as the GEMM-related ops, and out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and provides convenient codegen capabilities, we want to take advantage of those benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-11-09 13:09:27 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
angelayi
8c22e09e39 [aoti] Add masked_select to cshim (#139071)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139071
Approved by: https://github.com/desertfire
2024-10-31 21:52:53 +00:00
Wu, Chunyuan
9af1816974 [AOTI] add C shim for _weight_int8pack_mm (#138691)
Fixes the following error when running WOQ-INT8 LLaMA:
```
E           In file included from /home/user/inductor/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3,
E                            from /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:4:
E           /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp: In function ‘void inductor_entry_impl(AtenTensorOpaque**, AtenTensorOpaque**)’:
E           /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:117:33: error: ‘aoti_torch_cpu__weight_int8pack_mm’ was not declared in this scope
E             117 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__weight_int8pack_mm(convert_arrayref_tensor_to_tensor(arg8_1), _frozen_param0, _frozen_param1, &buf0_handle));
E                 |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138691
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-29 13:53:36 +00:00
Richard Barnes
068f7e7a78 torch::optional -> std::optional (#138987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138987
Approved by: https://github.com/Skylion007
2024-10-28 19:09:46 +00:00
Richard Barnes
42994234a6 std::value/std::type -> std::_v/std::_t (#138746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-26 20:59:24 +00:00
Richard Barnes
dbf0fa811a Remove C10_HOST_CONSTEXPR_EXCEPT_WIN_CUDA and CONSTEXPR_EXCEPT_WIN_CUDA (#138479)
BC linter suppressed due to removal of `tools/linter/adapters/constexpr_linter.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138479
Approved by: https://github.com/eqy, https://github.com/malfet
2024-10-24 07:51:05 +00:00
Aaron Gokaslan
195d0a666b [BE][Ez]: Use interned hardcoded string FURB156 (#138330)
Uses string constants from the `string` module.
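For illustration, the kind of change FURB156 suggests (a minimal sketch, not code taken from this PR):

```python
import string

# Before: a hardcoded character-set literal, flagged by FURB156
digits = "0123456789"

# After: reuse the interned constant from the string module
digits = string.digits

assert digits == "0123456789"
```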

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138330
Approved by: https://github.com/albanD
2024-10-18 18:26:16 +00:00
Edward Yang
b14269dcfb Make Context to be Device-agnostic Step by Step (1/N) (#136519) (#138155)
Summary:
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Original pull request: https://github.com/pytorch/pytorch/pull/136519

Test Plan: contbuild & OSS CI, see 4a8e49389c

Reviewed By: malfet

Differential Revision: D64471142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155
Approved by: https://github.com/malfet, https://github.com/bobrenjc93
2024-10-17 20:58:56 +00:00
PyTorch MergeBot
d4d687ffb2 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit 4a8e49389c.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:16 +00:00
Wang, Eikan
5689e33cfe [Intel GPU] Fix Windows linkage issue due to invisible structured kernel symbols (#137794)
The Intel GPU ATen library (libtorch_xpu) uses `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control their visibility, and `TORCH_API` is in turn controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because that macro serves more than the `TORCH_API` semantic. As a consequence, the semantic of `TORCH_API` here is hidden symbol visibility.

https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99

Therefore, we need to use `TORCH_XPU_API` to decorate the produced structured kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
2024-10-15 15:31:37 +00:00
FFFrog
4a8e49389c Make Context to be Device-agnostic Step by Step (1/N) (#136519)
----

- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-13 12:38:02 +00:00
Xuehai Pan
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
PyTorch MergeBot
079f909263 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit be0b75256a.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:17 +00:00
FFFrog
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
PyTorch MergeBot
7e8dace0de Revert "[ROCm] remove caffe2 from hipify (#137157)"
This reverts commit 40d8260745.

Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))
2024-10-08 17:45:45 +00:00
Jeff Daily
40d8260745 [ROCm] remove caffe2 from hipify (#137157)
- Remove all "MasqueradingAsCUDA" files and classes.
- Do not rename "CUDA" classes to "HIP".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157
Approved by: https://github.com/eqy
2024-10-05 12:48:54 +00:00
Tarun Karuturi
f42f63ee86 Add option to disable operator profiling (#136838)
Summary:
X-link: https://github.com/pytorch/executorch/pull/5720

For smaller models, the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly), so we provide users with an option to disable op profiling and essentially only profile the important events, such as inference execution time.

To disable operator profiling, users need to do:
```
etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling);
```

Test Plan: Added test case.

Differential Revision: D61883224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838
Approved by: https://github.com/dbort
2024-10-04 22:56:00 +00:00
Bin Bao
15c3479db7 [AOTI] Fix _scaled_mm ABI-compatible codegen (#137132)
Summary: Similar to https://github.com/pytorch/pytorch/pull/137008, but for supporting _scaled_mm in the ABI-compatible mode.

Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137132
Approved by: https://github.com/chenyang78
ghstack dependencies: #137008
2024-10-04 14:05:18 +00:00
ZhiweiYan-96
a7a53b796b [Intel GPU]device guard codegen for XPU (#133980)
This PR is a supplement to #130082. The previous PR, #130082, implemented the basic codegen functionality, but we found that it fails to handle the device-sameness check in many unit tests. The current PR is aimed at facilitating XPU device guard code generation.

With the current PR, the generated code snippet in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
(void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Without the current change, however, the generated code is:
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
    // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-09-05 01:53:31 +00:00
cyy
1595e755af [Reland] [Torchgen] Pass mutable to cpp.valuetype_type (#134549)
Reland of #121415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134549
Approved by: https://github.com/ezyang
2024-09-01 15:15:38 +00:00
Manuel Candales
caa04e0cae [ET] codegen: bool array as array ref (#134886)
Test Plan: CI

Differential Revision: D62046959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134886
Approved by: https://github.com/larryliu0820
2024-09-01 01:33:43 +00:00
Manuel Candales
cae817c862 [ET][CodeGen] Remove TORCH_API from NativeFunctions.h declarations (#134245)
Summary:
Remove TORCH_API from the generated executorch/kernels/portable/NativeFunctions.h declarations

These generated declarations are using ET tensors. They don't need to have the TORCH_API macro prefixed to them, since in this case TORCH_API is just empty. See [codegen/macros.h](https://www.internalfb.com/code/fbsource/[d12d7d3accfb12932368e0216124f2d735c51d73]/fbcode/executorch/codegen/macros.h)

Test Plan: CI

Differential Revision: D61490943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134245
Approved by: https://github.com/larryliu0820
2024-08-28 19:58:37 +00:00
chilli
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065,
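A minimal sketch of the kind of call these batching rules enable (arbitrary shapes; assumes a PyTorch build that includes this change):

```python
import torch
import torch.nn.functional as F

# Four independent attention problems, each with shape (heads=2, seq=8, dim=16).
q, k, v = (torch.randn(4, 2, 8, 16) for _ in range(3))

# vmap maps over the leading dimension of each input; the batching rules added
# here let torch.vmap batch over the SDPA implementations.
out = torch.vmap(F.scaled_dot_product_attention)(q, k, v)
print(out.shape)  # torch.Size([4, 2, 8, 16])
```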

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
Yidi Wu
6835f20d20 [HOP] support generating schema for hop (#133521)
Add a way of generating a FunctionSchema from example values, because a HOP's schema can vary even for the same HOP.

We didn't use torch._C.FunctionSchema because we cannot construct those classes directly (e.g. `__init__` cannot be used for torch._C.FunctionSchema), and extending the basic types in C++ does not seem easy either.
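A hedged sketch of the general idea (illustrative only, not the PR's actual API; `torch._C.parse_schema` is used here just to turn the derived string into a FunctionSchema):

```python
import torch

def _schema_type(value):
    # Map an example Python value to a schema type string (illustrative subset).
    if isinstance(value, torch.Tensor):
        return "Tensor"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    raise NotImplementedError(f"unsupported example value: {type(value)}")

def schema_from_examples(name, example_args):
    args = ", ".join(f"{_schema_type(a)} arg{i}" for i, a in enumerate(example_args))
    return torch._C.parse_schema(f"{name}({args}) -> Tensor")

print(schema_from_examples("my_hop", (torch.randn(2, 2), 1.0)))
# my_hop(Tensor arg0, float arg1) -> Tensor
```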

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
2024-08-21 17:34:21 +00:00
Alnis Murtovi
8b8b4e5ae9 AutoHeuristic: documentation for mm (#133611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133611
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714, #133608
2024-08-16 16:20:38 +00:00
Alnis Murtovi
0e0077f3b6 AutoHeuristic: mm ranking heuristic h100 (#133608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133608
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714
2024-08-16 16:20:38 +00:00
Alnis Murtovi
e51c8ad369 AutoHeuristic: Heuristic that ranks choices for mm (#131714)
This PR adds a heuristic for tuned_mm that predicts the top 10 best choices. To be safe, aten.mm is always included.
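A small sketch of that safety rule (illustrative only; the helper and choice names are hypothetical):

```python
def final_mm_choices(ranked_choices, k=10, fallback="aten.mm"):
    # Keep the heuristic's top-k ranked choices, but always keep the safe
    # ATen fallback in the autotuning candidate set.
    choices = list(ranked_choices[:k])
    if fallback not in choices:
        choices.append(fallback)
    return choices

print(final_mm_choices(["triton_config_a", "triton_config_b"]))
# ['triton_config_a', 'triton_config_b', 'aten.mm']
```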

Perf run: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2008%20Aug%202024%2020%3A20%3A28%20GMT&stopTime=Thu%2C%2015%20Aug%202024%2020%3A20%3A28%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/AlnisM/22/head&lCommit=905826f4ab5344efb0bcaa87e3b27a25299927ab&rBranch=main&rCommit=79ca596dc6ea16b6cdd0f2517451e19840717d37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131714
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710
2024-08-16 16:20:38 +00:00
Alnis Murtovi
add0f0085c AutoHeuristic: Support ranking/pruning choices (#131705)
This PR adds support in train_decision for learning a heuristic for ranking. The main idea is that the user provides the number of choices the heuristic should return. I added a way to prune the learned decision tree so that it always returns the number of choices provided by the user.
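A rough sketch of the ranking idea on top of a decision tree (illustrative only; AutoHeuristic's actual training and pruning code differs):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: input features (e.g. m, k, n, layout flags) and
# the label of the best-performing choice measured for each input.
X = np.random.rand(500, 4)
y = np.random.randint(0, 6, size=500)

tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=0.01).fit(X, y)

def top_k_choices(features, k=3):
    # Rank choices by predicted probability and return the k most likely ones.
    proba = tree.predict_proba(features.reshape(1, -1))[0]
    return list(tree.classes_[np.argsort(proba)[::-1][:k]])

print(top_k_choices(X[0]))
```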

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705
Approved by: https://github.com/eellison
2024-08-16 01:20:52 +00:00
Alnis Murtovi
5dfb22d4c8 AutoHeuristic: tests (#133496)
This PR adds tests to AutoHeuristic that ensure that when existing heuristics are re-generated, the generated code stays the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133496
Approved by: https://github.com/eellison
2024-08-15 19:22:44 +00:00
Alnis Murtovi
9876aa39c0 AutoHeuristic: pad_mm documentation (#133411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133411
Approved by: https://github.com/Chillee
ghstack dependencies: #133409, #133410
2024-08-15 10:49:56 +00:00
Alnis Murtovi
f32a9e953f AutoHeuristic: mixed_mm documentation (#133410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133410
Approved by: https://github.com/Chillee
ghstack dependencies: #133409
2024-08-15 10:49:56 +00:00
Alnis Murtovi
142353eca3 AutoHeuristic: util scripts (#133409)
This PR introduces scripts that make it easier to use autoheuristic:
- `collect_data.sh`: The user can specify things like the number of GPUs to be used and the number of training samples to collect. This script will open one tmux pane per GPU and collect num_training_samples/num_gpus samples per GPU.
- `merge_data.py`: This script can be used to merge multiple training data files into a single file.
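A minimal sketch of what such a merge step could look like (illustrative; not the actual `merge_data.py`):

```python
import sys
import pandas as pd

# Usage (hypothetical): python merge_example.py run1.csv run2.csv merged.csv
# Concatenates several training-data CSV files into a single output file.
inputs, output = sys.argv[1:-1], sys.argv[-1]
frames = [pd.read_csv(path) for path in inputs]
pd.concat(frames, ignore_index=True).to_csv(output, index=False)
```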

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133409
Approved by: https://github.com/Chillee
2024-08-15 10:49:56 +00:00
Alnis Murtovi
448d54ee92 AutoHeuristic: instructions (#132894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132894
Approved by: https://github.com/Chillee
2024-08-15 04:54:54 +00:00
Yuanhao Ji
378b12f3ad Improve namespace for c10::MemoryFormat::Contiguous in torchgen/api/cpp.py (#131622)
Top-level namespaces are more convenient for out-of-tree device extensions.

For example, now we have a patch for it in `torch_npu`:

98c50ced16/codegen/gen_backend_stubs.py (L772-L778)

```python
JIT_TO_CPP_DEFAULT["contiguous_format"] = "c10::MemoryFormat::Contiguous"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131622
Approved by: https://github.com/zou3519
2024-08-14 14:41:01 +00:00
Alnis Murtovi
f1c439cbed AutoHeuristic: refactoring (#133170)
This PR refactors train_decision.py and adds some basic logging, which I'll extend in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133170
Approved by: https://github.com/Chillee
2024-08-13 01:46:53 +00:00
Alnis Murtovi
21302d5891 AutoHeuristic: script to generate data for mm (#131617)
This PR introduces a script that can be used to generate training data for tuned_mm in order to learn a heuristic with AutoHeuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131617
Approved by: https://github.com/eellison
ghstack dependencies: #131615, #131616
2024-08-09 23:49:29 +00:00
Alnis Murtovi
383f2ac914 AutoHeuristic: mixed_mm H100 heuristic (#132685)
H100 heuristic for mixed_mm. Performance looks similar to the A100 heuristic.
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup  max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     1562    604     145   2311         1.522201          1.077722          10.399141            3.134170              1.034802               2061               2
 test  entropy          5              0.01      361    164      24    549         1.443590          1.079169           8.159173            3.105360              1.197973                500               2
```

gpt-fast speedups
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      |      109.95  |       220.63|  2      |
|     1    |     11      |      109.65  |       210.92|  1.92   |
|     4    |      7      |       149.04 |       625.80|  4.19   |
|     4    |     11      |       149.56 |       494.64|  3.30   |
|     8    |      7      |       293.68 |       956.72|  3.25   |
|     8    |     11      |       294.48 |       925.60|  3.14   |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132685
Approved by: https://github.com/eellison
2024-08-07 23:48:01 +00:00
Alnis Murtovi
48929184e9 AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |
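For reference, the `F.linear` replacement described above looks like this in isolation (hypothetical shapes and dtypes):

```python
import torch

# Stand-in tensors for the int8 weight-only quantization setup.
inp = torch.randn(8, 64, dtype=torch.bfloat16)
weight = torch.randint(-128, 127, (32, 64), dtype=torch.int8)

# Written so that no transpose sits between the dtype cast and the matmul,
# which lets the mixed_mm pattern match.
out = torch.matmul(inp, weight.t().to(dtype=inp.dtype))
print(out.shape)  # torch.Size([8, 32])
```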

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
2024-08-02 13:54:37 +00:00
cyy
b9cb1abf65 [12/N] Use std::optional (#132361)
Follows #132396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132361
Approved by: https://github.com/eqy
2024-08-02 13:46:46 +00:00
Oguz Ulgen
a6985c09cb Add None return type to init -- functorch and torchgen (#132351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132351
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335
2024-08-01 15:26:45 +00:00
PyTorch MergeBot
a28cda11ef Revert "AutoHeuristic: mixed_mm heuristic for A100 (#131613)"
This reverts commit 344c15a0bb.

Reverted https://github.com/pytorch/pytorch/pull/131613 on behalf of https://github.com/AlnisM due to lintrunner issues ([comment](https://github.com/pytorch/pytorch/pull/131613#issuecomment-2261884149))
2024-08-01 03:22:11 +00:00
Alnis Murtovi
344c15a0bb AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
2024-08-01 02:25:54 +00:00
Alnis Murtovi
d3cefc9e3a AutoHeuristic: Collect data for mixed_mm (#131611)
This PR introduces a script that can be used to collect data for mixed_mm in order to learn a heuristic with AutoHeuristic. This PR also includes the following:

- Move pad_mm-related AutoHeuristic files into a subdirectory.
- Introduce an interface, benchmark_runner.py, that can be subclassed to add new scripts that run benchmarks and collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py).

The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
2024-07-31 20:45:45 +00:00
JackCaoG
b40249b462 propagate XLA's metadata after functional sync (#131076)
Fixes https://github.com/pytorch/xla/issues/7174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131076
Approved by: https://github.com/bdhirsh
2024-07-31 18:20:00 +00:00
Yan Zhiwei
fe4f8e97cd [Intel GPU] xpu-ops codegen via backend whitelist (#130082)
# Motivation

This PR enhances the codegen to allow generating code for the XPU backend.

XPU operators currently need to be registered by hand. Developers have no chance to take advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort.

We utilize the backend_whitelist argument in `gen.py` to generate the headers and source files that XPU needs.

# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.

We use the following command to generate XPU operators:

` python -m torchgen.gen --source-path path/to/yaml/of/xpu   --install-dir  build/xpu    --per-operator-headers    --static-dispatch-backend     --backend-whitelist=XPU`

The difference lies in `backend-whitelist=XPU`. The backend-whitelist key is an existing argument in torchgen.

The inputs to `gen.py` are code templates and an operator yaml. We share the same templates as `aten`. A simplified yaml lives in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`: no extra entries are added, and the format is the same as the one in `aten`.

# Result

All operator headers are generated independently in `build/xpu/ATen/ops`, which does not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers from this folder.

# Verification

* In `third_party/torch-xpu-ops`, we migrate all supported kernels to the structured kernel style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have unit-test verification based on `test_ops.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
2024-07-31 16:31:38 +00:00
Nick Westlake
7124efa81b Include _native.h for structured_native_functions (#131208)
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". However, this results in the TORCH_API in the headers being ineffective and prevents such functions from being used outside libtorch_cpu.so.

This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".
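Conceptually, the change amounts to emitting one `*_native.h` include per structured operator group, as is already done for view groups. A minimal sketch of that include generation (illustrative only, not gen.py's actual code):

```python
def native_header_includes(root_names):
    # One *_native.h include per operator group, mirroring what
    # CompositeViewCopyKernels.cpp already receives for view_groups.
    return "\n".join(
        f"#include <ATen/ops/{name}_native.h>" for name in sorted(set(root_names))
    )

print(native_header_includes(["view_copy", "add", "add"]))
```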

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh
2024-07-24 02:55:36 +00:00