pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	dbb55b448b	Revert "[7/N] Fix Wextra-semi warning (#140225 )" This reverts commit `ffb979032d`. Reverted https://github.com/pytorch/pytorch/pull/140225 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/140225#issuecomment-2469312229))	2024-11-12 00:02:06 +00:00
cyy	ffb979032d	[7/N] Fix Wextra-semi warning (#140225 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140225 Approved by: https://github.com/ezyang	2024-11-10 14:28:10 +00:00
xinan.lin	191971e01d	[AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c_shim for XPU. (#136742 ) [AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c shim for XPU. ### Motivation The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. A backend may also have out-of-tree ops that are not registered in that file but are registered externally, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In that case, the existing codegen cannot extend the c shims already produced for the in-tree ops with c shims for the out-of-tree ops. ### Design To extend a backend's c shim with more ops from out-of-tree, this PR provides a bool option `--aoti-extend` to indicate that the codegen is to extend the c shim from out-of-tree. The generated c shim is stored in the `extend` subdirectory, for example: ``` torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp ``` Example usage: `python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim` `--xpu`: generate c shim for XPU `--aoti-extend`: the out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`) extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`) `--update-aoti-c-shim`: always generate c_shim_xpu.h for the extend c_shim. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742 Approved by: https://github.com/EikanWang, https://github.com/desertfire ghstack dependencies: #139025	2024-11-09 13:19:52 +00:00
xinan.lin	929a647363	[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025 ) [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops. Motivation: There are two groups of ATen ops for XPU: in-tree ops, such as the GEMM-related ops, and out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well. At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire	2024-11-09 13:09:27 +00:00
cyy	419a7e197d	[6/N] Fix Wextra-semi warning (#139605 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605 Approved by: https://github.com/ezyang	2024-11-04 13:43:16 +00:00
angelayi	8c22e09e39	[aoti] Add masked_select to cshim (#139071 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139071 Approved by: https://github.com/desertfire	2024-10-31 21:52:53 +00:00
Wu, Chunyuan	9af1816974	[AOTI] add C shim for _weight_int8pack_mm (#138691 ) Fixes the error of running WOQ-INT8 LLaMA: ``` E In file included from /home/user/inductor/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3, E from /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:4: E /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp: In function ‘void inductor_entry_impl(AtenTensorOpaque, AtenTensorOpaque)’: E /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:117:33: error: ‘aoti_torch_cpu__weight_int8pack_mm’ was not declared in this scope E 117 \| AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__weight_int8pack_mm(convert_arrayref_tensor_to_tensor(arg8_1), _frozen_param0, _frozen_param1, &buf0_handle)); E \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138691 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire	2024-10-29 13:53:36 +00:00
Richard Barnes	068f7e7a78	torch::optional -> std::optional (#138987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138987 Approved by: https://github.com/Skylion007	2024-10-28 19:09:46 +00:00
Richard Barnes	42994234a6	std::value/std::type -> std::_v/std::_t (#138746 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746 Approved by: https://github.com/cyyever, https://github.com/malfet	2024-10-26 20:59:24 +00:00
Richard Barnes	dbf0fa811a	Remove C10_HOST_CONSTEXPR_EXCEPT_WIN_CUDA and CONSTEXPR_EXCEPT_WIN_CUDA (#138479 ) BC linter suppressed due to removal of `tools/linter/adapters/constexpr_linter.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138479 Approved by: https://github.com/eqy, https://github.com/malfet	2024-10-24 07:51:05 +00:00
Aaron Gokaslan	195d0a666b	[BE][Ez]: Use interned hardcoded string FURB156 (#138330 ) Uses string constants from string module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138330 Approved by: https://github.com/albanD	2024-10-18 18:26:16 +00:00
Edward Yang	b14269dcfb	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) (#138155 ) Summary: - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Original pull request: https://github.com/pytorch/pytorch/pull/136519 Test Plan: contbuild & OSS CI, see `4a8e49389c` Reviewed By: malfet Differential Revision: D64471142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155 Approved by: https://github.com/malfet, https://github.com/bobrenjc93	2024-10-17 20:58:56 +00:00
PyTorch MergeBot	d4d687ffb2	Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519 )" This reverts commit `4a8e49389c`. Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))	2024-10-15 17:19:16 +00:00
Wang, Eikan	5689e33cfe	[Intel GPU] Fix Windows linkage issue due to invisible structured kernel symbols (#137794 ) The Intel GPU ATen library (libtorch_xpu) utilizes `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control their visibility, while `TORCH_API` is controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because the macro does not serve only the `TORCH_API` semantic. This means that the semantic of `TORCH_API` is symbol `hidden`. https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99 Therefore, we need to use `TORCH_XPU_API` to decorate the generated structured kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794 Approved by: https://github.com/atalman ghstack dependencies: #137873	2024-10-15 15:31:37 +00:00
FFFrog	4a8e49389c	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) ---- - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey	2024-10-13 12:38:02 +00:00
Xuehai Pan	267f82b860	[BE] Format `.ci/` / `.github/` / `benchmarks/` / `functorch/` / `tools/` / `torchgen/` with `ruff format` (#132577 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577 Approved by: https://github.com/malfet	2024-10-11 18:30:26 +00:00
PyTorch MergeBot	079f909263	Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519 )" This reverts commit `be0b75256a`. Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))	2024-10-10 18:32:17 +00:00
FFFrog	be0b75256a	Make Context to be Device-agnostic Step by Step (1/N) (#136519 ) - make init to be device-agnostic and move it to AcceleratorHooksInterface - refactoring context related to device initialization Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519 Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey	2024-10-09 02:13:36 +00:00
PyTorch MergeBot	7e8dace0de	Revert "[ROCm] remove caffe2 from hipify (#137157 )" This reverts commit `40d8260745`. Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))	2024-10-08 17:45:45 +00:00
Jeff Daily	40d8260745	[ROCm] remove caffe2 from hipify (#137157 ) - Remove all "MasqueradingAsCUDA" files and classes. - Do not rename "CUDA" classes to "HIP". Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157 Approved by: https://github.com/eqy	2024-10-05 12:48:54 +00:00
Tarun Karuturi	f42f63ee86	Add option to disable operator profiling (#136838 ) Summary: X-link: https://github.com/pytorch/executorch/pull/5720 For smaller models the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly) so we provide users an option to disable op profiling and essentially only profile the important events such as inference execution time. To disable operator profiling users need to do: ``` etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling); ``` Test Plan: Added test case. Differential Revision: D61883224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838 Approved by: https://github.com/dbort	2024-10-04 22:56:00 +00:00
Bin Bao	15c3479db7	[AOTI] Fix _scaled_mm ABI-compatible codegen (#137132 ) Summary: Similar to https://github.com/pytorch/pytorch/pull/137008, but for supporting _scaled_mm in the ABI-compatible mode. Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137132 Approved by: https://github.com/chenyang78 ghstack dependencies: #137008	2024-10-04 14:05:18 +00:00
ZhiweiYan-96	a7a53b796b	[Intel GPU]device guard codegen for XPU (#133980 ) This PR is a supplement to #130082. The previous PR #130082 implemented the basic codegen functionality, but we found that it fails to handle the device sameness check in many UTs. The current PR facilitates XPU device guard code generation. With the current PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see that the device guard is successfully generated. ```c++ namespace { at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) { std::optional<Device> common_device = std::nullopt; (void)common_device; // Suppress unused variable warning c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out"); c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean"); const OptionalDeviceGuard device_guard(device_of(out)); return at::native::normal_out(mean, std, generator, out); } } // anonymous namespace ``` Without the current change, however, the generated code is ```c++ namespace { at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) { // No device check // DeviceGuard omitted return at::native::normal_out(mean, std, generator, out); } } // anonymous namespace ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980 Approved by: https://github.com/EikanWang, https://github.com/malfet	2024-09-05 01:53:31 +00:00
cyy	1595e755af	[Reland] [Torchgen] Pass mutable to cpp.valuetype_type (#134549 ) Reland of #121415 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134549 Approved by: https://github.com/ezyang	2024-09-01 15:15:38 +00:00
Manuel Candales	caa04e0cae	[ET] codegen: bool array as array ref (#134886 ) Test Plan: CI Differential Revision: D62046959 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134886 Approved by: https://github.com/larryliu0820	2024-09-01 01:33:43 +00:00
Manuel Candales	cae817c862	[ET][CodeGen] Remove TORCH_API from NativeFunctions.h declarations (#134245 ) Summary: Remove TORCH_API from the generated executorch/kernels/portable/NativeFunctions.h declarations These generated declarations are using ET tensors. They don't need to have the TORCH_API macro prefixed to them, since in this case TORCH_API is just empty. See [codegen/macros.h](https://www.internalfb.com/code/fbsource/[d12d7d3accfb12932368e0216124f2d735c51d73]/fbcode/executorch/codegen/macros.h) Test Plan: CI Differential Revision: D61490943 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134245 Approved by: https://github.com/larryliu0820	2024-08-28 19:58:37 +00:00
chilli	938f37b745	Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964 ) Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065, Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964 Approved by: https://github.com/Skylion007	2024-08-22 05:29:49 +00:00
Yidi Wu	6835f20d20	[HOP] support generating schema for hop (#133521 ) Add a way of generating a FunctionSchema from example values because hop's schema varies even for the same hop. We didn't use torch._C.FunctionSchema because we cannot construct the classes directly (e.g. "__init__" cannot be used for torch._C.FunctionSchema). Also extending the Basic types in c++ seems not that easy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521 Approved by: https://github.com/zou3519	2024-08-21 17:34:21 +00:00
Alnis Murtovi	8b8b4e5ae9	AutoHeuristic: documentation for mm (#133611 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133611 Approved by: https://github.com/eellison ghstack dependencies: #131705, #131710, #131714, #133608	2024-08-16 16:20:38 +00:00
Alnis Murtovi	0e0077f3b6	AutoHeuristic: mm ranking heuristic h100 (#133608 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133608 Approved by: https://github.com/eellison ghstack dependencies: #131705, #131710, #131714	2024-08-16 16:20:38 +00:00
Alnis Murtovi	e51c8ad369	AutoHeuristic: Heuristic that ranks choices for mm (#131714 ) This PR adds a heuristic for tuned_mm that predicts the top 10 best choices. To be safe, aten.mm is always included. Perf run: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2008%20Aug%202024%2020%3A20%3A28%20GMT&stopTime=Thu%2C%2015%20Aug%202024%2020%3A20%3A28%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/AlnisM/22/head&lCommit=905826f4ab5344efb0bcaa87e3b27a25299927ab&rBranch=main&rCommit=79ca596dc6ea16b6cdd0f2517451e19840717d37 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131714 Approved by: https://github.com/eellison ghstack dependencies: #131705, #131710	2024-08-16 16:20:38 +00:00
Alnis Murtovi	add0f0085c	AutoHeuristic: Support ranking/pruning choices (#131705 ) This PR adds support in train_decision if one wants to learn a heuristic for ranking. The main idea is that the user has to provide a number of choices the heuristic should return. I added a way to prune the learned decision tree such that it always returns the number of choices provided by the user. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705 Approved by: https://github.com/eellison	2024-08-16 01:20:52 +00:00
Alnis Murtovi	5dfb22d4c8	AutoHeuristic: tests (#133496 ) This PR adds tests to AutoHeuristic that ensure that when existing heuristics are re-generated, the generated code stays the same. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133496 Approved by: https://github.com/eellison	2024-08-15 19:22:44 +00:00
Alnis Murtovi	9876aa39c0	AutoHeuristic: pad_mm documentation (#133411 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133411 Approved by: https://github.com/Chillee ghstack dependencies: #133409, #133410	2024-08-15 10:49:56 +00:00
Alnis Murtovi	f32a9e953f	AutoHeuristic: mixed_mm documentation (#133410 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133410 Approved by: https://github.com/Chillee ghstack dependencies: #133409	2024-08-15 10:49:56 +00:00
Alnis Murtovi	142353eca3	AutoHeuristic: util scripts (#133409 ) This PR introduces scripts that make it easier to use autoheuristic: - `collect_data.sh`: The user can specify things like the number of GPUs to be used and the number of training samples to collect. This script will open one tmux pane per GPU and collect num_training_samples/num_gpus samples per GPU. - `merge_data.py`: This script can be used to merge multiple training data files into a single file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133409 Approved by: https://github.com/Chillee	2024-08-15 10:49:56 +00:00
Alnis Murtovi	448d54ee92	AutoHeuristic: instructions (#132894 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132894 Approved by: https://github.com/Chillee	2024-08-15 04:54:54 +00:00
Yuanhao Ji	378b12f3ad	Improve namespace for `c10::MemoryFormat::Contiguous` in `torchgen/api/cpp.py` (#131622 ) Top-level namespaces are more convenient for out-of-tree device extensions. For example, now we have a patch for it in `torch_npu`: `98c50ced16/codegen/gen_backend_stubs.py (L772-L778)` ```python JIT_TO_CPP_DEFAULT["contiguous_format"] = "c10::MemoryFormat::Contiguous" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131622 Approved by: https://github.com/zou3519	2024-08-14 14:41:01 +00:00
Alnis Murtovi	f1c439cbed	AutoHeuristic: refactoring (#133170 ) This PR refactors train_decision.py and adds some basic logging, which I'll extend in another PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133170 Approved by: https://github.com/Chillee	2024-08-13 01:46:53 +00:00
Alnis Murtovi	21302d5891	AutoHeuristic: script to generate data for mm (#131617 ) This PR introduces a script that can be used to generate training data for tuned_mm in order to learn a heuristic with AutoHeuristic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131617 Approved by: https://github.com/eellison ghstack dependencies: #131615, #131616	2024-08-09 23:49:29 +00:00
Alnis Murtovi	383f2ac914	AutoHeuristic: mixed_mm H100 heuristic (#132685 ) H100 heuristic for mixed_mm. Performance looks similar to A100 heuristic. ``` set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better train entropy 5 0.01 1562 604 145 2311 1.522201 1.077722 10.399141 3.134170 1.034802 2061 2 test entropy 5 0.01 361 164 24 549 1.443590 1.079169 8.159173 3.105360 1.197973 500 2 ``` gpt-fast speedups \|batch size\|prompt length\| fallback \| heuristic \| speedup \| \|----------\|-------------\|------------:\|------------:\|--------:\| \| 1 \| 7 \| 109.95 \| 220.63\| 2 \| \| 1 \| 11 \| 109.65 \| 210.92\| 1.92 \| \| 4 \| 7 \| 149.04 \| 625.80\| 4.19 \| \| 4 \| 11 \| 149.56 \| 494.64\| 3.30 \| \| 8 \| 7 \| 293.68 \| 956.72\| 3.25 \| \| 8 \| 11 \| 294.48 \| 925.60\| 3.14 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/132685 Approved by: https://github.com/eellison	2024-08-07 23:48:01 +00:00
Alnis Murtovi	48929184e9	AutoHeuristic: mixed_mm heuristic for A100 (#131613 ) This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402). This is what the results look like: Explanation of columns: wrong_max_spdup: In the worst case, how much better would the best choice have been wrong_gman_spdup: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean) max_spdup_default: Highest speedup achieved by the learned heuristic over the default choice gman_spdup_default: Geomean speedup achieved by the learned heuristic over the default choice max_slowdown_default: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case non_default_preds: Number of times the learned heuristic predicted a choice that is not the default choice default_better: Number of times the default choice is better than the choice made by the heuristic ``` set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better train entropy 5 0.01 2376 740 323 3439 1.855386 1.063236 11.352318 3.438279 1.022164 3116 2 test entropy 5 0.01 563 183 71 817 1.622222 1.060897 10.084181 3.507741 1.017039 746 2 ``` While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice. I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. 
To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul. \|batch size\|prompt length\| fallback \| heuristic \| speedup \| \|----------\|-------------\|------------:\|------------:\|--------:\| \| 1 \| 7 \| 75.31 tok/s \| 148.83 tok/s\| 1.97 \| \| 1 \| 11 \| 75.99 tok/s \| 148.15 tok/s\| 1.94 \| \| 4 \| 7 \| 103.48 tok/s \| 472.00 tok/s\| 4.56 \| \| 4 \| 11 \| 103.56 tok/s \| 371.36 tok/s\| 3.58 \| \| 8 \| 7 \| 201.92 tok/s \| 813.44 tok/s\| 4.02 \| \| 8 \| 11 \| 201.76 tok/s \| 699.36 tok/s\| 3.46 \| Currently, the heuristic only applies to the following inputs: - m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback) - k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.) - mat1 not transposed - mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613 Approved by: https://github.com/eellison	2024-08-02 13:54:37 +00:00
cyy	b9cb1abf65	[12/N] Use std::optional (#132361 ) Follows #132396 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132361 Approved by: https://github.com/eqy	2024-08-02 13:46:46 +00:00
Oguz Ulgen	a6985c09cb	Add None return type to init -- functorch and torchgen (#132351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132351 Approved by: https://github.com/jamesjwu ghstack dependencies: #132335	2024-08-01 15:26:45 +00:00
PyTorch MergeBot	a28cda11ef	Revert "AutoHeuristic: mixed_mm heuristic for A100 (#131613 )" This reverts commit `344c15a0bb`. Reverted https://github.com/pytorch/pytorch/pull/131613 on behalf of https://github.com/AlnisM due to lintrunner issues ([comment](https://github.com/pytorch/pytorch/pull/131613#issuecomment-2261884149))	2024-08-01 03:22:11 +00:00
Alnis Murtovi	344c15a0bb	AutoHeuristic: mixed_mm heuristic for A100 (#131613 ) This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402). This is what the results look like: Explanation of columns: wrong_max_spdup: In the worst case, how much better would the best choice have been wrong_gman_spdup: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean) max_spdup_default: Highest speedup achieved by the learned heuristic over the default choice gman_spdup_default: Geomean speedup achieved by the learned heuristic over the default choice max_slowdown_default: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case non_default_preds: Number of times the learned heuristic predicted a choice that is not the default choice default_better: Number of times the default choice is better than the choice made by the heuristic ``` set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better train entropy 5 0.01 2376 740 323 3439 1.855386 1.063236 11.352318 3.438279 1.022164 3116 2 test entropy 5 0.01 563 183 71 817 1.622222 1.060897 10.084181 3.507741 1.017039 746 2 ``` While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice. I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. 
To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul. \|batch size\|prompt length\| fallback \| heuristic \| speedup \| \|----------\|-------------\|------------:\|------------:\|--------:\| \| 1 \| 7 \| 75.31 tok/s \| 148.83 tok/s\| 1.97 \| \| 1 \| 11 \| 75.99 tok/s \| 148.15 tok/s\| 1.94 \| \| 4 \| 7 \| 103.48 tok/s \| 472.00 tok/s\| 4.56 \| \| 4 \| 11 \| 103.56 tok/s \| 371.36 tok/s\| 3.58 \| \| 8 \| 7 \| 201.92 tok/s \| 813.44 tok/s\| 4.02 \| \| 8 \| 11 \| 201.76 tok/s \| 699.36 tok/s\| 3.46 \| Currently, the heuristic only applies to the following inputs: - m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback) - k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.) - mat1 not transposed - mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613 Approved by: https://github.com/eellison ghstack dependencies: #131610, #131611	2024-08-01 02:25:54 +00:00
Alnis Murtovi	d3cefc9e3a	AutoHeuristic: Collect data for mixed_mm (#131611 ) This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things: Move pad_mm related AutoHeuristic files into subdirectory Introduce an interface benchmark_runner.py that can be subclassed to introduce new scripts to run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py). The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611 Approved by: https://github.com/eellison ghstack dependencies: #131610	2024-07-31 20:45:45 +00:00
JackCaoG	b40249b462	propagate XLA's metadata after functional sync (#131076 ) Fixes https://github.com/pytorch/xla/issues/7174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131076 Approved by: https://github.com/bdhirsh	2024-07-31 18:20:00 +00:00
Yan Zhiwei	fe4f8e97cd	[Intel GPU] xpu-ops codegen via backend whitelist (#130082 ) # Motivation This PR enhances the codegen so that it can generate code for the XPU backend. XPU operators currently need to be registered by hand. Developers cannot take advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort. We utilize the backend_whitelist argument in `gen.py` to generate the headers and source code needed for XPU. # Usage XPU ops lie in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`. We use the following command to generate XPU operators: `python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU` The difference lies in `backend-whitelist=XPU`. The backend-whitelist key is an existing argument in torchgen. The inputs of `gen.py` are code templates and the operators yaml. We share the same templates in `aten`. A simplified yaml lies in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entry is added; the format is the same as the one in `aten`. # Result All operator headers are generated in `build/xpu/ATen/ops` independently, which does not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers in this folder. 
# Verification * In `third_party/torch-xpu-ops`, we migrate all supported kernels to the structured kernels style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman ghstack dependencies: #130019	2024-07-31 16:31:38 +00:00
Nick Westlake	7124efa81b	Include _native.h for structured_native_functions (#131208 ) In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". This results in the TORCH_API in the headers being ineffective and prevents such functions from being used outside libtorch_cpu.so. This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups". Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208 Approved by: https://github.com/bdhirsh	2024-07-24 02:55:36 +00:00


534 Commits