Add a way of generating a FunctionSchema from example values, because a hop's schema varies even for the same hop.
We didn't use torch._C.FunctionSchema because we cannot construct its classes directly (e.g. `__init__` cannot be used for torch._C.FunctionSchema). Also, extending the basic types in C++ does not seem easy.
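A minimal sketch of the idea (not the actual implementation added in this PR; the helper name is hypothetical): infer a FunctionSchema-like string from example values, since a hop's schema can differ from call to call.
```python
from typing import Any, Tuple

import torch

def schema_from_examples(op_name: str, example_args: Tuple[Any, ...]) -> str:
    def type_of(v: Any) -> str:
        if isinstance(v, torch.Tensor):
            return "Tensor"
        if isinstance(v, bool):    # check bool before int (bool is a subclass of int)
            return "bool"
        if isinstance(v, int):
            return "int"
        if isinstance(v, float):
            return "float"
        if isinstance(v, (list, tuple)) and all(isinstance(x, torch.Tensor) for x in v):
            return "Tensor[]"
        return "Any"

    args = ", ".join(f"{type_of(v)} arg{i}" for i, v in enumerate(example_args))
    return f"{op_name}({args}) -> Tensor"

# schema_from_examples("my_hop", (torch.randn(2), 3)) == "my_hop(Tensor arg0, int arg1) -> Tensor"
```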
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
This PR adds support to train_decision for learning a heuristic that ranks choices. The main idea is that the user provides the number of choices the heuristic should return. I added a way to prune the learned decision tree so that it always returns the number of choices provided by the user.
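A hedged sketch of the ranking idea (an assumed interface, not the actual train_decision code; the real change prunes the tree itself, whereas this sketch simply takes the top-k classes at prediction time):
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def predict_ranking(clf: DecisionTreeClassifier, features: np.ndarray, num_choices: int):
    proba = clf.predict_proba(features)                 # shape: (n_samples, n_classes)
    top_k = np.argsort(-proba, axis=1)[:, :num_choices] # indices of the best num_choices classes
    return [[clf.classes_[j] for j in row] for row in top_k]
```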
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705
Approved by: https://github.com/eellison
This PR introduces scripts that make it easier to use autoheuristic:
- `collect_data.sh`: The user can specify things like the number of GPUs to be used and the number of training samples to collect. This script will open one tmux pane per GPU and collect num_training_samples/num_gpus samples per GPU.
- `merge_data.py`: This script can be used to merge multiple training data files into a single file.
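For illustration, a merge step along these lines could look as follows (a minimal sketch of the assumed behavior, not necessarily the exact merge_data.py implementation):
```python
import sys
import pandas as pd

def merge(files, out):
    # concatenate all per-GPU CSV files, keeping a single header
    frames = [pd.read_csv(f) for f in files]
    pd.concat(frames, ignore_index=True).to_csv(out, index=False)

if __name__ == "__main__":
    # e.g. python merge_data.py gpu0.csv gpu1.csv merged.csv
    merge(sys.argv[1:-1], sys.argv[-1])
```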
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133409
Approved by: https://github.com/Chillee
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).
This is what the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better
train entropy 5 0.01 2376 740 323 3439 1.855386 1.063236 11.352318 3.438279 1.022164 3116 2
test entropy 5 0.01 563 183 71 817 1.622222 1.060897 10.084181 3.507741 1.017039 746 2
```
While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.
I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
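Below is a hedged sketch of that replacement (the real gpt-fast module has more to it; the `scales` handling and buffer setup here are assumptions for illustration):
```python
import torch

class Int8Linear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.register_buffer("weight", torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8))
        self.register_buffer("scales", torch.ones(out_features))

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # before: the transpose inside F.linear sits between the cast and the matmul,
        # so inductor's mixed_mm pattern does not match:
        #   return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales
        # after: explicit matmul against the transposed weight, cast applied after the transpose
        return torch.matmul(input, self.weight.t().to(dtype=input.dtype)) * self.scales
```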
|batch size|prompt length| fallback | heuristic | speedup |
|----------|-------------|------------:|------------:|--------:|
| 1 | 7 | 75.31 tok/s | 148.83 tok/s| 1.97 |
| 1 | 11 | 75.99 tok/s | 148.15 tok/s| 1.94 |
| 4 | 7 | 103.48 tok/s | 472.00 tok/s| 4.56 |
| 4 | 11 | 103.56 tok/s | 371.36 tok/s| 3.58 |
| 8 | 7 | 201.92 tok/s | 813.44 tok/s| 4.02 |
| 8 | 11 | 201.76 tok/s | 699.36 tok/s| 3.46 |
Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things:
- Move pad_mm related AutoHeuristic files into a subdirectory
- Introduce an interface benchmark_runner.py that can be subclassed to introduce new scripts to run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py).
The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic.
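A rough sketch of what such an interface might look like (class and method names here are assumptions, not the actual benchmark_runner.py API):
```python
from abc import ABC, abstractmethod

class BenchmarkRunner(ABC):
    """Subclass this to collect AutoHeuristic training data for a new optimization."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def create_input(self) -> tuple:
        """Return one randomly generated benchmark input (e.g. shapes and dtypes)."""

    @abstractmethod
    def run_benchmark(self, *inp) -> None:
        """Run the compiled op so AutoHeuristic records feedback for each choice."""

    def run(self, num_samples: int) -> None:
        for _ in range(num_samples):
            self.run_benchmark(*self.create_input())
```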
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
# Motivation
This PR intends to enhance the codegen to allow generating code for the XPU backend.
XPU operators currently need to be registered by hand. Developers cannot take advantage of the shared code that handles tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort.
We utilize the backend_whitelist argument in `gen.py` to generate the headers and source files needed by XPU.
# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.
We use the following command to generate XPU operators:
` python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU`
The difference lies in `backend-whitelist=XPU`. The backend-whitelist key is an existing argument in torchgen.
The inputs to `gen.py` are code templates and an operator YAML file. We share the same templates as `aten`. A simplified YAML file lives in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This YAML file is a copy-and-modify of `native_functions.yaml`. No extra entries are added; the format is the same as the one in `aten`.
# Result
All operator headers are generated independently in `build/xpu/ATen/ops`, so they do not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers from this folder.
# Verification
* In `third_party/torch-xpu-ops`, we migrate all supported kernels to the structured kernel style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". However, this results in the TORCH_API in the headers being ineffective and prevents such functions from being used outside libtorch_cpu.so.
This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh
While for optimizations like pad_mm there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`).
In this PR, I also try to replace "choice" and "feedback" with global constants CHOICE_COL and FEEDBACK_COL.
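An illustrative sketch of the idea, using the constants mentioned above:
```python
import pandas as pd

CHOICE_COL = "choice"
df = pd.DataFrame({CHOICE_COL: ["triton_config_a", "fallback", "triton_config_a"]})
available_choices = sorted(df[CHOICE_COL].unique())  # derived from the data, not stored as metadata
```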
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304
Approved by: https://github.com/eellison
Previously, AutoHeuristic could globally only either collect data or use a heuristic, regardless of where it was used. This PR makes it possible to collect data for some optimizations while using a learned heuristic for other optimizations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130245
Approved by: https://github.com/shunting314
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating code for the regression tree.
The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code.
The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set.
The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.
On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.
The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.
The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
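A hedged sketch of the thresholding idea described above (names are hypothetical; the real heuristic is generated code):
```python
def pick_choice(score_pad: float, score_no_pad: float, threshold: float = 1.702530) -> str:
    better = max(score_pad, score_no_pad)
    worse = min(score_pad, score_no_pad)
    if worse <= 0 or better / worse < threshold:
        return "unsure"  # not confident enough, fall back to autotuning
    return "pad" if score_pad > score_no_pad else "no_pad"
```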
A100
```
max_depth min_samples_leaf dataset correct wrong unsure total max_wrong_speedup gman_wrong_speedup threshold
15 5.0 10 train 2730 4 3023 5757 1.372220 1.193873 1.702530
16 5.0 10 val 878 0 1042 1920 NaN NaN 1.702530
17 5.0 10 test 925 2 993 1920 1.741708 1.354954 1.702530
18 5.0 10 hf-train 14 0 22 36 NaN NaN 1.702530
19 5.0 10 hf-inf 7 0 1 8 NaN NaN 1.702530
```
The numbers for huggingface only include tensors where each dim is >=512. If all tensors had been included, there would have been the following number of matmuls where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10
## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").
I ran huggingface locally, running each model 5 times, and took the median speedup and compilation_latency.
Results on huggingface training
```
name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision
BartForCausalLM 1.19 (+/- 0.00) 1.19 (+/- 0.00) -0.00 40.33 (+/- 1.13) 40.95 (+/- 0.78) -0.62 1.52 3 2
BartForConditionalGeneration 1.53 (+/- 0.06) 1.47 (+/- 0.05) 0.06 81.93 (+/- 5.20) 82.23 (+/- 1.92) -0.30 0.36 3 1
BlenderbotSmallForCausalLM 1.86 (+/- 0.04) 1.86 (+/- 0.00) 0.00 36.76 (+/- 0.49) 37.62 (+/- 1.33) -0.87 2.31 3 2
CamemBert 2.36 (+/- 0.01) 2.35 (+/- 0.01) 0.01 97.60 (+/- 1.91) 98.69 (+/- 1.35) -1.09 1.11 2 1
DistillGPT2 2.57 (+/- 0.01) 2.57 (+/- 0.01) 0.00 57.33 (+/- 0.77) 58.26 (+/- 1.41) -0.93 1.59 3 2
PLBartForCausalLM 2.07 (+/- 0.01) 2.06 (+/- 0.01) 0.01 32.54 (+/- 0.83) 34.65 (+/- 0.71) -2.11 6.10 3 2
PLBartForConditionalGeneration 1.87 (+/- 0.00) 1.88 (+/- 0.00) -0.01 58.45 (+/- 1.24) 58.95 (+/- 1.92) -0.50 0.85 3 1
RobertaForCausalLM 2.39 (+/- 0.01) 2.40 (+/- 0.01) -0.01 97.38 (+/- 1.52) 97.69 (+/- 1.18) -0.31 0.32 2 1
TrOCRForCausalLM 1.70 (+/- 0.00) 1.70 (+/- 0.00) -0.00 44.79 (+/- 1.33) 45.25 (+/- 1.08) -0.46 1.01 3 2
Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```
Results on huggingface inference
```
name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision
BartForCausalLM 1.11 (+/- 0.00) 1.11 (+/- 0.00) 0.00 19.02 (+/- 0.28) 19.40 (+/- 0.35) -0.38 1.95 3 2
BartForConditionalGeneration 1.26 (+/- 0.01) 1.23 (+/- 0.03) 0.03 36.84 (+/- 0.40) 36.55 (+/- 0.75) 0.30 -0.81 3 1
BlenderbotSmallForCausalLM 1.87 (+/- 0.02) 1.87 (+/- 0.01) 0.00 17.53 (+/- 0.31) 18.03 (+/- 0.43) -0.49 2.74 3 2
DistillGPT2 2.50 (+/- 0.02) 2.50 (+/- 0.01) 0.00 16.16 (+/- 0.29) 16.40 (+/- 0.18) -0.24 1.46 3 2
PLBartForCausalLM 1.93 (+/- 0.01) 1.94 (+/- 0.01) -0.00 15.30 (+/- 0.22) 16.01 (+/- 0.71) -0.71 4.43 3 2
PLBartForConditionalGeneration 1.98 (+/- 0.01) 1.98 (+/- 0.01) 0.00 25.90 (+/- 0.32) 26.58 (+/- 0.62) -0.67 2.53 3 1
TrOCRForCausalLM 1.61 (+/- 0.00) 1.62 (+/- 0.00) -0.01 21.38 (+/- 0.37) 21.85 (+/- 0.16) -0.47 2.16 3 2
Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```
For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
The default value of `dims` for `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`; there must be no space after the `,` in `[0,1]`.
5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)
Then the default value is formatted as `(0,1)` in `pyi` files. This PR manually adds an extra whitespace when re-rendering the default value to a string:
```python
", ".join(string.split(","))
```
```python
# before
def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ...
# after
def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884
Approved by: https://github.com/ezyang
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.
What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...
Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
Changes:
1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()`, using `TypeGuard`.
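A hedged sketch of what a TypeGuard-based annotation looks like (the real signatures in torch may differ in details):
```python
from typing import Any
from typing_extensions import TypeGuard  # typing.TypeGuard on Python >= 3.10
import torch

def is_tensor(obj: Any, /) -> TypeGuard[torch.Tensor]:
    """True if obj is a torch.Tensor; lets type checkers narrow obj's type."""
    return isinstance(obj, torch.Tensor)
```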
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.
I added a number of ops to the blocklist:
```
+ "_nested_tensor_storage_offsets",
+ "_nested_get_values", # no CPU backend
+ "_nested_get_values_copy", # no CPU backend
+ "_nested_view_from_jagged", # testing needs to be patched
+ "_nested_view_from_jagged_copy", # testing needs to be patched
+ "_nested_view_from_buffer", # testing needs to be patched
+ "_nested_view_from_buffer_copy", # testing needs to be patched
+ "_int_mm", # testing needs to be patched
+ "_to_sparse_csc", # testing needs to be patched
+ "_to_sparse_csr", # testing needs to be patched
+ "segment_reduce", # testing needs to be patched
```
Most of these are added just because testing doesn't work right now.
Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.
Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
This PR adds _foreach_max support, the second reduction foreach op we have :D
I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed, as I've added a skip list for the stricter requirement I am introducing (that the argument lists must match in length). I needed to add this requirement because another max overload (the one that does take in a dim for reduction) kept getting matched first.
Caveats!
- We do not fast path if the usual foreach requirements (matching shapes, dtypes, device) are not met; we fall back to the slow path!
- MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.
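A small usage example (illustrative; dtypes and shapes chosen arbitrarily):
```python
import torch

tensors = [torch.randn(3, 4), torch.randn(5), torch.full((2,), 7.0)]
result = torch._foreach_max(tensors)  # equivalent to [t.max() for t in tensors]
```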
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
Summary: When looking up what backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. The previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, and that's why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125962
Approved by: https://github.com/chenyang78
Fix: #125387
This PR helps keep track of whether an instantiated `ViewMeta` has symbolic values as
input or not. This is used for checking whether to use the AOTAutograd `ViewMeta`-replay
execution path, which doesn't support tensors that have `ViewMeta` with symbolic inputs.
In summary, the changes are:
- Add the field `ViewMeta::has_symbolic_inputs` and make it a required constructor
parameter
- Add the field `FunctionalTensorWrapper::is_symbolic_` and the method
`FunctionalTensorWrapper::maybe_mark_symbolic`
- Marks a `FunctionalTensorWrapper` as symbolic iff any of its `ViewMeta` have
symbolic inputs
- Add the plumbing of `FunctionalTensorWrapper::is_symbolic` to the Python API
- Codegen the computation of `ViewMeta::has_symbolic_inputs` for each view operation
- Use the AOTAutograd `ViewMeta`-replay path if:
- `target_functional_tensor` is not `None`; and
- `target_functional_tensor` is not symbolic (instead of using a functorch config)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125876
Approved by: https://github.com/ezyang
This is a subset of changes extracted from https://github.com/pytorch/pytorch/pull/124683/
This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed:
* When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols)
* Benchmark generation code doesn't work with unbacked symints as we have no hints to actually feed in real values. So I pick a random number and you are expected to fix it if it doesn't work
* Need to make sure we don't install dependencies on unbacked SymInt inputs, that puts us down the "promptly deallocate the input" path, but that's pointless for unbacked SymInt
Fixes https://github.com/pytorch/pytorch/issues/124652
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124739
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316, #124394
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime errors, value errors, type errors, or some other more specific error. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.
I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed overtime and our exception typing can be improved.
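An illustrative example of what the rule encourages:
```python
def check_positive(x: int) -> None:
    if x <= 0:
        # flagged by the rule: raise Exception("x must be positive")
        raise ValueError("x must be positive")  # a specific exception type instead
```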
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
Automatic fixes that replace certain list comprehensions with generator expressions where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419 and it was automatically applied.
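An illustrative example of the kind of rewrite C419 performs:
```python
values = [3, -1, 4, -1, 5]
has_negative_before = any([v < 0 for v in values])  # builds an intermediate list first
has_negative_after = any(v < 0 for v in values)     # generator is consumed immediately
```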
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
Fixes https://github.com/pytorch/pytorch/issues/104505
I was originally going to ban all usages of as_strided + mutation in functionalization. But I'm pretty sure that as_strided + mutation is fine when we are calling as_strided on a base tensor.
So in this PR I added a slightly more conservative check: if we see an as_strided + mutation, where the input to an as_strided was **another** view op, then I error loudly in functionalization and link to the github issue above (in case anyone runs into this in the real world)
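A minimal illustration of the flagged pattern (in plain eager mode this just runs; the new loud error is raised when functionalization encounters it):
```python
import torch

base = torch.zeros(4)
v = base[1:]                   # v is itself a view of base
w = v.as_strided((2,), (1,))   # as_strided called on another view, not on the base
w.add_(1)                      # mutation through that as_strided view
```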
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122502
Approved by: https://github.com/ezyang, https://github.com/albanD
Make it easier to serialize patterns by adding `pattern_matcher.gen_register_replacement()` which is like `pattern_matcher.register_replacement()` but also requires the replacement to be precompiled.
To precompile patterns (and save to disk) run:
```
torchgen/fuse_attention_patterns/gen_attention_patterns.py
```
- Updated the sfdp patterns to use `gen_register_replacement`.
- Add serialized patterns for mm_pattern and bmm_pattern (The 'misc' patterns don't serialize cleanly so can't be added).
- Updated the testing so it checked the round-trip patterns match and not just that it serialized the same way.
- Checking that the patterns round-trip properly found that the `users` field wasn't being serialized properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121313
Approved by: https://github.com/eellison
Given the following code/dynamo graph:
```
class GraphModule(torch.nn.Module):
    def forward(self, L_x_ : torch.Tensor):
        l_x_ = L_x_
        _print = torch.ops.aten._print('moo')
        res = l_x_ + l_x_; l_x_ = None
        _print_1 = torch.ops.aten._print('moo')
        return (res,)
```
AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output:
```
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"):
        with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo'); arg0_1 = None
        getitem: "f32[0]" = with_effects[0]; with_effects = None
        add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None
        with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None
        getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None
        return (getitem_2, add)
```
However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators.
This has to be done *after* the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph.
```
class <lambda>(torch.nn.Module):
    def forward(self, arg1_1: "f32[2, 3]"):
        _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default()
        with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo'); _make_dep_token_default = None
        getitem: "f32[0]" = with_effects[0]; with_effects = None
        add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None
        with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None
        getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None
        _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,)); getitem_2 = None
        return (add,)
```
When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which is just a `FallbackKernel` but with a pointer to the previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The inductor generated python code looks like:
```
def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    # Source Nodes: [_print], Original ATen: []
    buf2 = aten._print.default('moo')
    # Source Nodes: [_print_1], Original ATen: []
    buf3 = aten._print.default('moo')
    buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, buf4)
    del arg1_1
    return (buf4, )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347
Approved by: https://github.com/bdhirsh
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
* `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
* This op is implemented on the Python side using torch.library so we can return a subclass instance
* `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
* The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
* `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
* `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
* Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)
With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as
* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values
https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis.
This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.
Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, which is a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.
# Design
We maintain a flag for each backend to manage the lazy initialization state separately.
# Additional Context
No additional UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re
with open("error_file.txt", "r") as f:
    errors = f.readlines()
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
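An illustrative example of the rewrite this rule performs:
```python
keys = ["a", "b", "c"]
# before: dict comprehension with a constant value
d1 = {k: None for k in keys}
# after: faster, and makes the shared static value obvious
d2 = dict.fromkeys(keys)  # or dict.fromkeys(keys, 0) for a non-None value
```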
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
All single element list types are `Tensor[]` so they will always be Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run so no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.
Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Differential Revision: [D52814561](https://our.internmc.facebook.com/intern/diff/D52814561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
* Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)
@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.
So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
Part 1 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).
The following functional inverses are currently implemented scatter-style and thus never return views:
* `as_strided_copy_inverse()`
* `diagonal_copy_inverse()`
* `expand_copy_inverse()`
* `select_copy_int_inverse()`
* `slice_copy_Tensor_inverse()`
* `split_copy_Tensor_inverse()`
* `split_with_sizes_copy_inverse()`
* `unbind_copy_int_inverse()`
* `unfold_copy_inverse()`
We need to get actual views for the introduction of reverse view funcs coming next.
Details:
* Use `as_strided()` to implement actual view inverses for the above
* Assumes we're given a mutated_view that is actually part of a bigger storage; this isn't really the case for functionalization
* Introduce `InverseReturnMode` enum for customization of functional inverses
* `AlwaysView` - always return an actual view; needed for reverse view_funcs()
* `NeverView` - always do a copy; useful for certain functionalization use cases (e.g. XLA, executorch)
* `ViewOrScatterInverse` - return an actual view in most cases, but prefer scatter inverses when they exist. this avoids the need to implement `as_strided()` for subclasses, which can be difficult or impossible
* Make sure functionalization works as before
* Use `ViewOrScatterInverse` when reapply_views TLS is True or `NeverView` otherwise
* Adds tests to ensure old behavior for above inverses **in functionalization**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115893
Approved by: https://github.com/bdhirsh
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies that could be accomplished either via a list constructor or extend call. All test cases have noqa added since performance is not as sensitive in that folder.
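An illustrative example of the kind of rewrite PERF402 suggests:
```python
src = [1, 2, 3]
# before: copying via an explicit loop
out = []
for x in src:
    out.append(x)
# after: the same copy via the list constructor (or out.extend(src))
out = list(src)
```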
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
In this PR, we are implementing Functionalization on pre-dispatch graph. Today, every dispatch key except for Dispatchkey.Python has a dedicated mode stack in python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to Dispatchkey.PreDispatch mode stack and handle the dispatching logic in python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode on DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode. (this is very similar to how post-dispatch tracing work). Here are some design decisions we made for this flow to work:
1. FunctionalTensorMode internally calls C++ functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into PreDispatch key. We solve this by directly dispatching to C++ Functionalize key.
2. We delete mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch and it is in general not safe to have a plain list because FunctionalTensorMode and ProxyTorchDispatchMode ordering matter and it is hard to enforce it on plain list. Instead, now we have a private class that tracks PreDispatch mode stack.
3. We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.
Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
Summary:
cuSPARSELt has support for different alg_ids, which are set via
`cusparseLtMatmulAlgSetAttribute`; in total there are 4 different
alg_ids, 0 - 3.
Previously we were just using the default alg_id: from our initial
experiments we found that for most shapes the default alg_id is the
fastest, and that alg_ids make no difference to numerical correctness,
only performance. From our previous experiments, the fastest alg_id
seemed to differ only for small matmul shapes.
danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).
However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```
This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
a given matmul
_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.
We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.
Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
Summary: Add two pieces of logic:
1. If the custom op is returning a `Tensor` but also doesn't have an out tensor as input, return an empty tensor.
2. If the custom op is returning more than one Tensor and the number of out tensors is not the same as the number of returned Tensors, return a tuple of empty tensors.
Test Plan: Rely on new unit tests
Differential Revision: D51471651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114143
Approved by: https://github.com/cccclai
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:
(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)
(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.
(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.
(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).
I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation() ` and (new) `has_data_mutation()`, which can more accurately distinguish between data-mutation vs. `set_()` calls vs. metadata-mutation
**This PR is still silently correct in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```
If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:
(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input
(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```
def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```
There are two things that make this difficult to do though:
(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.
(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.
It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
Summary:
This diff adds support in the ExecuTorch codegen layer to log the outputs of kernels to event_tracer. It does this by calling the `event_tracer_log_evalue` API.
When the `ET_EVENT_TRACER_ENABLED` flag is disabled this is essentially a no-op and will add no overhead.
Test Plan: CI
Reviewed By: larryliu0820
Differential Revision: D51534590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114584
Approved by: https://github.com/larryliu0820
Using mypy in code that depends on pytorch, I noticed that the type annotation doesn't allow a device ordinal.
`error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str | device" [arg-type]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647
Approved by: https://github.com/albanD
This PR is ALMOST basically just following the steps from #106677 EXCEPT! We do add one feature. Similar to fused_adam(w), for the CUDA dispatches: when the scalar tensor is on CPU, we .item and redispatch to the normal scalar overload. Otherwise, the cuda kernel will complain about mismatch in devices between the scalar and the tensors.
Why do we add this feature? Our optimizers want to allow lr as a tensor, and lr could be a CPU tensor. lr is used with foreach_div_ in Adam, so our CI will break otherwise.
After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs only a GPU tensor). They join the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same thing for foreach_add (the only other foreach op with a .Tensor overload) because there is no use case and will be more involved.
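A small illustrative example of what this enables (the CUDA path is the interesting one; on a CPU-only build the same call simply stays on CPU):
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
params = [torch.ones(3, device=device), torch.ones(5, device=device)]
lr = torch.tensor(0.1)                # 0-dim scalar tensor on CPU
out = torch._foreach_div(params, lr)  # CUDA dispatch .item()s the CPU scalar and redispatches
```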
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113688
Approved by: https://github.com/mlazos, https://github.com/albanD