Commit Graph

611 Commits

Author SHA1 Message Date
cyy
1595e755af [Reland] [Torchgen] Pass mutable to cpp.valuetype_type (#134549)
Reland of #121415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134549
Approved by: https://github.com/ezyang
2024-09-01 15:15:38 +00:00
Manuel Candales
caa04e0cae [ET] codegen: bool array as array ref (#134886)
Test Plan: CI

Differential Revision: D62046959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134886
Approved by: https://github.com/larryliu0820
2024-09-01 01:33:43 +00:00
Manuel Candales
cae817c862 [ET][CodeGen] Remove TORCH_API from NativeFunctions.h declarations (#134245)
Summary:
Remove TORCH_API from the generated executorch/kernels/portable/NativeFunctions.h declarations

These generated declarations are using ET tensors. They don't need to have the TORCH_API macro prefixed to them, since in this case TORCH_API is just empty. See [codegen/macros.h](https://www.internalfb.com/code/fbsource/[d12d7d3accfb12932368e0216124f2d735c51d73]/fbcode/executorch/codegen/macros.h)

Test Plan: CI

Differential Revision: D61490943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134245
Approved by: https://github.com/larryliu0820
2024-08-28 19:58:37 +00:00
chilli
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, and https://github.com/pytorch/pytorch/issues/108065.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
Yidi Wu
6835f20d20 [HOP] support generating schema for hop (#133521)
Add a way of generating a FunctionSchema from example values, because a HOP's schema can vary even for the same HOP.

We didn't use torch._C.FunctionSchema because the classes cannot be constructed directly (e.g. `__init__` cannot be used for torch._C.FunctionSchema), and extending the basic types in C++ is not straightforward.
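A hypothetical sketch of the idea, not the actual HOP API: derive a schema string from example values by inspecting their types.

```python
import torch

def schema_from_examples(name: str, args: tuple) -> str:
    # Map each example value to a schema type name (illustrative only).
    def fmt(i: int, a: object) -> str:
        ty = "Tensor" if isinstance(a, torch.Tensor) else type(a).__name__
        return f"{ty} arg{i}"

    arg_str = ", ".join(fmt(i, a) for i, a in enumerate(args))
    return f"{name}({arg_str}) -> Tensor"

print(schema_from_examples("my_hop", (torch.ones(2), 3)))
# my_hop(Tensor arg0, int arg1) -> Tensor
```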

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
2024-08-21 17:34:21 +00:00
Alnis Murtovi
8b8b4e5ae9 AutoHeuristic: documentation for mm (#133611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133611
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714, #133608
2024-08-16 16:20:38 +00:00
Alnis Murtovi
0e0077f3b6 AutoHeuristic: mm ranking heuristic h100 (#133608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133608
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714
2024-08-16 16:20:38 +00:00
Alnis Murtovi
e51c8ad369 AutoHeuristic: Heuristic that ranks choices for mm (#131714)
This PR adds a heuristic for tuned_mm that predicts the top 10 best choices. To be safe, aten.mm is always included.
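A hedged sketch of the shape of such a ranking heuristic; the names (`rank_mm_choices`, `score`) are illustrative, not the real tuned_mm integration:

```python
def rank_mm_choices(score, features, choices, k=10):
    # Rank all available choices by the heuristic's predicted score ...
    top = sorted(choices, key=lambda c: score(features, c), reverse=True)[:k]
    # ... but always keep the safe ATen fallback in the candidate set.
    if "aten.mm" not in top:
        top.append("aten.mm")
    return top

# Toy usage with a stand-in scoring function.
choices = [f"triton_cfg_{i}" for i in range(20)] + ["aten.mm"]
print(rank_mm_choices(lambda f, c: hash((f, c)) % 100, "m128_k1024", choices))
```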

Perf run: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2008%20Aug%202024%2020%3A20%3A28%20GMT&stopTime=Thu%2C%2015%20Aug%202024%2020%3A20%3A28%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/AlnisM/22/head&lCommit=905826f4ab5344efb0bcaa87e3b27a25299927ab&rBranch=main&rCommit=79ca596dc6ea16b6cdd0f2517451e19840717d37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131714
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710
2024-08-16 16:20:38 +00:00
Alnis Murtovi
add0f0085c AutoHeuristic: Support ranking/pruning choices (#131705)
This PR adds support in train_decision for learning a heuristic for ranking. The main idea is that the user provides the number of choices the heuristic should return; the learned decision tree is then pruned so that it always returns exactly that many choices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705
Approved by: https://github.com/eellison
2024-08-16 01:20:52 +00:00
Alnis Murtovi
5dfb22d4c8 AutoHeuristic: tests (#133496)
This PR adds tests to AutoHeuristic that ensure that when existing heuristics are re-generated, the generated code stays the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133496
Approved by: https://github.com/eellison
2024-08-15 19:22:44 +00:00
Alnis Murtovi
9876aa39c0 AutoHeuristic: pad_mm documentation (#133411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133411
Approved by: https://github.com/Chillee
ghstack dependencies: #133409, #133410
2024-08-15 10:49:56 +00:00
Alnis Murtovi
f32a9e953f AutoHeuristic: mixed_mm documentation (#133410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133410
Approved by: https://github.com/Chillee
ghstack dependencies: #133409
2024-08-15 10:49:56 +00:00
Alnis Murtovi
142353eca3 AutoHeuristic: util scripts (#133409)
This PR introduces scripts that make it easier to use autoheuristic:
- `collect_data.sh`: The user can specify things like the number of GPUs to be used and the number of training samples to collect. This script will open one tmux pane per GPU and collect num_training_samples/num_gpus samples per GPU.
- `merge_data.py`: This script can be used to merge multiple training data files into a single file (a sketch of this step follows below).
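A minimal sketch of what the merging step might look like, using pandas; the real `merge_data.py` may handle headers, metadata, and deduplication differently.

```python
import sys

import pandas as pd

def merge(inputs: list[str], output: str) -> None:
    # Concatenate all training-data CSVs into one frame and write it out.
    frames = [pd.read_csv(path) for path in inputs]
    pd.concat(frames, ignore_index=True).to_csv(output, index=False)

if __name__ == "__main__":
    *inputs, output = sys.argv[1:]
    merge(inputs, output)
```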

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133409
Approved by: https://github.com/Chillee
2024-08-15 10:49:56 +00:00
Alnis Murtovi
448d54ee92 AutoHeuristic: instructions (#132894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132894
Approved by: https://github.com/Chillee
2024-08-15 04:54:54 +00:00
Yuanhao Ji
378b12f3ad Improve namespace for c10::MemoryFormat::Contiguous in torchgen/api/cpp.py (#131622)
Top-level namespaces are more convenient for out-of-tree device extensions.

For example, now we have a patch for it in `torch_npu`:

98c50ced16/codegen/gen_backend_stubs.py (L772-L778)

```python
JIT_TO_CPP_DEFAULT["contiguous_format"] = "c10::MemoryFormat::Contiguous"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131622
Approved by: https://github.com/zou3519
2024-08-14 14:41:01 +00:00
Alnis Murtovi
f1c439cbed AutoHeuristic: refactoring (#133170)
This PR refactors train_decision.py and adds some basic logging, which I'll extend in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133170
Approved by: https://github.com/Chillee
2024-08-13 01:46:53 +00:00
Alnis Murtovi
21302d5891 AutoHeuristic: script to generate data for mm (#131617)
This PR introduces a script that can be used to generate training data for tuned_mm in order to learn a heuristic with AutoHeuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131617
Approved by: https://github.com/eellison
ghstack dependencies: #131615, #131616
2024-08-09 23:49:29 +00:00
Alnis Murtovi
383f2ac914 AutoHeuristic: mixed_mm H100 heuristic (#132685)
H100 heuristic for mixed_mm. Performance looks similar to the A100 heuristic.
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup  max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     1562    604     145   2311         1.522201          1.077722          10.399141            3.134170              1.034802               2061               2
 test  entropy          5              0.01      361    164      24    549         1.443590          1.079169           8.159173            3.105360              1.197973                500               2
```

gpt-fast speedups
|batch size|prompt length| fallback | heuristic | speedup |
|----------|-------------|---------:|----------:|--------:|
|     1    |      7      |   109.95 |    220.63 |    2    |
|     1    |     11      |   109.65 |    210.92 |  1.92   |
|     4    |      7      |   149.04 |    625.80 |  4.19   |
|     4    |     11      |   149.56 |    494.64 |  3.30   |
|     8    |      7      |   293.68 |    956.72 |  3.25   |
|     8    |     11      |   294.48 |    925.60 |  3.14   |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132685
Approved by: https://github.com/eellison
2024-08-07 23:48:01 +00:00
Alnis Murtovi
48929184e9 AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |
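A sketch of the substitution described above, assuming a gpt-fast-style module with a `weight` attribute (names are illustrative); the key point is transposing before the cast so that the mixed_mm pattern can match.

```python
import torch

class Int8Linear(torch.nn.Module):
    def __init__(self, weight: torch.Tensor) -> None:
        super().__init__()
        self.weight = weight  # e.g. int8 quantized weights

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # before: F.linear(input, self.weight.to(dtype=input.dtype))
        # after:  transpose first, then cast, then matmul
        return torch.matmul(input, self.weight.t().to(dtype=input.dtype))
```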

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly; in one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
2024-08-02 13:54:37 +00:00
cyy
b9cb1abf65 [12/N] Use std::optional (#132361)
Follows #132396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132361
Approved by: https://github.com/eqy
2024-08-02 13:46:46 +00:00
Oguz Ulgen
a6985c09cb Add None return type to init -- functorch and torchgen (#132351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132351
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335
2024-08-01 15:26:45 +00:00
PyTorch MergeBot
a28cda11ef Revert "AutoHeuristic: mixed_mm heuristic for A100 (#131613)"
This reverts commit 344c15a0bb.

Reverted https://github.com/pytorch/pytorch/pull/131613 on behalf of https://github.com/AlnisM due to lintrunner issues ([comment](https://github.com/pytorch/pytorch/pull/131613#issuecomment-2261884149))
2024-08-01 03:22:11 +00:00
Alnis Murtovi
344c15a0bb AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly; in one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
2024-08-01 02:25:54 +00:00
Alnis Murtovi
d3cefc9e3a AutoHeuristic: Collect data for mixed_mm (#131611)
This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things:

- Move pad_mm-related AutoHeuristic files into a subdirectory.
- Introduce an interface, `benchmark_runner.py`, that can be subclassed to add new scripts that run benchmarks and collect data with AutoHeuristic (see `gen_data_pad_mm.py` and `gen_data_mixed_mm.py`). The idea behind the interface is that it should make it easier to collect data for new optimizations, and thus easier to learn a heuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
2024-07-31 20:45:45 +00:00
JackCaoG
b40249b462 propagate XLA's metadata after functional sync (#131076)
Fixes https://github.com/pytorch/xla/issues/7174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131076
Approved by: https://github.com/bdhirsh
2024-07-31 18:20:00 +00:00
Yan Zhiwei
fe4f8e97cd [Intel GPU] xpu-ops codegen via backend whitelist (#130082)
# Motivation

This PR intends to enhance the codegen to allow generate codes for XPU backend.

XPU operators currently have to be registered by hand. Developers have no way to take advantage of the shared code that handles tensor meta setting (like strides, proxy output, and structured kernels). Manually porting code is error-prone and leads to high maintenance effort.

We utilize the backend_whitelist argument in `gen.py` to generate XPU needed headers and source codes.

# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.

We use the following command to generate XPU operators:

`python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU`

The difference lies in `--backend-whitelist=XPU`; backend-whitelist is an existing argument in torchgen.

The inputs to `gen.py` are code templates and an operators yaml. We share the same templates as `aten`. A simplified yaml lives in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`: no extra entries are added, and the format is the same as in `aten`.

# Result

All operator headers are generated independently in `build/xpu/ATen/ops`, so they do not affect operators declared/defined by CPU/CUDA or any other backend; XPU operators only include headers from this folder.

# Verification

* In `third_party/torch-xpu-ops`, we migrate all supported kernels to the structured-kernel style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
2024-07-31 16:31:38 +00:00
Nick Westlake
7124efa81b Include _native.h for structured_native_functions (#131208)
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". However, this renders the TORCH_API in those headers ineffective and prevents such functions from being used outside libtorch_cpu.so.

This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh
2024-07-24 02:55:36 +00:00
Alnis Murtovi
7f1cda1533 Autoheuristic: Do not store choices as metadata (#130304)
While for optimizations like pad_mm there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`).
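A sketch of what that lookup looks like in pandas; the file name is hypothetical.

```python
import pandas as pd

CHOICE_COL = "choice"  # global constant, as described above

df = pd.read_csv("autoheuristic_train.csv")  # hypothetical data file
available_choices = df[CHOICE_COL].unique()  # choices observed in the data
```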

In this PR, I also try to replace "choice" and "feedback" with global constants CHOICE_COL and FEEDBACK_COL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304
Approved by: https://github.com/eellison
2024-07-18 21:39:42 +00:00
Alnis Murtovi
d818c3319f Autoheuristic: add config options for specifying optimizations to collect data for and use heuristics (#130245)
Previously, it was only possible to either collect data or use a heuristic globally, regardless of where AutoHeuristic is used. This PR makes it possible to collect data for some optimizations while using a learned heuristic for other optimizations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130245
Approved by: https://github.com/shunting314
2024-07-18 01:04:36 +00:00
Xuehai Pan
f6838d521a [BE][Easy][5/19] enforce style for empty lines in import segments in tools/ and torchgen/ (#129756)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129756
Approved by: https://github.com/ezyang
2024-07-17 06:44:35 +00:00
Alnis Murtovi
50ef099ad0 Learn a heuristic to decide whether to pad before mm (#128643)
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating code for the regression tree.

The heuristics have been learned on artificial/random data collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` script can then be used to learn a heuristic and generate its code.

The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set.

The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.

On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.

The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.

The heuristic is learned as a regression tree that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100, if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
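A minimal sketch of the grid search described above, assuming scikit-learn-style decision trees and toy data; `correct` here is a stand-in for the real notion of a correct prediction.

```python
from itertools import product

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(200, 4)), rng.normal(size=200)  # toy data
X_val, y_val = rng.normal(size=(50, 4)), rng.normal(size=50)

best_model, best_correct = None, -1
for max_depth, min_samples_leaf in product([3, 5, 7], [5, 10, 20]):
    model = DecisionTreeRegressor(
        max_depth=max_depth, min_samples_leaf=min_samples_leaf
    ).fit(X_tr, y_tr)
    # Stand-in for "correct": the predicted score picks the same side
    # (pad vs. no pad) as the measured speedup.
    correct = int(np.sum(np.sign(model.predict(X_val)) == np.sign(y_val)))
    if correct > best_correct:
        best_model, best_correct = model, correct
```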

A100
```
       max_depth  min_samples_leaf dataset  correct  wrong  unsure  total  max_wrong_speedup  gman_wrong_speedup  threshold
15         5.0                10     train     2730      4    3023   5757           1.372220            1.193873   1.702530
16         5.0                10       val      878      0    1042   1920                NaN                 NaN   1.702530
17         5.0                10      test      925      2     993   1920           1.741708            1.354954   1.702530
18         5.0                10  hf-train       14      0      22     36                NaN                 NaN   1.702530
19         5.0                10    hf-inf        7      0       1      8                NaN                 NaN   1.702530
```

The numbers for huggingface only include tensors where each dim is >=512. If all tensors had been included, there would have been the following numbers of matmuls where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10

## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").

I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency.
Results on huggingface training
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.19 (+/- 0.00)  1.19 (+/- 0.00)         -0.00              40.33 (+/- 1.13)             40.95 (+/- 0.78)                     -0.62                     1.52                   3                         2
  BartForConditionalGeneration   1.53 (+/- 0.06)  1.47 (+/- 0.05)          0.06              81.93 (+/- 5.20)             82.23 (+/- 1.92)                     -0.30                     0.36                   3                         1
    BlenderbotSmallForCausalLM   1.86 (+/- 0.04)  1.86 (+/- 0.00)          0.00              36.76 (+/- 0.49)             37.62 (+/- 1.33)                     -0.87                     2.31                   3                         2
                     CamemBert   2.36 (+/- 0.01)  2.35 (+/- 0.01)          0.01              97.60 (+/- 1.91)             98.69 (+/- 1.35)                     -1.09                     1.11                   2                         1
                   DistillGPT2   2.57 (+/- 0.01)  2.57 (+/- 0.01)          0.00              57.33 (+/- 0.77)             58.26 (+/- 1.41)                     -0.93                     1.59                   3                         2
             PLBartForCausalLM   2.07 (+/- 0.01)  2.06 (+/- 0.01)          0.01              32.54 (+/- 0.83)             34.65 (+/- 0.71)                     -2.11                     6.10                   3                         2
PLBartForConditionalGeneration   1.87 (+/- 0.00)  1.88 (+/- 0.00)         -0.01              58.45 (+/- 1.24)             58.95 (+/- 1.92)                     -0.50                     0.85                   3                         1
            RobertaForCausalLM   2.39 (+/- 0.01)  2.40 (+/- 0.01)         -0.01              97.38 (+/- 1.52)             97.69 (+/- 1.18)                     -0.31                     0.32                   2                         1
              TrOCRForCausalLM   1.70 (+/- 0.00)  1.70 (+/- 0.00)         -0.00              44.79 (+/- 1.33)             45.25 (+/- 1.08)                     -0.46                     1.01                   3                         2

Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```

Results on huggingface inference
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.11 (+/- 0.00)  1.11 (+/- 0.00)          0.00              19.02 (+/- 0.28)             19.40 (+/- 0.35)                     -0.38                     1.95                   3                         2
  BartForConditionalGeneration   1.26 (+/- 0.01)  1.23 (+/- 0.03)          0.03              36.84 (+/- 0.40)             36.55 (+/- 0.75)                      0.30                    -0.81                   3                         1
    BlenderbotSmallForCausalLM   1.87 (+/- 0.02)  1.87 (+/- 0.01)          0.00              17.53 (+/- 0.31)             18.03 (+/- 0.43)                     -0.49                     2.74                   3                         2
                   DistillGPT2   2.50 (+/- 0.02)  2.50 (+/- 0.01)          0.00              16.16 (+/- 0.29)             16.40 (+/- 0.18)                     -0.24                     1.46                   3                         2
             PLBartForCausalLM   1.93 (+/- 0.01)  1.94 (+/- 0.01)         -0.00              15.30 (+/- 0.22)             16.01 (+/- 0.71)                     -0.71                     4.43                   3                         2
PLBartForConditionalGeneration   1.98 (+/- 0.01)  1.98 (+/- 0.01)          0.00              25.90 (+/- 0.32)             26.58 (+/- 0.62)                     -0.67                     2.53                   3                         1
              TrOCRForCausalLM   1.61 (+/- 0.00)  1.62 (+/- 0.00)         -0.01              21.38 (+/- 0.37)             21.85 (+/- 0.16)                     -0.47                     2.16                   3                         2

Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```

For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-07-15 23:04:06 +00:00
cyy
7c83f5f7d5 [8/N] Replace c10::optional with std::optional (#130509)
Follows #130510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130509
Approved by: https://github.com/ezyang
2024-07-13 13:05:36 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).
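A quick way to observe the difference on CPython (timings vary by machine and Python version):

```python
import timeit

# {} compiles to a single BUILD_MAP; dict() needs a name lookup plus a call.
print(timeit.timeit("{}", number=5_000_000))
print(timeit.timeit("dict()", number=5_000_000))
```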

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
cyy
7a3ab1fe79 [structural binding][7/N] Replace std::tie with structural binding (#130216)
Follows #120353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130216
Approved by: https://github.com/albanD
2024-07-10 00:52:04 +00:00
cyy
71efbf701d [3/N] Change #include <c10/util/Optional.h> to #include <optional> (#130300)
Follows #130236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130300
Approved by: https://github.com/ezyang
2024-07-09 13:32:57 +00:00
Xuehai Pan
d1d0a7080f [torchgen] reference generated comment to actual location of the generator and template (#130020)
As per title.

```diff
# torch/_VF.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/return_types.pyi

- # @generated from torch/_C/return_types.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in
```

```diff
# torch/_C/__init__.pyi

- # @generated from torch/_C/__init__.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in
```

```diff
# torch/_C/_nn.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in
```

```diff
# torch/_C/_VariableFunctions.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/nn/functional.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020
Approved by: https://github.com/ezyang
2024-07-05 21:47:14 +00:00
Xuehai Pan
735044191f [Easy] Add whitespace after comma when re-rendering tuple default value in schema (#129884)
The default value of `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`. There should be no space after `,` in `[0,1]`.

5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)

Then the default value is formatted to `(0,1)` in `pyi` files. This PR manually adds an extra space when re-rendering the default value to a string.

```python
", ".join(string.split(","))
```

```python
# before
def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ...
# after
def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884
Approved by: https://github.com/ezyang
2024-07-03 11:45:24 +00:00
eqy
f845a7a91a [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-30 19:22:16 +00:00
Xuehai Pan
4ee1cb9b95 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-30 01:36:07 +00:00
PyTorch MergeBot
2effbcfcd8 Revert "[BE][Easy] replace import pathlib with from pathlib import Path (#129426)"
This reverts commit 6d75604ef1.

Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to the need to recognize `Path` as a new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))
2024-06-29 23:24:06 +00:00
Xuehai Pan
6d75604ef1 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-29 15:42:09 +00:00
Xuehai Pan
9120992c72 [BE][Easy] enable postponed annotations in torchgen (#129376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376
Approved by: https://github.com/ezyang
ghstack dependencies: #129375
2024-06-29 09:23:39 +00:00
PyTorch MergeBot
3d96217891 Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 9e1f3ecaa7.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))
2024-06-29 00:47:15 +00:00
PyTorch MergeBot
6063bb9d45 Revert "[BE][Easy] enable postponed annotations in torchgen (#129376)"
This reverts commit 494057d6d4.

Reverted https://github.com/pytorch/pytorch/pull/129376 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:25 +00:00
Xuehai Pan
494057d6d4 [BE][Easy] enable postponed annotations in torchgen (#129376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376
Approved by: https://github.com/ezyang
ghstack dependencies: #129375
2024-06-28 15:37:57 +00:00
PyTorch MergeBot
999eec8dea Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit b7e7a4cb01.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))
2024-06-28 06:03:54 +00:00
Xuehai Pan
9e1f3ecaa7 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order (see the sketch after this list):

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
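A sketch illustrating rules 1-4 above on a typical layout (not the exact torchgen code):

```python
import os
from pathlib import Path

# before: walk up to the grandparent directory via chained dirname calls
root_old = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# after: resolve the path first, then walk up with .parents
root_new = str(Path(__file__).absolute().parents[1])
assert root_old == root_new  # equivalent for typical file layouts
```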

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-28 00:35:15 +00:00
PyTorch MergeBot
895316119d Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 0314c4c101.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))
2024-06-26 19:03:57 +00:00
Eddie Yan
b7e7a4cb01 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-26 00:49:18 +00:00
Xuehai Pan
0314c4c101 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-25 08:28:38 +00:00
Xuehai Pan
93a33bf3ac [BE] update type annotations for basic utilities in torch/__init__.py (#129001)
Changes:

1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard` (a sketch follows below).
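A sketch of the `TypeGuard`-based annotation from item 3, combined with the positional-only parameter from item 1; this is not the exact torch implementation (`TypeGuard` lives in `typing` on Python 3.10+, or `typing_extensions` on 3.8/3.9).

```python
from typing import Any, TypeGuard

import torch

def is_tensor(obj: Any, /) -> TypeGuard[torch.Tensor]:
    # The TypeGuard return type lets type checkers narrow `obj` to Tensor
    # in the branch where is_tensor(obj) is true.
    return isinstance(obj, torch.Tensor)
```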

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
2024-06-24 18:04:38 +00:00
PyTorch MergeBot
cb4919344a Revert "[BE] update type annotations for basic utilities in torch/__init__.py (#129001)"
This reverts commit e53d959028.

Reverted https://github.com/pytorch/pytorch/pull/129001 on behalf of https://github.com/XuehaiPan due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/129001#issuecomment-2186944549))
2024-06-24 16:18:43 +00:00
Xuehai Pan
e53d959028 [BE] update type annotations for basic utilities in torch/__init__.py (#129001)
Changes:

1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
2024-06-24 14:35:41 +00:00
Xuehai Pan
b697808056 [BE][Easy] eliminate relative import in torchgen (#128872)
Fix generated by:

```bash
ruff check --config 'lint.flake8-tidy-imports.ban-relative-imports="all"' --fix --select=TID $(fd '.pyi?$' torchgen)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128872
Approved by: https://github.com/zou3519
2024-06-21 14:11:46 +00:00
Colin Peppler
3a185778ed [aotinductor] Add torch.polar fallback op for shim v2 (#128722)
Compilation error:
```
$ TORCHINDUCTOR_C_SHIM_VERSION=2 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_LOGS_FORMAT="%(pathname)s:%(lineno)s: %(message)s" TORCH_LOGS="+output_code" python test/inductor/test_cpu_cpp_wrapper.py -k test_polar

/tmp/tmp2sp128xj/dy/cdypvu3hvgg3mwxydwbiuddsnmuoi37it3mrpjktcnu6vt4hr3ki.cpp:59:33: error: ‘aoti_torch_cpu_polar’ was not declared in this scope; did you mean ‘aoti_torch_cpu_topk’?
```

Steps:
1. Add aten.polar
2. run `python torchgen/gen.py --update-aoti-c-shim`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128722
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-19 05:06:58 +00:00
Aaron Orenstein
732b4e9074 Fix generated vararg types (#128648)
In the generated files, torchgen incorrectly generates the types of varargs.

The changes all look like this (changing `*size: _int` to `*size: Union[_int, SymInt]`):
```
--- ./torch/_VF.pyi.sav	2024-06-13 20:36:49.189664629 -0700
+++ ./torch/_VF.pyi	2024-06-13 20:36:57.208894614 -0700
@@ -168,17 +168,17 @@
 @overload
 def _efficientzerotensor(size: Sequence[Union[_int, SymInt]], *, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
 @overload
-def _efficientzerotensor(*size: _int, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
+def _efficientzerotensor(*size: Union[_int, SymInt], dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
 def _embedding_bag(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ...
 def _embedding_bag_forward_only(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ...
 @overload
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128648
Approved by: https://github.com/jamesjwu
2024-06-14 16:04:37 +00:00
cyy
3f9b8446cf [8/N] Remove unused functions (#128499)
Follows #128407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128499
Approved by: https://github.com/malfet
2024-06-13 01:15:11 +00:00
David Berard
29081059b6 [Static Runtime] Fix & run gen_static_runtime_ops (#128299)
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.

I added a number of ops to the blocklist:
```
+        "_nested_tensor_storage_offsets",
+        "_nested_get_values",  # no CPU backend
+        "_nested_get_values_copy",  # no CPU backend
+        "_nested_view_from_jagged",  # testing needs to be patched
+        "_nested_view_from_jagged_copy",  # testing needs to be patched
+        "_nested_view_from_buffer",  # testing needs to be patched
+        "_nested_view_from_buffer_copy",  # testing needs to be patched
+        "_int_mm",  # testing needs to be patched
+        "_to_sparse_csc",  # testing needs to be patched
+        "_to_sparse_csr",  # testing needs to be patched
+        "segment_reduce",  # testing needs to be patched
```

Most of these are added just because testing doesn't work right now.

Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.

Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
2024-06-11 16:27:39 +00:00
Daniil Kutz
b506d37331 Fix multiple errors while parsing NativeFunctions from YAML (#127413)
Fix multiple errors in parse_native_yaml when loading NativeFunctions from a YAML file.

Add assertions that validate the parsed data.

Fixes #127404, #127405, #127406, #127407, #127408, #127409, #127410, #127411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127413
Approved by: https://github.com/ezyang
2024-05-30 16:25:04 +00:00
Jane Xu
601c5e085d Add _foreach_max (#127187)
This PR adds _foreach_max support, the second reduction foreach op we have :D

I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first.

Caveats!
- We do not fast path if the shapes, dtypes, device, the regular shebang for foreach are not met. We fall back to slowpath!
- MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.
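A minimal usage sketch, assuming a PyTorch build that includes this PR:

```python
import torch

tensors = [torch.randn(3), torch.randn(5, 5), torch.randn(2)]
maxes = torch._foreach_max(tensors)  # one 0-dim max tensor per input
```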

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
2024-05-29 19:08:58 +00:00
Xuehai Pan
ba3b05fdf3 [1/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort stdlib (#127122)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
2024-05-25 08:25:50 +00:00
Bin Bao
71f1aebe1f [AOTI] Add more fallback ops (#126720)
Summary: These ops appear in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720
Approved by: https://github.com/chenyang78
2024-05-24 19:10:33 +00:00
PyTorch MergeBot
47c976b904 Revert "[AOTI] Add more fallback ops (#126720)"
This reverts commit 19cd4484ec.

Reverted https://github.com/pytorch/pytorch/pull/126720 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))
2024-05-24 09:07:07 +00:00
Bin Bao
19cd4484ec [AOTI] Add more fallback ops (#126720)
Summary: These ops appear in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720
Approved by: https://github.com/chenyang78
2024-05-22 15:33:24 +00:00
Bin Bao
0332b5812e [AOTI] Support InplaceBernoulliFallback in the ABI-compatible codegen (#126183)
Summary: Update the torchgen rule for inplace ops like bernoulli_, and update InplaceBernoulliFallback to codegen in the ABI-compatible mode. Fixes https://github.com/pytorch/pytorch/issues/121809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126183
Approved by: https://github.com/angelayi
ghstack dependencies: #126181, #126182
2024-05-16 17:07:06 +00:00
Bin Bao
c5f926ab87 [AOTI][torchgen] Support at::Generator via C shim (#126181)
Summary: Support at::Generator, which is used by many random number generator ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126181
Approved by: https://github.com/chenyang78
2024-05-16 17:06:53 +00:00
Bin Bao
ee8c1550d6 [AOTI][torchgen] Add a few more fallback ops (#126013)
Summary: They appear in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126013
Approved by: https://github.com/chenyang78
ghstack dependencies: #125962
2024-05-15 12:56:07 +00:00
Bin Bao
563aa3e035 [AOTI][torchgen] Update NativeFunctionsGroup mapping (#125962)
Summary: When looking up for what backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. Previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, and that's why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125962
Approved by: https://github.com/chenyang78
2024-05-15 12:56:07 +00:00
Aaron Gokaslan
34910f87f0 [BE]: Update ruff to v0.4.4 (#125031)
Update ruff version to 0.4.4. This version mostly has bugfixes for the new parser and also updates the f-string rule to be able to apply more fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125031
Approved by: https://github.com/albanD, https://github.com/malfet
2024-05-12 20:02:37 +00:00
Yukio Siraichi
02093b6c6a Keep track of ViewMeta with symbolic inputs. (#125876)
Fix: #125387

This PR keeps track of whether an instantiated `ViewMeta` has symbolic values as input or not. This is used to check whether we can use the AOTAutograd `ViewMeta`-replay execution path, which does not support tensors that have `ViewMeta` with symbolic inputs.

In summary, the changes are:

- Add the field `ViewMeta::has_symbolic_inputs` and make it a required constructor
parameter
- Add the field `FunctionalTensorWrapper::is_symbolic_` and the method
`FunctionalTensorWrapper::maybe_mark_symbolic`
    - Marks a `FunctionalTensorWrapper` as symbolic iff any of its `ViewMeta` have
    symbolic inputs
- Add the plumbing of `FunctionalTensorWrapper::is_symbolic` to the Python API
- Codegen the computation of `ViewMeta::has_symbolic_inputs` for each view operation
- Use the AOTAutograd `ViewMeta`-replay path if:
    - `target_functional_tensor` is not `None`; and
    - `target_functional_tensor` is not symbolic (instead of using a functorch config)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125876
Approved by: https://github.com/ezyang
2024-05-12 01:41:06 +00:00
Bin Bao
0dda3389e5 [AOTI][torchgen] Minor improvements to C shim torchgen (#125928)
Summary: Make some improvements to https://github.com/pytorch/pytorch/pull/125589
* Add a .default suffix to default ops in fallback_ops.py, to make it clear that those are OpOverload.
* Update warnings and comments based on feedback on https://github.com/pytorch/pytorch/pull/125589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125928
Approved by: https://github.com/angelayi
ghstack dependencies: #125291, #125730, #125731
2024-05-11 18:12:46 +00:00
Bin Bao
538877d204 [AOTI] Fix convolution_backward (#125730)
Summary: for https://github.com/pytorch/pytorch/issues/125922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125730
Approved by: https://github.com/chenyang78
ghstack dependencies: #125291
2024-05-10 20:13:34 +00:00
Bin Bao
ed48ea9997 [AOTI] Refine the C shim autogen mechanism (#125589)
Summary: Based on the discussions in https://github.com/pytorch/pytorch/pull/120513. Instead of auto-generating C shim fallback ops for thousands of ops, we maintain a list of fallback ops based on torch/_inductor/lowering.py and only generate C shim functions for those ops. At torchgen time, we re-generate the C shim files and compare the header file contents against the existing C shim headers. If there is any change, the compilation fails with a prompt on how to proceed. This keeps the ABI-compatible C shim layer small enough to maintain in the long run.
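A hypothetical sketch of the torchgen-time check described above; names and messages are illustrative.

```python
from pathlib import Path

def check_shim_header(generated: str, checked_in: Path) -> None:
    # Fail loudly if the checked-in header no longer matches what torchgen
    # would generate, so ABI changes must be reviewed explicitly.
    if checked_in.read_text() != generated:
        raise RuntimeError(
            f"{checked_in} no longer matches the generated C shim header; "
            "regenerate it and review the ABI-compatibility implications"
        )
```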

Differential Revision: [D57004046](https://our.internmc.facebook.com/intern/diff/D57004046)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125589
Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/albanD, https://github.com/ezyang
2024-05-09 02:48:16 +00:00
Huamin Li
303880e16b Update gen.py aoti_fm install dir (#125087)
Summary: make it consistent with all the other install dirs

Test Plan: Sandcastle

Differential Revision: D56660301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125087
Approved by: https://github.com/frank-wei
2024-04-29 08:25:16 +00:00
Aaron Gokaslan
2f3b0befed [BE]: Apply ruff FURB 118. (#124743)
Replaces various lambdas with operator.itemgetter, which is more efficient (it's a builtin function). Particularly useful when lambdas are used as 'key' functions.
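An example of the kind of rewrite this applies:

```python
import operator

pairs = [("b", 2), ("a", 1), ("c", 3)]
# before: sorted(pairs, key=lambda p: p[1])
print(sorted(pairs, key=operator.itemgetter(1)))
# [('a', 1), ('b', 2), ('c', 3)]
```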

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124743
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-26 14:34:52 +00:00
Edward Z. Yang
4c44e2b236 Improved unbacked SymInt input support in Inductor (#124739)
This is a subset of changes extracted from https://github.com/pytorch/pytorch/pull/124683/

This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed:

* When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols)
* Benchmark generation code doesn't work with unbacked symints as we have no hints to actually feed in real values. So I pick a random number and you are expected to fix it if it doesn't work
* Need to make sure we don't install dependencies on unbacked SymInt inputs; that puts us down the "promptly deallocate the input" path, but that's pointless for unbacked SymInt
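
A hedged repro-style sketch of the data-dependent-size scenario described above (assumed, not taken from the PR):
```
import torch

torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(dynamic=True)
def f(x):
    y = x[x > 0]           # data-dependent output size -> unbacked SymInt
    return (y * y).sum()   # y is saved for the backward pass

f(torch.randn(8, requires_grad=True)).backward()
```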

Fixes https://github.com/pytorch/pytorch/issues/124652

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124739
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316, #124394
2024-04-25 13:29:53 +00:00
Ashwin Hari
5f5778476a rename ort to maia (#123265)
Fixes #123264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265
Approved by: https://github.com/albanD
2024-04-23 00:33:25 +00:00
Aaron Gokaslan
c5fafe9f48 [BE]: TRY002 - Ban raising vanilla exceptions (#124570)
Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be RuntimeErrors, ValueErrors, TypeErrors, or some other more specific error. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR.

I also encourage people to gradually go and fix all the existing noqas that have been added, so they can be removed over time and our exception typing can be improved.
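
For example, the kind of rewrite the rule encourages:
```
def check_size(n):
    if n < 0:
        # before (flagged by TRY002): raise Exception("negative size")
        raise ValueError("negative size")  # a specific exception type instead
```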

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570
Approved by: https://github.com/ezyang
2024-04-21 22:26:40 +00:00
Aaron Gokaslan
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Uses set mutation methods instead of manually reimplementing them (update, difference_update, etc.).
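
For example:
```
s = {1, 2}
# before: for x in (3, 4): s.add(x)
s.update((3, 4))  # mutate in place instead of a manual loop
# similarly: s.difference_update(...), s.intersection_update(...)
```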

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
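
For example:
```
class Base:
    def run(self):
        # before (flagged by RSE): raise NotImplementedError()
        raise NotImplementedError  # parentheses dropped when no argument is passed
```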

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Aaron Gokaslan
1d6c5972c1 [BE]: Optimize min/max/sum comprehensions C419 (#123960)
Automatic fixes that replace certain list comprehensions with generator ones where appropriate, so that they are immediately consumed. This is preview functionality in ruff for rule C419, and it was automatically applied.
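
For example:
```
xs = [1, 2, 3]
# before: sum([x * x for x in xs])  # builds an intermediate list first
total = sum(x * x for x in xs)      # the generator is consumed directly
```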

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
2024-04-12 23:54:15 +00:00
Brian Hirsh
2fe672b146 compile: ban mutations on non-compositional uses of as_strided (#122502)
Fixes https://github.com/pytorch/pytorch/issues/104505

I was originally going to ban all usages of as_strided + mutation in functionalization. But I'm pretty sure that as_strided + mutation is fine when we are calling as_strided on a base tensor.

So in this PR I added a slightly more conservative check: if we see an as_strided + mutation, where the input to an as_strided was **another** view op, then I error loudly in functionalization and link to the github issue above (in case anyone runs into this in the real world)
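
A hedged sketch of the newly-banned pattern (an assumed repro, not taken from the PR):
```
import torch

@torch.compile
def f(x):
    v = x[1:]                     # v is itself a view of x
    w = v.as_strided((2,), (1,))  # as_strided on top of another view...
    w.mul_(2)                     # ...followed by a mutation
    return x

# f(torch.randn(4))  # expected to error loudly in functionalization
```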

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122502
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-04-12 01:12:23 +00:00
Aaron Orenstein
4044e93a51 Add mm_pattern and bmm_pattern to serialized_patterns (#121313)
Make it easier to serialize patterns by adding `pattern_matcher.gen_register_replacement()` which is like `pattern_matcher.register_replacement()` but also requires the replacement to be precompiled.

To precompile patterns (and save to disk) run:
```
torchgen/fuse_attention_patterns/gen_attention_patterns.py
```

- Updated the sfdp patterns to use `gen_register_replacement`.
- Add serialized patterns for mm_pattern and bmm_pattern (The 'misc' patterns don't serialize cleanly so can't be added).
- Updated the testing so it checks that the round-tripped patterns match, not just that they serialize the same way.
- Checking that the patterns round-trip properly found that the `users` field wasn't being serialized properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121313
Approved by: https://github.com/eellison
2024-04-09 19:42:19 +00:00
angelayi
493478db4a [effects] Add inductor support for tokens (#122347)
Given the following code/dynamo graph:
```
class GraphModule(torch.nn.Module):
    def forward(self, L_x_ : torch.Tensor):
        l_x_ = L_x_
        _print = torch.ops.aten._print('moo')
        res = l_x_ + l_x_;  l_x_ = None
        _print_1 = torch.ops.aten._print('moo')
        return (res,)
```

AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output:
```
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"):
        with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo');  arg0_1 = None
        getitem: "f32[0]" = with_effects[0];  with_effects = None
        add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
        with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
        getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
        return (getitem_2, add)
```
However, when we get to Inductor, we want the Inductor-generated code to not have any token inputs/outputs for better readability, so we modify the aten graph by removing the tokens from the inputs, creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators.
This has to be done *after* the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph.
```
class <lambda>(torch.nn.Module):
   def forward(self, arg1_1: "f32[2, 3]"):
       _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default()
       with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo');  _make_dep_token_default = None
       getitem: "f32[0]" = with_effects[0];  with_effects = None
       add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
       with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
       getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
       _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,));  getitem_2 = None
       return (add,)
```
When doing Inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which is just a `FallbackKernel` but with a pointer to the previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The Inductor-generated Python code looks like:
```
def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    # Source Nodes: [_print], Original ATen: []
    buf2 = aten._print.default('moo')
    # Source Nodes: [_print_1], Original ATen: []
    buf3 = aten._print.default('moo')
    buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, buf4)
    del arg1_1
    return (buf4, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347
Approved by: https://github.com/bdhirsh
2024-04-09 03:22:32 +00:00
rzou
067851dd0d Expand is_functional_schema to work with torch._C._FunctionSchema (#123108)
Previously it worked with torchgen.model.FunctionSchema. This PR extends
it to work with torch._C._FunctionSchema by making
torchgen.model.FunctionSchema look more like torch._C._FunctionSchema.
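
For illustration, a small hedged example of inspecting a `torch._C._FunctionSchema` directly:
```
import torch

schema = torch.ops.aten.add.Tensor._schema  # a torch._C.FunctionSchema
print(schema.name, schema.is_mutable)       # aten::add is functional: no mutation
```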

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108
Approved by: https://github.com/albanD
2024-04-05 22:03:39 +00:00
cyy
7423092227 [TorchGen] [2/N] Remove unused variables and simplify dictionary iterations (#122585)
This PR continues to remove unused variables and simplify dictionary iterations in TorchGen scripts, following #122576.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122585
Approved by: https://github.com/ezyang
2024-03-29 20:34:11 +00:00
cyy
fb90b4d4b2 [TorchGen] Use std::optional in generated code (#121454)
This PR changes TorchGen to generate std::optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121454
Approved by: https://github.com/ezyang
2024-03-29 14:11:09 +00:00
PyTorch MergeBot
b2c496ba24 Revert "[TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)"
This reverts commit c1fe09dc37.

Reverted https://github.com/pytorch/pytorch/pull/121415 on behalf of https://github.com/ezyang due to I think this needs to be reverted to after https://github.com/pytorch/pytorch/pull/120076 revert ([comment](https://github.com/pytorch/pytorch/pull/121415#issuecomment-2018828813))
2024-03-25 20:14:40 +00:00
PyTorch MergeBot
db506762d1 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit a52b4e2257.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2018680656))
2024-03-25 18:52:05 +00:00
cyy
a01d35c7f6 [TorchGen] Remove unused variables (#122576)
This PR removes some unused Python variables from TorchGen scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122576
Approved by: https://github.com/Skylion007
2024-03-25 03:31:41 +00:00
cyy
c1fe09dc37 [TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)
This PR is a follow-up of #120076; it moves std::optional&lt;Generator&gt; detection logic into `valuetype_type` of api/cpp.py by adding the mutable parameter, which facilitates future value type changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121415
Approved by: https://github.com/ezyang
2024-03-24 06:11:08 +00:00
cyy
a52b4e2257 Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-24 02:12:08 +00:00
PyTorch MergeBot
02fee6caec Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit ecbe82b9ce.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/jeanschmidt due to Reverting in order to check if this will fix XLA trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2015272644))
2024-03-22 14:53:45 +00:00
cyy
ecbe82b9ce Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-22 03:49:31 +00:00
Joel Schlosser
cd6bfc7965 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This op is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
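
For illustration (assumed public-API usage, not part of this PR's test plan):
```
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 5), torch.randn(3, 5)], layout=torch.jagged
)
vals = nt.values()  # a true view of the dense (5, 5) values buffer
```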

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-22 02:12:36 +00:00
drisspg
4ba51bb2c4 Add keys used for templated attention impls (#122423)
# Summary

Mypy will complain that these attributes don't exist for this PR: https://github.com/pytorch/pytorch/pull/121845/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122423
Approved by: https://github.com/bdhirsh
2024-03-21 22:16:53 +00:00
PyTorch MergeBot
224beecee6 Revert "Proper view support for jagged layout NestedTensor (#113279)"
This reverts commit 5855c490f0.

Reverted https://github.com/pytorch/pytorch/pull/113279 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113279#issuecomment-2013899762))
2024-03-21 22:03:01 +00:00
Joel Schlosser
5855c490f0 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This op is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-20 23:45:34 +00:00
Bin Bao
46493ee9b5 [AOTI][refactor] Update tensor_converter util functions (#121743)
Summary: Update the signature of unsafe_alloc_new_handles_from_tensors and alloc_tensors_by_stealing_from_handles. This is a preparation step towards adding pybind for these two functions, which will be used by cpp_wrapper JIT Inductor.

Differential Revision: [D54818717](https://our.internmc.facebook.com/intern/diff/D54818717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121743
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523
2024-03-14 22:17:54 +00:00
PyTorch MergeBot
c0996866f4 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit 4305c64fea.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/izaitsevfb due to breaking internal builds(take 3) ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-1986338164))
2024-03-08 20:01:03 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
Bin Bao
bd19d6d822 [AOTI] Use torchgen to generate C shim functions (#120513)
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as

* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values

https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops and, on the other hand, probably generates more wrappers than needed, but it should serve as a good basis.

This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.

Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
2024-03-05 04:28:44 +00:00
Jacob Szwejbka
a7c799fb85 [executorch] Add support for method variants in aten executorch code gen (#121016)
Summary: Title.

Test Plan: The added unittest

Differential Revision: D54423028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121016
Approved by: https://github.com/larryliu0820
2024-03-01 20:33:02 +00:00
Pearu Peterson
70d4d109f2 Make SparseCsr a functionality dispatch key (#120703)
As in the title.

To enable meta and fake tensor support for sparse compressed tensors, in line with the existing meta/fake tensor support for sparse COO tensors.
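
A hedged illustration of the kind of usage this unblocks (assumed API behavior; invariant checks are skipped since meta tensors hold no data):
```
import torch

crow = torch.tensor([0, 2, 4], device="meta")
col = torch.tensor([0, 1, 0, 1], device="meta")
vals = torch.empty(4, device="meta")
sp = torch.sparse_csr_tensor(crow, col, vals, size=(2, 2), check_invariants=False)
assert sp.is_meta
```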

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120703
Approved by: https://github.com/ezyang
2024-03-01 13:28:46 +00:00
angelayi
f064dec7e0 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-27 01:34:59 +00:00
PyTorch MergeBot
b01bd1f7a1 Revert "Add torch.ops.aten.print (#120295)"
This reverts commit 3b944113c8.

Reverted https://github.com/pytorch/pytorch/pull/120295 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54123688 ([comment](https://github.com/pytorch/pytorch/pull/120295#issuecomment-1965618191))
2024-02-27 01:18:48 +00:00
PyTorch MergeBot
8a32a07856 Revert "Add meta device support to sparse compressed tensors (#120498)"
This reverts commit 5d71ba6885.

Reverted https://github.com/pytorch/pytorch/pull/120498 on behalf of https://github.com/zou3519 due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120498#issuecomment-1964491999))
2024-02-26 15:59:36 +00:00
Pearu Peterson
5d71ba6885 Add meta device support to sparse compressed tensors (#120498)
As in the title.

Unblocks https://github.com/pytorch/pytorch/pull/117907#discussion_r1499251745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120498
Approved by: https://github.com/ezyang
2024-02-25 16:50:17 +00:00
Aaron Gokaslan
33938cfddd [BE][Ez] Update ruff to 0.2.2 (#120517)
Updates ruff to 0.2.2. This updates the config and handles some of the new rules that have come out of preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120517
Approved by: https://github.com/albanD
2024-02-24 07:13:53 +00:00
Isuru Fernando
c3496d50f0 Fix torch.return_types init signature (#119284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119284
Approved by: https://github.com/peterbell10, https://github.com/XuehaiPan
2024-02-23 21:52:34 +00:00
angelayi
3b944113c8 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-23 17:01:22 +00:00
Yu, Guangye
5c46600f84 [RELAND] refactor lazy init to device-agnostic (#119248)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.
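
A minimal sketch of that per-backend flag (names are hypothetical, not the actual torch internals):
```
_lazy_init_done = {}

def _backend_init(device_type):
    print(f"initializing {device_type}")  # stand-in for the real backend init

def maybe_initialize_device(device_type):
    if not _lazy_init_done.get(device_type, False):
        _backend_init(device_type)
        _lazy_init_done[device_type] = True
```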

# Additional Context
No more UTs are needed.
This is a reland PR; the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846).
This is a common PR, and does not trigger xpu ciflow.

Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman
2024-02-07 15:58:51 +00:00
PyTorch MergeBot
ab613a4019 Revert "refactor lazy init to device-agnostic (#118846)"
This reverts commit 520771d7b3.

Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11  ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))
2024-02-05 18:06:30 +00:00
Yu, Guangye
520771d7b3 refactor lazy init to device-agnostic (#118846)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No more UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
2024-02-02 12:10:39 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
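
For example:
```
keys = ["a", "b", "c"]
# before: {k: 0 for k in keys}
counts = dict.fromkeys(keys, 0)  # same result; faster, and the shared value is obvious
```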

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Masaki Kozuki
67d8db9252 Remove semicolon after return_from_mutable_noop_redispatch (#118538)
[`return_from_mutable_noop_redispatch`](65f8276bc6/torchgen/gen_functionalization_type.py (L477)) calls
[`return_str`](65f8276bc6/torchgen/gen_functionalization_type.py (L159-L166)). `return_str`'s output includes `;` so I think the semicolon after the callsite of `return_from_mutable_noop_redispatch` is not needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118538
Approved by: https://github.com/colesbury
2024-01-30 02:22:42 +00:00
Edward Z. Yang
119b66ba16 Use strict to toggle strict options in MYPYSTRICT (#118479)
As we force a specific version of mypy, it's OK to use the agglomerated flag.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118479
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475
2024-01-28 19:22:22 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
albanD
24133e44b1 Fix return type hint for list types (#118238)
All single-element list return types are `Tensor[]`, so they will always be returned as a Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run, so no testing here :(
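
For example:
```
import torch

chunks = torch.arange(6).split(2)  # split is declared as returning Tensor[]
assert isinstance(chunks, tuple)   # and indeed comes back as a tuple
```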
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
2024-01-25 23:35:20 +00:00
Sam Larsen
40a6710ad3 Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`

Differential Revision: [D52814561](https://our.internmc.facebook.com/intern/diff/D52814561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2024-01-17 15:32:18 +00:00
Joel Schlosser
5aac95c713 Introduce slice_inverse() op (#117041)
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
    * Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)
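
A hedged sketch of the underlying idea, undoing a slice by re-viewing it through `as_strided()` (illustrative only, not the actual ATen implementation):
```
import torch

base = torch.arange(10.)
view = base[2:8]  # a slice view into base

# "inverse": view the sliced tensor back as the base's full geometry
recovered = view.as_strided(base.size(), base.stride(), 0)
assert torch.equal(recovered, base)
```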

@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
2024-01-16 23:44:54 +00:00
Edward Z. Yang
003c900d5e Add _assert_scalar (#117378)
Peeled off from https://github.com/pytorch/pytorch/pull/114148, because that PR is going to take a while to actually land.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117378
Approved by: https://github.com/jansel
2024-01-14 00:50:36 +00:00
PyTorch MergeBot
1174e82bde Revert "Add _assert_scalar and teach Inductor to codegen it (#114148)"
This reverts commit b6028acfa4.

Reverted https://github.com/pytorch/pytorch/pull/114148 on behalf of https://github.com/osalpekar due to Going to revert this given the broken torchrec PT2 tests internally: [D52648865](https://www.internalfb.com/diff/D52648865). Logs aren't too clear but @dstaay-fb can help debug as well ([comment](https://github.com/pytorch/pytorch/pull/114148#issuecomment-1886100368))
2024-01-11 02:30:22 +00:00
Joel Schlosser
16d69290c6 Use view name instead of view_copy name for functional inverses (#117056)
Ex: `unsqueeze_copy_inverse()` -> `unsqueeze_inverse()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117056
Approved by: https://github.com/bdhirsh
2024-01-10 00:52:36 +00:00
Edward Z. Yang
b6028acfa4 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.
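
A hedged usage sketch; the schema is assumed to be `(Scalar, str) -> ()`:
```
import torch

torch.ops.aten._assert_scalar(True, "expected condition to hold")  # passes silently
# torch.ops.aten._assert_scalar(False, "boom")  # would raise at runtime
```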

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
2024-01-09 23:21:26 +00:00
Joel Schlosser
52f0457d7d Support view returns for functional inverses on narrowing views (#115893)
Part 1 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

The following functional inverses are currently implemented scatter-style and thus never return views:
* `as_strided_copy_inverse()`
* `diagonal_copy_inverse()`
* `expand_copy_inverse()`
* `select_copy_int_inverse()`
* `slice_copy_Tensor_inverse()`
* `split_copy_Tensor_inverse()`
* `split_with_sizes_copy_inverse()`
* `unbind_copy_int_inverse()`
* `unfold_copy_inverse()`

We need to get actual views for the introduction of reverse view funcs coming next.

Details:
* Use `as_strided()` to implement actual view inverses for the above
    * Assumes we're given a mutated_view that is actually part of a bigger storage; this isn't really the case for functionalization
* Introduce `InverseReturnMode` enum for customization of functional inverses
    * `AlwaysView` - always return an actual view; needed for reverse view_funcs()
    * `NeverView` - always do a copy; useful for certain functionalization use cases (e.g. XLA, executorch)
    * `ViewOrScatterInverse` - return an actual view in most cases, but prefer scatter inverses when they exist. this avoids the need to implement `as_strided()` for subclasses, which can be difficult or impossible
* Make sure functionalization works as before
    * Use `ViewOrScatterInverse` when reapply_views TLS is True or `NeverView` otherwise
    * Adds tests to ensure old behavior for above inverses **in functionalization**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115893
Approved by: https://github.com/bdhirsh
2023-12-21 21:39:22 +00:00
PyTorch MergeBot
497777e302 Revert "Mark set_ as an inplace view op (#115769)"
This reverts commit cd449e260c.

Reverted https://github.com/pytorch/pytorch/pull/115769 on behalf of https://github.com/jeanschmidt due to breaking landing signals internally, more details on the diff, author is tagged ([comment](https://github.com/pytorch/pytorch/pull/115769#issuecomment-1866846607))
2023-12-21 19:53:32 +00:00
Aaron Gokaslan
ee5d981249 [BE]: Enable RUFF PERF402 and apply fixes (#115505)
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies whose work could be accomplished either via a list constructor or an extend call. All test cases have noqa added, since performance is not as sensitive in that folder.
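
For example:
```
src = [1, 2, 3]
# before (flagged by PERF402):
# dst = []
# for x in src:
#     dst.append(x)
dst = list(src)  # or dst.extend(src) when dst already exists
```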

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
2023-12-20 18:01:24 +00:00
Sam Larsen
cd449e260c Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2023-12-19 23:08:05 +00:00
Tugsbayasgalan Manlaibaatar
d85314c95c Support Predispatch functionalization (#113728)
In this PR, we are implementing Functionalization on the pre-dispatch graph. Today, every dispatch key except for DispatchKey.Python has a dedicated mode stack in Python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode onto the DispatchKey.PreDispatch mode stack and handling the dispatching logic in Python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode onto the DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode (this is very similar to how post-dispatch tracing works). Here are some design decisions we made for this flow to work:

1. FunctionalTensorMode internally calls the C++ Functionalize key. Since C++ functionalization runs after PreDispatch, if we are not careful, we will keep re-entering the PreDispatch key. We solve this by directly dispatching to the C++ Functionalize key.

2. We delete the mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch, and it is in general not safe to have a plain list: FunctionalTensorMode and ProxyTorchDispatchMode ordering matters and is hard to enforce on a plain list. Instead, we now have a private class that tracks the PreDispatch mode stack.

3. We will still run CompositeImplicitAutograd decomps in this PR, and will disable this logic later as a follow-up.

Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph, but in a "non-functional" way.
2. Turning off CompositeImplicitAutograd decomps
3. Functionalizing HOOs (higher-order ops)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
2023-12-19 20:28:35 +00:00
cdzhan
99554112d3 [pytorch] add namespace for optTypeMetaToScalarType in codegen to avoid not declared when compile (#115623)
Fixes compilation failures in some environments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115623
Approved by: https://github.com/albanD
2023-12-13 00:59:01 +00:00
Jesse Cai
4471fe6c39 [sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_sparse_mm_search (#115178)
Summary:

cuSPARSELt has support for different alg_ids, which are set via
`cusparseLtMatmulAlgSetAttribute`; in total there are 4 different
alg_ids, 0 - 3.

Previously we were just using the default alg_id, as from our initial
experiments we found that for most shapes the default alg_id is the
fastest, and that alg_id made no difference to numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.

danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).

However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```

This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
  a given matmul

_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.
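
A hedged sketch of the resulting flow; these private-op signatures are assumptions, not copied from the PR:
```
import torch

# the sparse operand must already be 2:4 semi-structured
A = torch.tensor([0, 0, 1, 1], dtype=torch.float16, device="cuda").tile(128, 32)
B = torch.randn(128, 128, dtype=torch.float16, device="cuda")

A_c = torch._cslt_compress(A)                  # compress the 2:4 sparse operand
alg_id = torch._cslt_sparse_mm_search(A_c, B)  # find the fastest alg_id
out = torch._cslt_sparse_mm(A_c, B, alg_id=alg_id)
```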

We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.

Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
2023-12-11 23:08:51 +00:00
PyTorch MergeBot
40a14e07ef Revert "[sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178)"
This reverts commit 1e5636f791.

Reverted https://github.com/pytorch/pytorch/pull/115178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the Window build failure looks legit 1e5636f791 ([comment](https://github.com/pytorch/pytorch/pull/115178#issuecomment-1850605711))
2023-12-11 18:07:17 +00:00
Jesse Cai
1e5636f791 [sparse][semi-structured] add alg_id to _cslt_sparse_mm and _cslt_spasre_mm_search (#115178)
Summary:

cuSPARSELt has support for different alg_ids, which are set via
`cusparseLtMatmulAlgSetAttribute`; in total there are 4 different
alg_ids, 0 - 3.

Previously we were just using the default alg_id, as from our initial
experiments we found that for most shapes the default alg_id is the
fastest, and that alg_id made no difference to numerical correctness, just
performance. From our previous experiments the fastest alg_id seemed to
differ only on small matmul shapes.

danthe3rd found a performance regression when running with
cuSPARSELt v0.4.0 vs v0.5.0, on LLM shapes, which match these
characteristics (activations are small, weights are large).

However it's likely that this is due to the alg_id ordering changing, as
mentioned in the release notes for v0.5.0.
```
cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of
algorithm id alg as in v0.4.0.
```

This PR adds in the following:
- support for passing in alg_id to _cslt_sparse_mm
- a new op, _cslt_sparse_mm_search, which returns the optimal alg_id for
  a given matmul

_cslt_sparse_mm_search has the same function signature as
_cslt_sparse_mm, minus the alg_id parameter.
We are able to achieve v0.4.0 performance with alg_id=1 on the shapes
that daniel provided.

We will address autoselecting the best alg_id in a future PR, possibly
with torch.compile.

Test Plan:
```
python test/test_sparse_semi_structured -k cslt
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115178
Approved by: https://github.com/cpuhrsch
2023-12-11 15:47:28 +00:00
Mengwei Liu
898554a3a3 [torchgen] Add logic in custom ops to return empty tensor (#114143)
Summary: Add two pieces of logic:

1. If the custom op returns a `Tensor` but doesn't have an out tensor as input, return an empty tensor.
2. If the custom op returns more than one Tensor and the number of out tensors doesn't match the number of returned Tensors, return a tuple of empty tensors.

Test Plan: Rely on new unit tests

Differential Revision: D51471651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114143
Approved by: https://github.com/cccclai
2023-12-08 17:03:44 +00:00
voznesenskym
ddf1cb7870 AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym 's FSDP branch to plumb `set_()` through AOTAutograd properly and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations, and metadata mutations due to `set_()` calls. This is mainly because at runtime, one requires calling `as_strided_()` to fix up metadata, while the other requires calling `set_()`.

(4) Made AOTAutograd's detection for metadata mutations / set_() mutations smarter and detect no-ops (if the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, and replaced them with (existing) `has_data_mutation()` and (new) `has_metadata_mutation()`, which can more accurately distinguish between data mutations vs. `set_()` calls vs. metadata mutations

**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data-mutation **and** a `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if x_new is a graph intermediate then we will functionalize the mutation. If x_new is a different graph input, then we will perform the usual `copy_()` on that other graph input

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```

def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations; a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-28 19:33:35 +00:00
Tarun Karuturi
39f16c221e Adding event_tracer evalue logging calls in codegen (#114584)
Summary:
This diff adds support in the ExecuTorch codegen layer to log the outputs of kernels to event_tracer. It does this by calling the `event_tracer_log_evalue` API.

When the `ET_EVENT_TRACER_ENABLED` flag is disabled this is essentially a no-op and will add no overhead.

Test Plan: CI

Reviewed By: larryliu0820

Differential Revision: D51534590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114584
Approved by: https://github.com/larryliu0820
2023-11-28 18:32:05 +00:00
Nikita Shulga
7c98bac4a0 [BE] Speedup register schema compilation (#114438)
For some reason, inlining an initializer list into a std::vector takes a lot of time using clang-15. But considering that there are only a dozen or so distinct tags, creating them once and passing them as a def argument should not affect runtime speed at all, while significantly improving compilation time. On Mac M1 it reduces the time needed to compile RegisterSchema.cpp from 50 to 3 seconds.

Special-case empty tags to keep torchgen tests happy.

Before
```
% /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -ftime-report -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/Users/nshulga/git/pytorch/pytorch/build/aten/src -I/Users/nshulga/git/pytorch/pytorch/aten/src -I/Users/nshulga/git/pytorch/pytorch/build -I/Users/nshulga/git/pytorch/pytorch -I/Users/nshulga/git/pytorch/pytorch/cmake/../third_party/benchmark/include -I/Users/nshulga/git/pytorch/pytorch/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/build/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/build/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include -I/Users/nshulga/git/pytorch/pytorch/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/../aten/src -I/Users/nshulga/git/pytorch/pytorch/torch/csrc -I/Users/nshulga/git/pytorch/pytorch/third_party/miniz-2.1.0 -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/include -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/src -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/FXdiv/include -I/Users/nshulga/git/pytorch/pytorch/c10/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/pthreadpool/include -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/deps/clog/include -I/Users/nshulga/git/pytorch/pytorch/third_party/NNPACK/include -I/Users/nshulga/git/pytorch/pytorch/third_party/FP16/include -I/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include -I/Users/nshulga/git/pytorch/pytorch/third_party/flatbuffers/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googletest/include -isystem /Users/nshulga/git/pytorch/pytorch/third_party/protobuf/src -isystem /Users/nshulga/git/pytorch/pytorch/third_party/XNNPACK/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/eigen -isystem /Users/nshulga/git/pytorch/pytorch/build/include  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=braced-scalar-init -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wvla-extension -Wsuggest-override -Wnewline-eof -Winconsistent-missing-override -Winconsistent-missing-destructor-override -Wno-pass-failed -Wno-error=pedantic 
-Wno-error=old-style-cast -Wno-error=inconsistent-missing-override -Wno-error=inconsistent-missing-destructor-override -Wconstant-conversion -Wno-invalid-partial-specialization -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -Werror -Wno-unused-but-set-variable -fno-math-errno -fno-trapping-math -Werror=format -DUSE_MPS -Wno-unused-private-field -Wno-missing-braces -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk -fPIC -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-unused-function -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -fvisibility=hidden -O2 -Wmissing-prototypes -Werror=missing-prototypes -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -std=gnu++17 -Wno-missing-prototypes -Wno-error=missing-prototypes -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterSchema.cpp.o -c /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/RegisterSchema.cpp
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 131.8054 seconds (132.5540 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
  43.6364 ( 33.2%)   0.0919 ( 30.1%)  43.7282 ( 33.2%)  43.9658 ( 33.2%)  536345245380  ModuleInlinerWrapperPass
  43.6291 ( 33.2%)   0.0891 ( 29.2%)  43.7182 ( 33.2%)  43.9549 ( 33.2%)  536264096394  DevirtSCCRepeatedPass
  42.3766 ( 32.2%)   0.0185 (  6.1%)  42.3951 ( 32.2%)  42.6198 ( 32.2%)  523040901767  GVNPass
   0.4085 (  0.3%)   0.0040 (  1.3%)   0.4125 (  0.3%)   0.4195 (  0.3%)  4106085945  SimplifyCFGPass
   0.3611 (  0.3%)   0.0115 (  3.8%)   0.3726 (  0.3%)   0.3779 (  0.3%)  4864696407  InstCombinePass
   0.1607 (  0.1%)   0.0088 (  2.9%)   0.1695 (  0.1%)   0.1720 (  0.1%)  1780986175  InlinerPass
   0.0865 (  0.1%)   0.0024 (  0.8%)   0.0889 (  0.1%)   0.0914 (  0.1%)  1489982961  SROAPass
   0.0750 (  0.1%)   0.0013 (  0.4%)   0.0763 (  0.1%)   0.0764 (  0.1%)  620016338  SCCPPass
   0.0661 (  0.1%)   0.0040 (  1.3%)   0.0701 (  0.1%)   0.0735 (  0.1%)  592027163  EarlyCSEPass
...
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 48.2802 seconds (48.8638 wall clock)
...
 ```

After
```
% /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -ftime-report -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/Users/nshulga/git/pytorch/pytorch/build/aten/src -I/Users/nshulga/git/pytorch/pytorch/aten/src -I/Users/nshulga/git/pytorch/pytorch/build -I/Users/nshulga/git/pytorch/pytorch -I/Users/nshulga/git/pytorch/pytorch/cmake/../third_party/benchmark/include -I/Users/nshulga/git/pytorch/pytorch/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/build/third_party/onnx -I/Users/nshulga/git/pytorch/pytorch/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/build/third_party/foxi -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api -I/Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include -I/Users/nshulga/git/pytorch/pytorch/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src/TH -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/aten/src -I/Users/nshulga/git/pytorch/pytorch/build/caffe2/../aten/src -I/Users/nshulga/git/pytorch/pytorch/torch/csrc -I/Users/nshulga/git/pytorch/pytorch/third_party/miniz-2.1.0 -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/include -I/Users/nshulga/git/pytorch/pytorch/third_party/kineto/libkineto/src -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/FXdiv/include -I/Users/nshulga/git/pytorch/pytorch/c10/.. -I/Users/nshulga/git/pytorch/pytorch/third_party/pthreadpool/include -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/Users/nshulga/git/pytorch/pytorch/third_party/cpuinfo/deps/clog/include -I/Users/nshulga/git/pytorch/pytorch/third_party/NNPACK/include -I/Users/nshulga/git/pytorch/pytorch/third_party/FP16/include -I/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include -I/Users/nshulga/git/pytorch/pytorch/third_party/flatbuffers/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/googletest/googletest/include -isystem /Users/nshulga/git/pytorch/pytorch/third_party/protobuf/src -isystem /Users/nshulga/git/pytorch/pytorch/third_party/XNNPACK/include -isystem /Users/nshulga/git/pytorch/pytorch/cmake/../third_party/eigen -isystem /Users/nshulga/git/pytorch/pytorch/build/include  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=braced-scalar-init -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wvla-extension -Wsuggest-override -Wnewline-eof -Winconsistent-missing-override -Winconsistent-missing-destructor-override -Wno-pass-failed -Wno-error=pedantic 
-Wno-error=old-style-cast -Wno-error=inconsistent-missing-override -Wno-error=inconsistent-missing-destructor-override -Wconstant-conversion -Wno-invalid-partial-specialization -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -Werror -Wno-unused-but-set-variable -fno-math-errno -fno-trapping-math -Werror=format -DUSE_MPS -Wno-unused-private-field -Wno-missing-braces -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.0.sdk -fPIC -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-unused-function -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -fvisibility=hidden -O2 -Wmissing-prototypes -Werror=missing-prototypes -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -std=gnu++17 -Wno-missing-prototypes -Wno-error=missing-prototypes -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterSchema.cpp.o -c /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/RegisterSchema.cpp
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 1.2920 seconds (1.3187 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
   0.3070 ( 27.6%)   0.0547 ( 30.2%)   0.3617 ( 28.0%)   0.3654 ( 27.7%)  3719690895  ModuleInlinerWrapperPass
   0.3024 ( 27.2%)   0.0525 ( 29.0%)   0.3549 ( 27.5%)   0.3585 ( 27.2%)  3653363330  DevirtSCCRepeatedPass
   0.0619 (  5.6%)   0.0073 (  4.0%)   0.0692 (  5.4%)   0.0711 (  5.4%)  868136227  InstCombinePass
   0.0601 (  5.4%)   0.0065 (  3.6%)   0.0666 (  5.2%)   0.0679 (  5.1%)  696430647  InlinerPass
   0.0363 (  3.3%)   0.0033 (  1.8%)   0.0396 (  3.1%)   0.0425 (  3.2%)  535426974  SimplifyCFGPass
   0.0280 (  2.5%)   0.0069 (  3.8%)   0.0348 (  2.7%)   0.0358 (  2.7%)  378716394  BlockFrequencyAnalysis
   0.0208 (  1.9%)   0.0049 (  2.7%)   0.0257 (  2.0%)   0.0262 (  2.0%)  283689627  BranchProbabilityAnalysis
   0.0239 (  2.1%)   0.0002 (  0.1%)   0.0241 (  1.9%)   0.0241 (  1.8%)  219122704  OpenMPOptCGSCCPass
   0.0174 (  1.6%)   0.0015 (  0.8%)   0.0189 (  1.5%)   0.0192 (  1.5%)  215583965  GVNPass
   0.0153 (  1.4%)   0.0025 (  1.4%)   0.0178 (  1.4%)   0.0187 (  1.4%)  184232295  EarlyCSEPass
...
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.9128 seconds (3.1027 wall clock)
...
```

And the generated schema file looks as follows:
```cpp
TORCH_LIBRARY(aten, m) {
  const std::vector<at::Tag> tags_0 = {at::Tag::pt2_compliant_tag};
  m.def("_cast_Byte(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Char(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Double(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Float(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Int(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Long(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Short(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_cast_Half(Tensor self, bool non_blocking=False) -> Tensor", tags_0);
  m.def("_backward(Tensor self, Tensor[] inputs, Tensor? gradient=None, bool? retain_graph=None, bool create_graph=False) -> ()", tags_0);
  m.def("set_data(Tensor(a!) self, Tensor new_data) -> ()", tags_0);
  m.def("data(Tensor self) -> Tensor", tags_0);
  m.def("is_leaf(Tensor self) -> bool", tags_0);
  m.def("output_nr(Tensor self) -> int", tags_0);
  m.def("_version(Tensor self) -> int", tags_0);
  m.def("requires_grad_(Tensor(a!) self, bool requires_grad=True) -> Tensor(a!)", tags_0);
  m.def("retain_grad(Tensor(a!) self) -> ()", tags_0);
  m.def("retains_grad(Tensor self) -> bool", tags_0);
  m.def("_fw_primal(Tensor(a) self, int level) -> Tensor(a)", tags_0);
  m.def("_make_dual(Tensor(a) primal, Tensor tangent, int level) -> Tensor(a)", tags_0);
  m.def("_unpack_dual(Tensor(a) dual, int level) -> (Tensor(a) primal, Tensor tangent)", tags_0);
  m.def("_new_zeros_with_same_feature_meta(Tensor self, Tensor other, *, int self_num_batch_dims=0) -> Tensor", tags_0);
  m.def("_has_same_storage_numel(Tensor self, Tensor other) -> bool", tags_0);
  const std::vector<at::Tag> tags_1 = {at::Tag::inplace_view, at::Tag::pt2_compliant_tag};
  m.def("rename_(Tensor(a!) self, Dimname[]? names) -> Tensor(a!)", tags_1);
  m.def("rename(Tensor(a) self, Dimname[]? names) -> Tensor(a)", tags_0);
  m.def("align_to(Tensor(a) self, Dimname[] names) -> Tensor(a)", tags_0);
  m.def("align_to.ellipsis_idx(Tensor(a) self, Dimname[] order, int ellipsis_idx) -> Tensor(a)", tags_0);
  m.def("align_as(Tensor self, Tensor other) -> Tensor", tags_0);
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114438
Approved by: https://github.com/zou3519
2023-11-27 23:33:04 +00:00
PyTorch MergeBot
3e1abde46d Revert "AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)"
This reverts commit a911b4db9d.

Reverted https://github.com/pytorch/pytorch/pull/111554 on behalf of https://github.com/DanilBaibak due to The lower PR in the stack #113926 breaks the internal build ([comment](https://github.com/pytorch/pytorch/pull/111554#issuecomment-1822472206))
2023-11-22 10:13:48 +00:00
Antonio Kim
7fc292930c Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` (see the sketch after this list)
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)
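
A minimal sketch of both additions (hedged: the kwarg names and TorchScript behavior are assumed to match the PR description):
```
import torch

g = torch.Generator().manual_seed(42)

# `generator` kwarg on torch.nn.init functions that sample randomly
w = torch.empty(3, 5)
torch.nn.init.uniform_(w, a=0.0, b=1.0, generator=g)

# torch.Generator as a TorchScript argument type
@torch.jit.script
def sample(gen: torch.Generator) -> torch.Tensor:
    return torch.rand([3], generator=gen)

print(sample(g))
```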

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-21 23:07:21 +00:00
voznesenskym
a911b4db9d AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
This should be enough to get @voznesenskym's FSDP branch to plumb `set_()` through AOTAutograd and have everything properly no-op out. Main changes are:

(1) graph break on `aten::set_.source_Tensor_storage_offset` (we could support it but it isn't needed, seems safer to graph break)

(2) Functionalization: add a "proper" functionalization kernel for `aten::set_.source_Tensor`. The previous one we had was codegen'd and it was wrong (it would just clone() and call set_(), which does not do the right thing). I also manually mark on the `FunctionalTensorWrapper` when a given tensor has been mutated by a `set_()` call.

(3) AOTAutograd: I added a new field, `InputAliasInfo.mutates_storage_metadata`, so we can distinguish between "regular" metadata mutations and metadata mutations due to `set_()` calls. This matters mainly at runtime: the former requires calling `as_strided_()` to fix up metadata, while the latter requires calling `set_()` (see the sketch below).

(4) Made AOTAutograd's detection of metadata mutations / `set_()` mutations smarter, so that it detects no-ops (when the storage and metadata are all the same).

I also killed `was_updated()` and `was_metadata_updated()`, replacing them with the existing `has_data_mutation()` plus a new storage-aware check, which together can more accurately distinguish between data mutations, `set_()` calls, and metadata mutations.
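
As a hedged illustration of the runtime distinction in (3), a hypothetical fixup helper (not the actual AOTAutograd runtime wrapper) might look like:
```
import torch

def apply_input_mutation(inp, updated, mutates_storage_metadata):
    # Hypothetical sketch; names and structure are illustrative only.
    if mutates_storage_metadata:
        # set_()-style mutations swap out the storage itself
        inp.set_(updated)
    else:
        # "regular" metadata mutations re-stride the same storage
        inp.as_strided_(updated.size(), updated.stride(), updated.storage_offset())
```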

**This PR is still silently incorrect in one case though**, which I'd like to discuss more. In particular, this example:
```
def f(x):
    x_view = x.view(-1)
    x.set_(torch.ones(2))
    x_view.mul_(2)
    return
```

If you have an input that experiences both a data mutation **and** an `x_old.set_(x_new)` call, there are two cases:

(a) the data mutation happened on the storage of `x_new`. This case should be handled automatically: if `x_new` is a graph intermediate then we will functionalize the mutation. If `x_new` is a different graph input, then we will perform the usual `copy_()` on that other graph input.

(b) the data mutation happened on the storage of `x_old`. This is more of a pain to handle, and doesn't currently work. At runtime, the right thing to do is probably something like:
```
def functionalized_f(x):
    x_view = x.view(-1)
    # set_() desugars into a no-op; later usages of x will use x_output
    x_output = torch.ones(2)
    # functionalize the mutation on x_view
    x_view_updated = x.mul(2)
    x_updated = x_view_updated.view(x.shape)
    # x experienced TWO TYPES of mutations: a data mutation and a metadata mutation
    # We need to return both updated tensors in our graph
    return x_updated, x_output
def runtime_wrapper(x):
    x_data_mutation_result, x_set_mutation_result = compiled_graph(x)
    # First, perform the data mutation on x's old storage
    x.copy_(x_data_mutation_result)
    # Then, swap out the storage of x with the new storage
    x.set_(x_set_mutation_result)
```

There are two things that make this difficult to do though:

(1) Functionalization: the functionalization rule for `set_()` will fully throw away the old `FunctionalStorageImpl` on the graph input. So if there are any mutations to that `FunctionalStorageImpl` later on in the graph, the current graph input won't know about it. Maybe we can have a given `FunctionalTensorWrapper` remember all previous storages that it had, and track mutations on all of them - although this feels pretty complicated.

(2) AOTAutograd now needs to know that we might have *two* graph outputs that correspond to a single "mutated input", which is annoying.

It's worth pointing out that this issue is probably extremely unlikely for anyone to run into - can we just detect it and error? This feels slightly easier than solving it, although not significantly easier. We would still need `FunctionalTensorWrapper` to keep track of mutations on any of its "previous" storages, so it can report this info back to AOTAutograd so we can raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111554
Approved by: https://github.com/ezyang
ghstack dependencies: #113926
2023-11-21 01:52:46 +00:00
Edward Z. Yang
8c4812be80 Replace expect_int with guard_int (#113921)
The idea is that instead of erroring, we will just specialize at these sites.

Fixes https://github.com/pytorch/pytorch/issues/113142

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113921
Approved by: https://github.com/zou3519
2023-11-20 21:27:48 +00:00
Brian Vaughan
dbb96ef30d improve annotation of device parameters where a device ordinal is allowed (#113647)
Using mypy in code that depends on pytorch, I noticed that the `device` parameter annotations don't allow a device ordinal.

`error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str | device"  [arg-type]`
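
A hedged sketch of the direction of the fix (the alias name here is illustrative, not necessarily the one used in the PR):
```
from typing import Union

import torch

# Widened union that also admits a device ordinal
DeviceLikeType = Union[str, torch.device, int]

def normalize(device: DeviceLikeType) -> torch.device:
    # torch.device accepts an int ordinal directly (0 -> cuda:0)
    return torch.device(device)

print(normalize("cpu"), normalize(0))
```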

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647
Approved by: https://github.com/albanD
2023-11-17 14:41:22 +00:00
Jane Xu
deec2380c7 Add 0dim Tensor overload for _foreach_div (#113688)
This PR basically just follows the steps from #106677, except that we add one feature. Similar to `fused_adam(w)`, for the CUDA dispatches: when the scalar tensor is on CPU, we call `.item()` and redispatch to the normal scalar overload. Otherwise, the CUDA kernel would complain about a mismatch in devices between the scalar and the tensors.

Why do we add this feature? Our optimizers want to allow `lr` to be a tensor, and `lr` could be a CPU tensor. `lr` is used with `foreach_div_` in Adam, so our CI would break otherwise.

After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs. only a GPU tensor), joining the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same for `foreach_add` (the only other foreach op with a `.Tensor` overload) because there is no use case and it would be more involved.
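
A minimal sketch of the new overload (CPU-only here; on CUDA inputs, a CPU scalar tensor is `.item()`'d and redispatched as described above):
```
import torch

params = [torch.ones(3), torch.ones(3)]
lr = torch.tensor(4.0)  # 0-dim scalar tensor, possibly on CPU

torch._foreach_div_(params, lr)  # the new .Tensor overload
print(params[0])  # tensor([0.2500, 0.2500, 0.2500])
```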

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113688
Approved by: https://github.com/mlazos, https://github.com/albanD
2023-11-15 20:59:32 +00:00
George White
6c187246d6 Add support for float8_e4m3fnuz and _e5m2fnuz (#107586)
This PR relates to the feature in [this feature submission](https://docs.google.com/document/d/1pF2T1xz54IPg1jG7FhykbrpbcJZVelQw0v8vBaoLkfs/edit). It is based on #104242, which adds similar float8 types.

These new types added in this PR are described in the paper at https://arxiv.org/abs/2206.02915. A brief description of the types and a comparison with other float8 types can also be found in the [OpenXLA RFC](https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md).
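
A quick sketch of the new dtypes in use (hedged; exact rounding behavior follows the paper and RFC above):
```
import torch

x = torch.tensor([0.5, 1.5, -2.0])
y = x.to(torch.float8_e4m3fnuz)
z = x.to(torch.float8_e5m2fnuz)

# "fn" = finite only (no inf); "uz" = unsigned zero (no -0; that bit
# pattern encodes NaN instead)
print(y.dtype, z.dtype)
print(y.float())  # upcast to inspect the rounded values
```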

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107586
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-11-15 15:01:11 +00:00
PyTorch MergeBot
252e68a83b Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 54493fe8c4.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))
2023-11-15 00:51:23 +00:00
Aaron Gokaslan
18d7b8e4f7 [BE]: apply ruff rule PLW1510 to find silent subprocess errors (#113644)
Reopens #111682, which I messed up due to a bad rebase that triggered some issues with the CLA check. This explicitly adds `check=True` or `check=False` to subprocess calls where appropriate.
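
For reference, the pattern PLW1510 flags and the two explicit alternatives:
```
import subprocess

# Flagged: a nonzero exit status is silently ignored
subprocess.run(["git", "status"])

# Explicit: raise CalledProcessError on failure
subprocess.run(["git", "status"], check=True)

# Explicit: document that failures are intentionally tolerated
subprocess.run(["git", "status"], check=False)
```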

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113644
Approved by: https://github.com/ezyang, https://github.com/kit1980
2023-11-14 20:59:40 +00:00