Commit Graph

247 Commits

Author SHA1 Message Date
Shangdi Yu
3e05a48927 Fix clamp type promotion in inductor decomposition (#154471)
Summary: As the title says, the clamp type promotion should take the min/max args into consideration as well.
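
For context, a minimal eager-mode illustration of the promotion rule the decomposition must match (values are illustrative):
```python
import torch

# Eager clamp participates in type promotion, and a float min/max promotes
# an integer input; the inductor decomposition must reproduce this.
x = torch.arange(4, dtype=torch.int32)
print(torch.clamp(x, min=0.5).dtype)  # torch.float32
```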

Test Plan:
```
buck run fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_clamp_decomposition_cpu
python test/inductor/test_torchinductor.py -k test_clamp -v
```

Differential Revision: D75490124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154471
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2025-05-28 23:24:25 +00:00
PyTorch MergeBot
d81217be2e Revert "Improve torch.ops typing (#153558)"
This reverts commit c5cba39d46.

Reverted https://github.com/pytorch/pytorch/pull/153558 on behalf of https://github.com/yangw-dev due to Your diff will not be landed to fbcode since we suspect it caused the following breakage in an internal test: [D75007157](https://www.internalfb.com/diff/D75007157). For instance: tests_gpu/lookup_gpu_index_test.py:232:8 Undefined attribute [16]: torch._ops._OpNamespace has no attribute simple_index_mm_batch ([comment](https://github.com/pytorch/pytorch/pull/153558#issuecomment-2892506789))
2025-05-19 23:32:36 +00:00
Benjamin Glass
c5cba39d46 Improve torch.ops typing (#153558)
Fixes a longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all remaining cases.

Decisions made along the way:

1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it was unnecessarily generalized. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.

The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
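
To illustrate decision 2, a generic sketch of the pattern (illustrative names, not the actual torch._ops code):
```python
class OpOverloadPacket:  # stand-in for torch._ops.OpOverloadPacket
    def __init__(self, name: str) -> None:
        self.name = name


class _OpNamespace:
    # Constants that used to be special-cased inside __getattr__ become
    # class variables, so normal attribute lookup finds them first.
    __file__ = "torch.ops"

    def __getattr__(self, name: str) -> OpOverloadPacket:
        # Only reached after every other lookup fails, so it can declare a
        # single, precise return type for type checkers.
        packet = OpOverloadPacket(name)  # the real code resolves the aten op
        setattr(self, name, packet)      # cache so later lookups skip this
        return packet


ns = _OpNamespace()
print(ns.some_op.name)  # type checkers now see OpOverloadPacket, not Any
```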

Test plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153558
Approved by: https://github.com/rec, https://github.com/Skylion007, https://github.com/cyyever
2025-05-19 14:52:32 +00:00
Zhang, Jianyi
1bc5762495 [Intel GPU][Inductor] Fallback embedding_dense_backward on XPU (#151637)
Reopens #146888; the modification now only affects the xpu device. We do not want to decompose embedding_dense_backward for torch.compile: current XPU devices have hardware limitations on atomic ops, so we fall back to eager, where sort can be used to implement this op. hf_T5 amp bf16 training in torchbench gets a 2x improvement on Max 1550. ~~I also align with cuda on gelu decomposition in _addmm_activation~~
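
A minimal sketch of the fallback mechanism, assuming the inductor convention that a decomposition returning NotImplemented falls back to the ATen kernel (not the literal diff):
```python
def embedding_dense_backward_decomp(grad_output, indices, num_weights,
                                    padding_idx, scale_grad_by_freq):
    if grad_output.device.type == "xpu":
        # Fall back to eager, where sort avoids the atomic-op limitation.
        return NotImplemented
    # ... otherwise decompose as before (elided)
```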

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151637
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
2025-05-19 02:19:37 +00:00
Pian Pawakapan
8ea95d2e73 [inductor] dtype promotion error in cat decomp (#152995)
Cloning a single tensor wasn't following dtype promotion rules.
Surfaced by the SAM model: https://github.com/pytorch/pytorch/issues/152606
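
A small eager-mode illustration of the rule in question (the repro shape is assumed, not taken from the SAM graph):
```python
import torch

# cat's dtype promotion considers every input, including empty tensors, so
# a decomposition that drops the empty tensor and merely clones `x` would
# wrongly return float16 instead of the promoted float32.
x = torch.ones(2, dtype=torch.float16)
empty = torch.empty(0, dtype=torch.float32)
print(torch.cat([x, empty]).dtype)  # torch.float32
```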

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152995
Approved by: https://github.com/yushangdi, https://github.com/eellison
2025-05-09 16:58:58 +00:00
PaulZhang12
84aa0985fb [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor (https://github.com/pytorch/pytorch/pull/149761) and enabling FP32 output from FP16/BF16 GEMM inputs (https://github.com/pytorch/pytorch/pull/150812), this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.
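
For intuition, a minimal sketch of the decompose-K transformation itself (not the inductor Triton template; the real kernels also accumulate the partials in fp32 for FP16/BF16 inputs):
```python
import torch

# Split the K dimension into `kparts` chunks, compute the partial products
# as a single bmm, then reduce over the chunks.
def decompose_k_mm(a: torch.Tensor, b: torch.Tensor, kparts: int) -> torch.Tensor:
    m, k = a.shape
    _, n = b.shape
    assert k % kparts == 0
    a3 = a.reshape(m, kparts, k // kparts).transpose(0, 1)  # (kparts, M, K/kparts)
    b3 = b.reshape(kparts, k // kparts, n)                  # (kparts, K/kparts, N)
    return torch.bmm(a3, b3).sum(dim=0)                     # reduce the partials

a, b = torch.randn(32, 256), torch.randn(256, 16)
print(torch.allclose(decompose_k_mm(a, b, 8), a @ b, atol=1e-5))  # True
```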

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton templates as well, without requiring tons more compile time (async compilation). Anecdotal evidence shows that Triton BMM usually performs better than aten BMM
* Add support for addmm
* Enable for inference and AOTI

Below are the results of running TritonBench on split-K shapes, comparing aten performance against pt2_triton, which now autotunes over decompose_k. We see a >10% average speedup over aten, and for some shapes over 3x the performance of the previous best Triton mm:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-03 02:23:54 +00:00
Laith Sakka
376529c78b consolidate guard_or_x and definitely_x (#152463)
definitely_true is almost the same as guard_or_false; the potential differences are not meaningful enough to justify the existence of both. The same goes for definitely_false, which can be expressed with guard_or_true and guard_or_false.
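
A usage sketch of the surviving semantics, assuming the `guard_or_false` helper in torch.fx.experimental.symbolic_shapes:
```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

# Under dynamic shapes, x.shape[0] == 1 may be unknown at compile time.
# guard_or_false returns False in that case instead of guarding or
# specializing -- effectively what definitely_true provided.
def maybe_squeeze(x):
    if guard_or_false(x.shape[0] == 1):
        return x.squeeze(0)
    return x
```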

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152463
Approved by: https://github.com/bobrenjc93
2025-05-02 18:08:11 +00:00
PyTorch MergeBot
7c3e679ddd Revert "[Inductor] Add decomposeK as an autotuning choice for mm (#150654)"
This reverts commit fdcfc6a61a.

Reverted https://github.com/pytorch/pytorch/pull/150654 on behalf of https://github.com/wdvr due to Failing ROCM tests: inductor/test_subgraph_choice.py::TestSubgraphChoice::test_subgraph_decompose_k [GH job link](https://github.com/pytorch/pytorch/actions/runs/14786111108/job/41515742446) [HUD commit link](3c54e0c216) ([comment](https://github.com/pytorch/pytorch/pull/150654#issuecomment-2846470409))
2025-05-02 06:31:38 +00:00
PaulZhang12
fdcfc6a61a [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor (https://github.com/pytorch/pytorch/pull/149761) and enabling FP32 output from FP16/BF16 GEMM inputs (https://github.com/pytorch/pytorch/pull/150812), this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton templates as well, without requiring tons more compile time (async compilation). Anecdotal evidence shows that Triton BMM usually performs better than aten BMM
* Add support for addmm
* Enable for inference and AOTI

Below are the results of running TritonBench on split-K shapes, comparing aten performance against pt2_triton, which now autotunes over decompose_k. We see a >10% average speedup over aten, and for some shapes over 3x the performance of the previous best Triton mm:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-01 23:01:30 +00:00
Laith Sakka
cbf8e0fb1a use statically known true instead of guard size oblivious in bmm and mm inductor decompositions (#148893)
This was discussed with @eellison, who recommended using statically_known_true here. The intuition: we already have 0/1 specializations in place, so if we reach these checks with dynamic shapes that are not already specialized, we do not want to specialize them ("a recompilation here is not justified").
These are all non-semantics-changing optimizations.
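
A sketch of the semantics being relied on, assuming `statically_known_true` from torch.fx.experimental.symbolic_shapes:
```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

# statically_known_true never inserts a guard: it returns True only when
# the expression is provable from facts already recorded, and False when
# it is unknown -- so an unspecialized dynamic size stays unspecialized
# and the optimization is simply skipped.
def can_skip_broadcast(batch_size) -> bool:
    return statically_known_true(batch_size == 1)
```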

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148893
Approved by: https://github.com/eellison
2025-04-28 16:44:25 +00:00
Manuel Candales
f38566dfe4 [MPSInductor] Disable mm/bmm decompositions (#150541)
Disables mm/bmm decompositions.
torch.compile on MPS was speeding up stories15M (~4x) but making stories110M much slower.

Self-contained reproducer to demonstrate the difference (before the change; after it, the two timings should be identical):
```python
import torch
import timeit

def bench_mm(f, x, y):
    from torch.utils.benchmark import Timer
    return Timer(stmt="f(x, y); torch.mps.synchronize()",
                 globals={"x": x, "y": y, "f": f},
                  language="python", timer=timeit.default_timer).blocked_autorange()

x = torch.rand(1024, 512, device='mps')
y = torch.rand(512, 1, device='mps')

mm_c = torch.compile(torch.mm, options={"coordinate_descent_tuning": False})
mm_c_cdt = torch.compile(torch.mm, options={"coordinate_descent_tuning": True})

print(f"Compiled torch.mm perf (with cdt disabled) for 1024x512 and  512x1 matrices are {bench_mm(mm_c, x, y).median}")
print(f"Compiled torch.mm perf (with cdt enabled) for 1024x512 and  512x1 matrices are {bench_mm(mm_c_cdt, x, y).median}")
```

Disabling the inductor mm decomposition speeds up stories15M further (~6x) and speeds up stories110M (~7x).
The table below shows average tokens/sec across 5 runs on an M1 Pro for stories15M and stories110M:

|                       | stories15M | stories110M |
|-----------------------|------------|-------------|
| without compile       | 99.40      | 53.11       |
| compile before change | 367.68     | 19.43       |
| compile after change  | 582.96     | 355.07      |

stories110M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 53.11
```

stories110M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 19.43
```

stories110M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories110M/stories110M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 355.07
```

stories15M (without compile)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories15M/stories15M.pt --prompt "Once upon a time" --device mps
[...]
Average tokens/sec: 99.40
```

stories15M (compile before change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories15M/stories15M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 367.68
```

stories15M (compile after change)
```
(gptfast) mcandales@mcandales-mbp gpt-fast % python generate.py --checkpoint_path checkpoints/stories15M/stories15M.pt --prompt "Once upon a time" --device mps --compile
[...]
Average tokens/sec: 582.96
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150541
Approved by: https://github.com/malfet
2025-04-02 16:07:18 +00:00
Isuru Fernando
82ceebce58 [inductor] Lowerings for max_pool3d (#148210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148210
Approved by: https://github.com/eellison
2025-04-02 14:13:01 +00:00
Scott Wolchok
dc39e673e2 Remove aten.elu core ATen decomp because it is now core ATen (#149780)
Per @larryliu0820.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149780
Approved by: https://github.com/larryliu0820
2025-03-25 01:59:57 +00:00
Isuru Fernando
66b0a0b61a [inductor] support dilation in max_pool2d lowering (#148209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148209
Approved by: https://github.com/eellison
2025-03-24 13:00:12 +00:00
Xuehai Pan
1cb4e2df65 [BE][PYFMT] migrate PYFMT for torch._inductor to ruff format (#144550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550
Approved by: https://github.com/jansel
2025-02-28 13:33:19 +00:00
Aaron Orenstein
893ca1dfe1 PEP585 update - torch/_inductor/[_-i]* (#145137)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137
Approved by: https://github.com/bobrenjc93
2025-01-19 01:22:47 +00:00
Tom Ritchford
46fbd63405 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-17 18:21:22 +00:00
bobrenjc93
a3ab27b8e0 Migrate from Tuple -> tuple in torch/_inductor (#144264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144264
Approved by: https://github.com/eellison
2025-01-07 03:27:27 +00:00
Aaron Orenstein
45ef3309e3 [BE] typing for decorators (#144161)
Summary:
Untyped decorators strip annotations from the decorated items (see the sketch after the file list).

- _compile
- _inductor/fx_passes/post_grad
- _inductor/lowering
- _library/custom_ops
- _meta_registrations
- _ops
- _refs/nn/functional
- ao/quantization/quantizer/xnnpack_quantizer_utils
- distributed/_composable/contract
- fx/experimental/graph_gradual_typechecker
- fx/experimental/migrate_gradual_types/constraint_generator
- optim/optimizer
- signal/windows/windows
- testing/_internal/common_device_type
- torch/_inductor/decomposition
- utils/flop_counter
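
The general shape of the fix, as a generic sketch (not code from this diff): typing a decorator with ParamSpec preserves the wrapped function's signature.
```python
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")

def typed_decorator(fn: Callable[P, R]) -> Callable[P, R]:
    # Because the decorator is typed with ParamSpec, the decorated item
    # keeps its full annotations for callers instead of collapsing to Any.
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        return fn(*args, **kwargs)
    return wrapper
```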

Test Plan: unit tests

Differential Revision: D62302684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-04 16:40:09 +00:00
Michael Lazos
8960cb5809 Add support for bfloat16 atomic adds in fbcode (#143629)
Reland of https://github.com/pytorch/pytorch/pull/141857, with a fallback on A100, which doesn't have bfloat16 atomic add instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143629
Approved by: https://github.com/eellison
2024-12-20 23:05:13 +00:00
Michael Lazos
b4e0e3bfa3 Backout D66648013 (#143433)
Summary:
Backing out https://www.internalfb.com/diff/D66648013 (see the comments there for justification).

I will reland and disallow the bfloat16 atomics behavior on A100 because it causes a pretty significant performance regression.

Test Plan: This is a revert

Differential Revision: D67357485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143433
Approved by: https://github.com/davidberard98
2024-12-19 00:53:49 +00:00
Michael Lazos
a3abe1a5ae Add support for bfloat16 atomic adds in fbcode (#141857)
This adds support for bfloat16 atomic add in fbcode (OSS will have to wait until those changes are upstreamed to triton)

Originally I attempted to write inline asm, but the triton API was not flexible enough to support this use case. In the long run the right answer is to implement this properly in OSS triton.

Relevant issues:
* https://github.com/pytorch/pytorch/issues/137425 in fbcode only
* https://github.com/pytorch/pytorch/issues/97016
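
For context, a sketch of the kind of compiled op that needs these atomics (assumes a CUDA device; the repro shape is illustrative):
```python
import torch

# Scatter-style accumulation with duplicate indices on a bf16 tensor makes
# the generated Triton kernel emit atomic adds (see the linked issues).
def scatter_accumulate(out, idx, src):
    out.index_put_((idx,), src, accumulate=True)
    return out

out = torch.zeros(8, device="cuda", dtype=torch.bfloat16)
idx = torch.tensor([0, 0, 3], device="cuda")
src = torch.ones(3, device="cuda", dtype=torch.bfloat16)
torch.compile(scatter_accumulate)(out, idx, src)
```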

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141857
Approved by: https://github.com/eellison
2024-12-10 11:40:15 +00:00
IvanKobzarev
f85e238186 [aotd] capture rrelu_with_noise noise mutation in compile (#141867)
Rebase-copy of the long-standing, already-approved PR https://github.com/pytorch/pytorch/pull/138503, which was blocked from landing by xla build issues.

Opened a new PR with the same content (ghstack checkout was failing due to changed submodules).

Corresponding xla PR:
https://github.com/pytorch/xla/pull/8363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141867
Approved by: https://github.com/bdhirsh
2024-12-04 12:18:58 +00:00
Chien-Lin Chen
161425ff9f Added aten.bernoulli.p and aten.bernoulli.default decompositions (#139141)
Fixes #105519

Added the aten.bernoulli.p decomposition and moved/rewrote aten.bernoulli.default so that both are included in the core ATen decompositions.

Tested the sample code in [#105519](https://github.com/pytorch/pytorch/issues/105519); torch.bernoulli is decomposed by the code snippet as expected.
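
A minimal sketch of the assumed formulation of the bernoulli.p decomposition (not the literal diff):
```python
import torch
from torch import Tensor

# Sample uniform noise and threshold it at p; cast back to the input dtype.
def bernoulli_p(self: Tensor, p: float = 0.5) -> Tensor:
    return (torch.rand_like(self, dtype=torch.float32) < p).to(self.dtype)

print(bernoulli_p(torch.empty(5), p=0.3))
```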

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139141
Approved by: https://github.com/eellison
2024-11-20 19:52:57 +00:00
eellison
34e420519d [Reland] dont decompose baddbmm (#141045)
Previously the decomposition would upcast inputs to fp32, which led to a slowdown compared to eager, which runs in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics, because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.
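
Roughly what the removed decomposition computed (a sketch, not the literal inductor code):
```python
import torch

# With fp16 inputs, type promotion in this epilogue upcast the whole
# computation to fp32, whereas eager baddbmm runs a fused fp16 kernel
# (only the accumulator is fp32).
def decomposed_baddbmm(inp, batch1, batch2, *, beta=1, alpha=1):
    return beta * inp + alpha * torch.bmm(batch1, batch2)
```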

Fix for https://github.com/pytorch/pytorch/issues/137897

Reland of https://github.com/pytorch/pytorch/pull/137904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141045
Approved by: https://github.com/BoyuanFeng
2024-11-19 21:07:58 +00:00
Masaki Kozuki
6a368b3fc5 Add ScalarList overload to _foreach_lerp (#134482)
Related:
- https://github.com/pytorch/pytorch/issues/133367
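
A sketch of the call shape the new overload enables (one scalar weight per tensor pair; values are illustrative):
```python
import torch

xs = [torch.zeros(3), torch.zeros(3)]
ys = [torch.ones(3), torch.ones(3)]
# Previously the weight was a single scalar shared by all pairs; the
# ScalarList overload accepts a per-pair list.
print(torch._foreach_lerp(xs, ys, [0.25, 0.75]))
```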

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482
Approved by: https://github.com/janeyx99
2024-11-12 19:03:41 +00:00
leslie-fang-intel
d84a344410 [Inductor] Skip coordinate_descent_tuning for mm/bmm decomposition on CPU (#139537)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/138823: `coordinate_descent_tuning` doesn't help on CPU, where we prefer lowering `mm`/`bmm` to ATen kernels or the CPP GEMM template.
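
For reference, a usage sketch of the knob in question (after this change, the flag no longer routes CPU `mm`/`bmm` through the tuned decomposition):
```python
import torch

# On CPU inputs, mm now lowers to ATen kernels or the CPP GEMM template
# regardless of this option.
mm_c = torch.compile(torch.mm, options={"coordinate_descent_tuning": True})
x = torch.randn(64, 64)  # CPU tensors
print(mm_c(x, x).shape)
```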

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_coordinate_descent_tuning
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139537
Approved by: https://github.com/jansel
2024-11-03 10:10:29 +00:00
PyTorch MergeBot
38645e8a3e Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 8aedc649bd.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))
2024-10-29 04:54:37 +00:00
PyTorch MergeBot
6aef58a249 Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit c066f4a055.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the test is failing in trunk, maybe a landrace? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2443158194))
2024-10-29 04:08:11 +00:00
eellison
c066f4a055 Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32, which led to a slowdown compared to eager, which runs in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics, because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-29 00:54:29 +00:00
Tom Ritchford
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
Tom Ritchford
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476
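
A sketch of what the decomposition computes (assumed formulation):
```python
import torch

# permute_copy is the out-of-place copy variant of permute: a permuted
# view materialized into a contiguous tensor.
def permute_copy(x: torch.Tensor, dims) -> torch.Tensor:
    return x.permute(dims).clone(memory_format=torch.contiguous_format)

print(permute_copy(torch.randn(2, 3, 4), (2, 0, 1)).shape)  # (4, 2, 3)
```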

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
PyTorch MergeBot
af306a392c Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit 7a117f3b3e.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to unfortunately the failures on the previous import are still present on the current one D64568703 ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2422789143))
2024-10-18 16:01:01 +00:00
eellison
7a117f3b3e Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32, which led to a slowdown compared to eager, which runs in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics, because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-17 19:24:54 +00:00
PyTorch MergeBot
5254a0d383 Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit cef6c3dcb0.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to failing internal tests D64418200, some results not within tolerance? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2418122735))
2024-10-16 23:16:44 +00:00
eellison
cef6c3dcb0 Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32, which led to a slowdown compared to eager, which runs in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics, because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-15 14:54:56 +00:00
Benjamin Glass
a968576777 Add lowering for aten.searchsorted (#135701)
Adds lowering for `aten.searchsorted`. This entails:

1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors (sketched below).
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.
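
Item 5, sketched under assumed semantics (wrap the scalar and reuse the Tensor path):
```python
import torch

def searchsorted_scalar(sorted_seq: torch.Tensor, value: float, *, right: bool = False):
    v = torch.tensor([value], dtype=sorted_seq.dtype, device=sorted_seq.device)
    return torch.searchsorted(sorted_seq, v, right=right).squeeze(0)

print(searchsorted_scalar(torch.tensor([1.0, 2.0, 4.0]), 3.0))  # tensor(2)
```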

Closes #135873

Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
2024-10-04 19:26:05 +00:00
Isuru Fernando
ef6fd3d780 Fix adaptive_max_pool2d fallback (#136367)
Fixes #136332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136367
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-01 16:20:34 +00:00
Huamin Li
fd494dd426 Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix (#135401)
Summary: In https://github.com/pytorch/pytorch/pull/134232, we added two new ops, wrapped_linear_prepack and wrapped_quantized_linear_prepacked. Following review comments and offline discussion, we are making them private by adding `_` as a prefix.

Differential Revision: D62325142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401
Approved by: https://github.com/houseroad
2024-09-08 04:16:24 +00:00
chilli
23a2161ad1 Changed addmv to be a decomposition and not a fallback (#134823)
Overall this seems to be faster:

![image](https://github.com/user-attachments/assets/0cbea76e-fb78-4634-9265-047de0291549)
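
For reference, the decomposition in formula form (a sketch of the standard addmv definition, not the literal diff):
```python
import torch

# addmv(self, mat, vec) = beta * self + alpha * (mat @ vec)
def addmv_decomp(self, mat, vec, *, beta=1.0, alpha=1.0):
    return beta * self + alpha * torch.mv(mat, vec)

s, m, v = torch.randn(3), torch.randn(3, 4), torch.randn(4)
print(torch.allclose(addmv_decomp(s, m, v), torch.addmv(s, m, v)))  # True
```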

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134823
Approved by: https://github.com/jansel
ghstack dependencies: #134813, #134818, #134819
2024-09-03 06:33:31 +00:00
Huamin Li
ccafc93be5 [AOTI][CPU] Make int8 qlinear work (#134368)
Summary:
This diff decomposes torch.ops._quantized.wrapped_quantized_linear into torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked for AOTI, and adds the corresponding impls to the shim.

It works similarly to what we did previously for fbgemm fp16 dynamic qlinear: we constant-fold the packed weight at runtime (warm-up) to achieve the speedup.

Reviewed By: desertfire

Differential Revision: D61396144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134368
Approved by: https://github.com/houseroad
2024-08-24 08:25:25 +00:00
eellison
baa4c9ca46 Optimize aten.cat calls of a repeated element (#132081)
This was a particular problem for a model I saw that had a large number of repeats, making compilation slow.
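
A sketch of the rewrite idea (an illustrative helper, not the actual inductor pass):
```python
import torch

# Concatenating the same tensor n times can be a single expand + reshape,
# avoiding a cat node with many repeated inputs.
def cat_repeated(t: torch.Tensor, n: int, dim: int = 0) -> torch.Tensor:
    dim = dim % t.dim()
    out_shape = list(t.shape)
    out_shape[dim] *= n
    expanded = t.unsqueeze(dim).expand(*t.shape[:dim], n, *t.shape[dim:])
    return expanded.reshape(out_shape)

t = torch.randn(2, 3)
print(torch.equal(cat_repeated(t, 4, dim=1), torch.cat([t] * 4, dim=1)))  # True
```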

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132081
Approved by: https://github.com/shunting314
2024-07-30 02:56:00 +00:00
Tom Ritchford
962f248437 Add decomposition for expand_copy (#130940)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940
Approved by: https://github.com/peterbell10
2024-07-29 16:23:56 +00:00
Adnan Akhundov
33069630ce [inductor] Add type hints to functions in decompositions.py (#131780)
Summary: As titled.

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131780
Approved by: https://github.com/eellison
2024-07-26 04:50:23 +00:00
Aaron Orenstein
5a0068cc69 [BE] mypy: disallow untyped decorators (#131428)
Untyped decorators strip the types from the function they decorate, so even if the underlying function is fully typed, callers to it get no benefit from the type annotations.

Step 1 - Enable the error and override in all the offending files.

#131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428
Approved by: https://github.com/justinchuby, https://github.com/oulgen
2024-07-23 21:50:55 +00:00
Xuehai Pan
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
eellison
67c6941b4e Update torch.cat decomp for 0-dim (#130763)
Fix for https://github.com/pytorch/pytorch/issues/130615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130763
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2024-07-16 13:34:01 +00:00
PyTorch MergeBot
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb21098.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
Tom Ritchford
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00
Isuru Fernando
c12a4f2e65 Add decomposition for slice_scatter (#123744)
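
What the decomposition computes, shown with the public op (values are illustrative):
```python
import torch

# slice_scatter writes `src` over a strided slice of `self`, out-of-place:
# clone, then assign into the slice.
self_t = torch.zeros(4, 6)
src = torch.ones(4, 3)
out = torch.slice_scatter(self_t, src, dim=1, start=0, end=6, step=2)
print(out[0])  # tensor([1., 0., 1., 0., 1., 0.])
```
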
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123744
Approved by: https://github.com/peterbell10
2024-06-28 17:02:10 +00:00