180 Commits

Author SHA1 Message Date
leslie-fang-intel
d83ab88f81 [Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445. To avoid depending on those changes, we can move the optimization of `decomposed quant/dequant` from the Inductor decomposition phase into the lowering phase. This lets us:

- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.

**Changes in this PR**

- Move the optimization of `decomposed quant/dequant` from the Inductor decomposition phase into the lowering phase.
- Update the quantization pattern matcher accordingly so that the change is not BC-breaking.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-05-09 08:40:44 +00:00
PyTorch MergeBot
ea3f625e32 Revert "[Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)"
This reverts commit 33e6791645.

Reverted https://github.com/pytorch/pytorch/pull/124041 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think there is a land race with the change 33e6791645 ([comment](https://github.com/pytorch/pytorch/pull/124041#issuecomment-2101766558))
2024-05-09 01:34:19 +00:00
leslie-fang-intel
33e6791645 [Inductor] [Quant] Enable lowering of quant per tensor and refactor quant pattern (#124041)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445. To avoid depending on those changes, we can move the optimization of `decomposed quant/dequant` from the Inductor decomposition phase into the lowering phase. This lets us:

- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not been decomposed.

**Changes in this PR**

- Move the optimization of `decomposed quant/dequant` from the Inductor decomposition phase into the lowering phase.
- Update the quantization pattern matcher accordingly so that the change is not BC-breaking.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-05-09 00:54:22 +00:00
Andrew M. James
445a0c01da Retry: Low mem max_pool2d_with_indices (#122832)
Based on #105687

The low-memory path does not strictly need to return the int8 offsets.
Instead, the offset-to-index computation can be separated from the
inner function of the max pool lowering. The partitioner can then choose
to move the offset-to-index computation into the backward pass.
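
As a rough illustration of what an offset-to-index computation might look like, the sketch below converts a small per-window offset into a flat input index. The function name `offset_to_index` and the row-major encoding of the offset within the kernel window are assumptions made for this example; the commit does not spell out the encoding.

```python
def offset_to_index(offset, out_h, out_w, kernel, stride, padding, dilation, in_width):
    """Map a per-window offset to a flat index into one (H, W) input plane.

    Assumption: `offset` enumerates window positions in row-major order,
    so it fits in an int8 whenever kernel_h * kernel_w <= 127.
    """
    _, kernel_w = kernel
    dh, dw = divmod(offset, kernel_w)
    in_h = out_h * stride[0] - padding[0] + dh * dilation[0]
    in_w = out_w * stride[1] - padding[1] + dw * dilation[1]
    return in_h * in_width + in_w


# 4x4 input, 2x2 window, stride 2: the bottom-right element of the
# bottom-right window is the last element of the plane.
print(offset_to_index(3, 1, 1, (2, 2), (2, 2), (0, 0), (1, 1), in_width=4))  # 15
```

Because this step depends only on the offset and static pooling parameters, it can run wherever the index is actually needed, which is what makes deferring it to the backward pass possible.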

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122832
Approved by: https://github.com/peterbell10, https://github.com/eellison
2024-05-08 19:37:08 +00:00
Isuru Fernando
4d5f8070c4 add a decomposition for select_scatter (#124426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124426
Approved by: https://github.com/peterbell10
2024-05-01 03:23:18 +00:00
Kazuaki Ishizaki
9fec26e231 Fix typo under torch/_inductor directory (#119658)
This PR fixes typos in comments and messages under the `torch/_inductor` directory and updates the corresponding test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119658
Approved by: https://github.com/colesbury
2024-04-30 22:28:56 +00:00
Bin Bao
bb37910e30 [AOTI] Fixes ScatterFallback codegen (#124580)
Summary: For https://github.com/pytorch/pytorch/issues/123184. ScatterFallback currently relies on op name matching for codegen, which makes its cpp codegen fragile. Refactor to use op_overload and fix the relevant unit test failures.

Differential Revision: [D56417815](https://our.internmc.facebook.com/intern/diff/D56417815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124580
Approved by: https://github.com/chenyang78
2024-04-22 20:47:26 +00:00
Pearu Peterson
43b4ac956e Add index_reduce decomposition (#122579)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122579
Approved by: https://github.com/peterbell10
ghstack dependencies: #123375
2024-04-18 01:30:47 +00:00
Honglin Zhu
78824fd212 [inductor] Fix recompiles bug for torch.full (#123811)
Fixes #123810

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123811
Approved by: https://github.com/peterbell10
2024-04-12 00:07:47 +00:00
Peter Bell
9189d04cb1 [inductor] Add explicit ops.fma and use it in softmax_backward (#122518)
This allows us to generate an fma even when fp-fusion is disabled
in the compiler.
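
To show why an explicit fused multiply-add can matter numerically, here is a small pure-Python sketch. It does not use Inductor's `ops.fma`; the helper `fused_mul_add` is a hypothetical stand-in that emulates single rounding with exact rational arithmetic.

```python
from fractions import Fraction


def unfused_mul_add(a: float, b: float, c: float) -> float:
    # Two roundings: the product is rounded to a float, then the sum is.
    return a * b + c


def fused_mul_add(a: float, b: float, c: float) -> float:
    # One rounding: compute a*b + c exactly, then round once at the end.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(unfused_mul_add(a, b, c))  # 0.0: the 2**-60 term is lost when a*b is rounded
print(fused_mul_add(a, b, c))    # equals -(2.0**-60): the small term survives
```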

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122518
Approved by: https://github.com/lezcano, https://github.com/Chillee
2024-04-06 02:15:16 +00:00
PyTorch MergeBot
16cb5d48dd Revert "[inductor] Add explicit ops.fma and use it in softmax_backward (#122518)"
This reverts commit 05984e642b.

Reverted https://github.com/pytorch/pytorch/pull/122518 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it starts failing in trunk 05984e642b ([comment](https://github.com/pytorch/pytorch/pull/122518#issuecomment-2038631010))
2024-04-05 02:09:32 +00:00
Peter Bell
05984e642b [inductor] Add explicit ops.fma and use it in softmax_backward (#122518)
This allows us to generate an fma even when fp-fusion is disabled
in the compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122518
Approved by: https://github.com/lezcano, https://github.com/Chillee
ghstack dependencies: #121924
2024-04-04 20:53:14 +00:00
Gao Tianlin
aaef246c74 remove log2 decomposition; add log2 lowering (#123112)
This is done for the same reason as `log10`: `log2` is a core ATen op, so we should not decompose it. As https://github.com/pytorch/pytorch/pull/110882 suggested, it often maps to a hardware intrinsic. Furthermore, decomposing it negatively impacts the numerical precision of the output.
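
The precision concern can be sketched in plain Python. The decomposed form below (natural log multiplied by a precomputed `1/ln(2)`) is an assumed stand-in for the removed decomposition, not a copy of it; `log2_decomposed` is a hypothetical name.

```python
import math


def log2_decomposed(x: float) -> float:
    # Hypothetical decomposition: natural log scaled by the rounded constant 1/ln(2).
    return math.log(x) * (1.0 / math.log(2.0))


# Compare both routes on exact powers of two, where the true answer is an integer.
mismatches = [k for k in range(1, 64) if log2_decomposed(2.0**k) != float(k)]
direct_ok = all(math.log2(2.0**k) == float(k) for k in range(1, 64))

print("direct log2 exact on all powers of two:", direct_ok)
print("exponents where the decomposed form is inexact:", mismatches)
```

The decomposed route goes through two rounded intermediate values, so it may miss exact results that a dedicated `log2` returns directly.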

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123112
Approved by: https://github.com/peterbell10
2024-04-02 16:16:26 +00:00
Mu-Chu Lee
4b725e1619 [AOTInductor] Support quantized linear on CPU with fbgemm (#123069)
Summary:
Added support for quantized linear on CPU with fbgemm.
Specifically, for torch.ops.quantized.linear_unpacked_dynamic_fp16, we
decompose it into two steps: packing the weight, then calling fbgemm's
qlinear with the packed weight.

Test Plan:
Included in commit.
test_aot_inductor::test_quantized_linear

Differential Revision: [D55577959](https://our.internmc.facebook.com/intern/diff/D55577959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123069
Approved by: https://github.com/hl475
2024-04-01 09:15:05 +00:00
andrewor14
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
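
The idea of one op hiding the backend choice can be pictured with a toy selector. Everything here is hypothetical: `pick_batch_norm_kernel` is not a PyTorch function, and the preference order among kernels is an assumption made only for illustration.

```python
def pick_batch_norm_kernel(device: str, cudnn_available: bool = False,
                           miopen_available: bool = False) -> str:
    """Hypothetical model of backend selection; not the real dispatcher."""
    if device == "cpu":
        return "_batch_norm_cpu"
    if device == "cuda":
        # Assumed preference order: vendor library first, native kernel last.
        if miopen_available:
            return "miopen_batch_norm"
        if cudnn_available:
            return "cudnn_batch_norm"
        return "_batch_norm_cuda"
    raise ValueError(f"unsupported device: {device}")


print(pick_batch_norm_kernel("cuda", cudnn_available=True))  # cudnn_batch_norm
```

Under this model, a graph exported on CPU only records the single high-level op, so the kernel decision is made again when the model later runs on CUDA.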

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
andrewor14
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
Shunting Zhang
1ce5049692 [inuctor] fix the layout problem for nll_loss2d_backward (#121173)
Fixes https://github.com/pytorch/pytorch/issues/120759 .

The CUDA implementation of nll_loss2d_backward.default requires the 'self' tensor to be contiguous. This implicit assumption may be broken by layout optimizations. The fix here is to add the constraint when we explicitly define the fallback for the op.

Not sure if we can improve the CUDA kernel to relax the constraint, though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121173
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-07 09:05:07 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
Tugsbayasgalan Manlaibaatar
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
Pearu Peterson
c06499981d Add a decomposition for torch.put, 2. (#120179)
As in the title. It is an updated copy of https://github.com/pytorch/pytorch/pull/115306 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120179
Approved by: https://github.com/lezcano, https://github.com/peterbell10, https://github.com/jgong5
2024-03-04 14:37:30 +00:00
Xia, Weiwen
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature targets the PyTorch 2.3 release.
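
For readers unfamiliar with the `choose_qparams -> q -> dq` part of the pattern, the following pure-Python sketch models dynamic per-tensor quantization to unsigned 8-bit. It is a simplified illustration, not the implementation of the real ops: the helper names, the rounding mode, and the min/max-based scale are all assumptions.

```python
def choose_qparams(values, qmin=0, qmax=255):
    # The representable range must contain 0.0 so that zero quantizes exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    return [(q - zero_point) * scale for q in q_values]


x = [-1.0, 0.0, 0.5, 2.0]
scale, zero_point = choose_qparams(x)  # computed at run time from the data
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat)
```

Because the scale and zero point come out of a computation on the activation, they are values produced at run time rather than constants, which is the situation the `.tensor` variant described above addresses.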

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparams` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
- latency for shape (128, 128) = 0.153 ms
- latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
Edward Z. Yang
2a08a51738 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
ghstack dependencies: #120800
2024-03-01 05:06:36 +00:00
angelayi
759204253f [export] Change runtime asserts to using assert_scalar (#119608)
By changing runtime symbolic asserts to using assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit during runtime, but that means if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data dependent calls to the fake tensors may result in GuardOnDataDependentSymNode errors.

https://github.com/pytorch/pytorch/issues/119587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
2024-02-26 17:56:12 +00:00
Angela Yi
6d82a7e9b0 Add pixel_shuffle to core aten decomps (#120092)
Summary:
https://github.com/pytorch/pytorch/pull/118239 added a decomposition
for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We
have also fixed the internal use case so that it no longer special cases on
pixel_shuffle, allowing us to revert the changes in
https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53860966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120092
Approved by: https://github.com/ydwu4
2024-02-20 18:37:32 +00:00
PyTorch MergeBot
86dedebeaf Revert "Add pixel_shuffle to core aten decomps (#119899)"
This reverts commit 9201d7335a.

Reverted https://github.com/pytorch/pytorch/pull/119899 on behalf of https://github.com/huydhn due to Sorry for reverting your change but keep the diff D53766709 around while investigating the failed tests is not a good practice and could lead to out of sync issue, so it is better to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/119899#issuecomment-1948970686))
2024-02-16 17:44:59 +00:00
PyTorch MergeBot
47300221c2 Revert "[export] Change runtime asserts to using assert_scalar (#119608)"
This reverts commit f4d641ba2f.

Reverted https://github.com/pytorch/pytorch/pull/119608 on behalf of https://github.com/huydhn due to This break ONNX trunk job 65fd8b6730 ([comment](https://github.com/pytorch/pytorch/pull/119608#issuecomment-1947436402))
2024-02-15 22:25:24 +00:00
angelayi
f4d641ba2f [export] Change runtime asserts to using assert_scalar (#119608)
By changing runtime symbolic asserts to using assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit during runtime, but that means if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data dependent calls to the fake tensors may result in GuardOnDataDependentSymNode errors.

https://github.com/pytorch/pytorch/issues/119587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
2024-02-15 07:13:42 +00:00
Angela Yi
9201d7335a Add pixel_shuffle to core aten decomps (#119899)
Summary: https://github.com/pytorch/pytorch/pull/118239 added a decomposition for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We have also fixed the internal use case so that it no longer special cases on pixel_shuffle, allowing us to revert the changes in https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53766709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119899
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-02-14 21:01:11 +00:00
Edward Z. Yang
c2522554dd Prevent DCE'ing unbacked SymInt for view outputs (#119552)
Fixes https://github.com/pytorch/pytorch/issues/119414

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119552
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-02-13 16:32:21 +00:00
Edward Z. Yang
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.
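
A toy model of the second bullet may help. The sketch below is not the ShapeEnv implementation; `static_eq` and `eval_eq` are hypothetical helpers that only mimic the idea of tightening the lower bound of size-like symbols for a single size-oblivious evaluation.

```python
import math


def static_eq(lo, hi, const):
    """Return True/False if `sym == const` is decidable from [lo, hi], else None."""
    if lo == hi == const:
        return True
    if const < lo or const > hi:
        return False
    return None  # undecidable: a real guard would be required


def eval_eq(sym_range, const, *, size_like, size_oblivious=False):
    lo, hi = sym_range
    if size_oblivious and size_like:
        # Only for this one evaluation, pretend the size-like symbol is >= 2.
        lo = max(lo, 2)
    return static_eq(lo, hi, const)


u0 = (0, math.inf)  # unbacked SymInt with its faithful runtime range
print(eval_eq(u0, 1, size_like=True))                       # None
print(eval_eq(u0, 1, size_like=True, size_oblivious=True))  # False
```

The faithful range leaves `u0 == 1` undecided, while the size-oblivious evaluation answers `False` without changing the stored range of the symbol.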

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a "guard on unbacked SymInt" problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them wherever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
angelayi
1adedc3c86 [decomp] Remove pixel_shuffle from core aten decomps (#118921)
pixel_shuffle is a core aten op
(https://pytorch.org/docs/main/torch.compiler_ir.html#core-aten-ir) so we should not decompose it.

https://github.com/pytorch/pytorch/pull/118239 added a decomp for it which is causing an internal test failure
(https://www.internalfb.com/intern/test/281475090561210/) which cases on the pixel_shuffle operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118921
Approved by: https://github.com/SherlockNoMad, https://github.com/lezcano
2024-02-03 08:21:32 +00:00
Mengwei Liu
1a8545164a [export] Add unit test for SDPA export result (#117390)
Summary:

A follow-up for #117097. In that PR I didn't add
`_scaled_dot_product_attention_for_cpu` to the core_aten_decomposition
table. This PR does that and also adds a unit test.

Test Plan: python test/export/test_export.py -k
test_scaled_dot_product_attention


Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
2024-01-14 00:21:28 +00:00
Yanbo Liang
abd80cbb15 [Inductor] Decompose bmm if batch2's last dim size is 1 and coordinate_descent_tuning is enabled (#116582)
We found this perf optimization opportunity at https://github.com/pytorch-labs/gpt-fast/pull/71. This would bring 5%+ perf gain for Mixtral 8x7B on gpt-fast.
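
The arithmetic behind this kind of decomposition can be sketched on nested Python lists. This is an assumed illustration of the idea (multiply and reduce instead of a matmul when the second operand has a single column), not the code from the PR; both function names are hypothetical.

```python
def bmm(a, b):
    # Reference batched matmul on nested lists: a is [B][M][K], b is [B][K][N].
    return [
        [[sum(a_row[k] * b_mat[k][n] for k in range(len(b_mat)))
          for n in range(len(b_mat[0]))]
         for a_row in a_mat]
        for a_mat, b_mat in zip(a, b)
    ]


def bmm_last_dim_one(a, b):
    # b is [B][K][1]: multiply each row of a by the single column of b,
    # then reduce over K. No matmul kernel is involved.
    out = []
    for a_mat, b_mat in zip(a, b):
        column = [row[0] for row in b_mat]
        out.append([[sum(x * y for x, y in zip(a_row, column))] for a_row in a_mat])
    return out


a = [[[1.0, 2.0], [3.0, 4.0]]]
b = [[[5.0], [6.0]]]
print(bmm_last_dim_one(a, b))  # [[[17.0], [39.0]]]
```

Expressed as a pointwise multiply followed by a reduction, the operation becomes something a compiler can fuse with neighbouring pointwise work.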

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116582
Approved by: https://github.com/lezcano
2024-01-01 21:24:02 +00:00
Peter Bell
fb80f05ee2 [inductor] Fix angle decomposition return type (#115700)
The current decomposition always returns float32 when the input isn't complex.
Instead, we should do proper type promotion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115700
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699
2023-12-13 14:16:31 +00:00
Peter Bell
9cdc80d581 [inductor] Fix torch.bernoulli decomposition return type (#115699)
Strangely enough, `torch.bernoulli` doesn't return a boolean and instead
it matches the output type of the inplace bernoulli.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115699
Approved by: https://github.com/lezcano
ghstack dependencies: #115677
2023-12-13 14:16:31 +00:00
Jez Ng
fda94124d7 [inductor] Make {cudagraph_trees,decomposition,post_grad}.py pass follow_imports typechecking (#113609)
I added explicit imports to `kernel/__init__.py` as mypy doesn't seem to
understand an empty `__init__.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113609
Approved by: https://github.com/eellison
2023-11-15 05:04:11 +00:00
Adnan Akhundov
8d41a5c605 [indictor] Fix cat decomp when first tensor is empty (#113514)
Summary: Previously, when the first tensor argument to `aten.cat` was empty and there was only one non-empty tensor argument, the first (empty) tensor was erroneously returned by the `aten.cat` decomposition. Here we fix the bug.
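
The intended behavior can be modelled with plain lists standing in for 1-D tensors. This is a hypothetical sketch of the corrected logic (`cat_1d` is not the real decomposition), showing only that the single non-empty input is what should come back.

```python
def cat_1d(tensors):
    # Tensors are modelled as plain Python lists; empty lists stand for empty tensors.
    non_empty = [t for t in tensors if len(t) > 0]
    if len(non_empty) == 1:
        # Return a copy of the single non-empty input, not tensors[0].
        return list(non_empty[0])
    return [x for t in non_empty for x in t]


print(cat_1d([[], [1, 2, 3]]))  # [1, 2, 3]
```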

Test Plan:

```
$ python test/inductor/test_torchinductor.py -k test_cat_empty
...
----------------------------------------------------------------------
Ran 2 tests in 5.760s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113514
Approved by: https://github.com/jansel
2023-11-11 20:34:22 +00:00
Aleksei Nikiforov
65304d8fd0 s390x: fix inductor constructing floats out of bytes (#112723)
This change fixes test_embedding_bag_byte_unpack_cpu from test/inductor/test_torchinductor.py on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112723
Approved by: https://github.com/jansel
2023-11-07 06:51:46 +00:00
leslie-fang-intel
6ba2748690 [Quant] [PT2] Enable Decomposed quant per tensor/channel to accept bfloat16 input (#112225)
**Summary**
- PR 4 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable `decomposed quant_per_tensor` and `quant_per_channel` to accept bfloat16 input.

**TestPlan**
```
python -m pytest test_quantized_tensor.py -k test_decomposed_quantize_per_tensor_bfloat16_input
python -m pytest test_quantized_tensor.py -k test_decomposed_quantize_per_channel_bfloat16_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112225
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-11-03 23:47:43 +00:00
Peter Bell
1dd57082a4 [inductor] Decompose boolean min/max into all/any (#110311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110311
Approved by: https://github.com/lezcano
ghstack dependencies: #110310
2023-10-24 21:33:53 +00:00
Hongtao Yu
6977ba6e3c [inductor] decomposition for complex addition (#110740)
Tracks https://github.com/pytorch/pytorch/issues/98161

Complex number support in PyTorch isn't ideal today: most complex operations end up handled by the ATen runtime, except for `torch.angle`, which is handled in [105609](https://github.com/pytorch/pytorch/pull/105609). In general, a better approach could be to decompose complex operations first, which exposes more opportunities for fusion, and then let Triton handle non-contiguous (strided) tensor operations more efficiently. This change adds support for decomposing complex additions.

```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 6
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```
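
The idea can be sketched in pure Python, assuming the decomposition treats each complex value as an interleaved (real, imag) pair of floats and adds elementwise. The helper names below mirror `torch.view_as_real` / `torch.view_as_complex` but are local stand-ins operating on lists, not the PyTorch functions.

```python
def view_as_real(zs):
    """Flatten complex values into interleaved [re0, im0, re1, im1, ...]."""
    flat = []
    for z in zs:
        flat.extend((z.real, z.imag))
    return flat


def view_as_complex(flat):
    """Inverse of view_as_real for an even-length list of floats."""
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def complex_add_decomposed(a, b):
    """Complex addition expressed as a plain elementwise float addition."""
    ra, rb = view_as_real(a), view_as_real(b)
    return view_as_complex([x + y for x, y in zip(ra, rb)])


a = [1 + 2j, 3 - 1j, 0.5j]
b = [2 - 2j, 1 + 1j, 4 + 0j]
total = complex_add_decomposed(a, b)
print(total)
```

With three complex elements the real view has six floats, which would be consistent with `xnumel = 6` in the kernel above.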

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110740
Approved by: https://github.com/jansel
2023-10-24 03:41:24 +00:00
PyTorch MergeBot
98c329b19e Revert "[core ATen IR] Add decompositions for max, min, var_mean (#110906)"
This reverts commit 9606cda64e.

Reverted https://github.com/pytorch/pytorch/pull/110906 on behalf of https://github.com/SS-JIA due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/110906#issuecomment-1757490740))
2023-10-11 11:41:21 +00:00
SS-JIA
9606cda64e [core ATen IR] Add decompositions for max, min, var_mean (#110906)
## Context

Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:

```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```
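
A pure-Python sketch of the same pattern on a flat list of floats, i.e. a reduction over a single dimension. The function names are hypothetical, and `statistics.variance` (sample variance, divisor `n - 1`) is used as a stand-in for the default unbiased `aten.var`.

```python
import statistics


def max_decomp(xs):
    """(amax, argmax): the value and the index of its first occurrence."""
    amax = max(xs)
    return amax, xs.index(amax)


def min_decomp(xs):
    """(amin, argmin): the value and the index of its first occurrence."""
    amin = min(xs)
    return amin, xs.index(amin)


def var_mean_decomp(xs):
    """(var, mean) computed by two independent component reductions."""
    return statistics.variance(xs), statistics.fmean(xs)


xs = [1.0, 2.0, 3.0, 4.0]
print(max_decomp(xs), min_decomp(xs), var_mean_decomp(xs))
```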

For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead, as was done previously for other `refs` implementations. cc: @peterbell10 @lezcano

Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
2023-10-11 00:06:24 +00:00
Stephen Jia
c2e7a0d689 [core IR] Add decomps for aten.sum and aten.squeeze variants (#110645)
Summary:
## Context

Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims` respectively. Add decompositions for the other, non-generic variants of these operators to express them using the most generic variant.
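
A shape-only sketch of the squeeze case in pure Python (hypothetical function names operating on shape tuples, not the registered decompositions). The generic variant takes a list of dims, and the other variants are thin wrappers around it; the `sum` case would follow the same pattern.

```python
def squeeze_dims(shape, dims):
    """Generic variant: drop every listed dim whose size is 1."""
    ndim = len(shape)
    targets = {d % ndim for d in dims}  # normalize negative dims
    return tuple(s for i, s in enumerate(shape) if not (i in targets and s == 1))


def squeeze_default(shape):
    """squeeze() with no dims: defer to the generic variant over all dims."""
    return squeeze_dims(shape, range(len(shape)))


def squeeze_dim(shape, dim):
    """squeeze(dim): defer to the generic variant with a one-element list."""
    return squeeze_dims(shape, [dim])


print(squeeze_default((1, 3, 1, 2)), squeeze_dim((1, 3, 1, 2), 2))
```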

Note that to register these decomps, the reference implementation under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10

Test Plan: Github CI + Meta Internal CI

Differential Revision: D49965952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
2023-10-07 04:21:51 +00:00
Kazuaki Ishizaki
434a996c42 Fix typo under torch/_inductor directory (#110530)
This PR fixes typos in comments and messages in files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530
Approved by: https://github.com/kit1980
2023-10-05 02:17:20 +00:00
Bert Maher
4c3d3b7176 [inductor] Lower small gemvs on CPU (#110456)
If the gemv fits in registers, like [1,16]*[16,16], MKL isn't going to
do much better than compiling a simple for-loop, and we end up paying
allocation overhead and ATen overhead.
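
For reference, the "simple for-loop" shape of such a gemv, written as a pure-Python sketch (`small_gemv` is a hypothetical name; the actual lowering generates compiled code, not Python):

```python
def small_gemv(x, w):
    """Compute [1, K] * [K, N] -> [1, N] with plain nested loops."""
    k_size, n_size = len(w), len(w[0])
    out = [0.0] * n_size
    for n in range(n_size):
        acc = 0.0
        for k in range(k_size):
            acc += x[k] * w[k][n]
        out[n] = acc
    return out


# [1,16] * [16,16] with a scaled identity matrix as the weight.
x = [float(i) for i in range(16)]
w = [[2.0 if r == c else 0.0 for c in range(16)] for r in range(16)]
y = small_gemv(x, w)
print(y)
```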

A very small internal inference model drops from 7->5 us with this change.

Differential Revision: [D49875991](https://our.internmc.facebook.com/intern/diff/D49875991/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110456
Approved by: https://github.com/chenyang78, https://github.com/jgong5
2023-10-04 15:16:38 +00:00
Stephen Jia
ff96f6d04f [core IR][reland] Add split.Tensor and unbind decompositions to core ATen decomp table (#110323)
Summary:
This is a reland of [github PR #110102](https://github.com/pytorch/pytorch/pull/110102).

The original PR had to be unlanded due to internal CI failures. This diff applies some small fixes to the failing tests to adjust to the new decompositions.

Note that `lift_fresh` will not be decomposed for now, since it was found that [constant propagation looks specifically for `lift_fresh`](13af952f94/torch/fx/experimental/proxy_tensor.py (L381-L386)). Therefore decomposing `lift_fresh` will interfere with constant propagation during export.

Test Plan: Github CI and internal CI

Differential Revision: D49761321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110323
Approved by: https://github.com/jansel
2023-10-03 14:35:04 +00:00
eellison
3812f2e40c Preserve layout on like constructors (#110242)
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed in. That graph is generated from running aot_autograd with decompositions. If the decompositions give incorrect strides, so will inductor.

This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations, b) striding of non-dense outputs. Both of these are lower-pri compared to preserving channels last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout to handle arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in so that this would compose with the `replace_random` pass, and because the two pointwise ops will get fused anyway.
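
A stride-level sketch in pure Python of what "preserving channels last" means for a 4-D `_like` output. All function names are hypothetical, and the selection rule is a simplification: it ignores ambiguous cases such as size-1 dimensions, and it deliberately falls back to contiguous strides for arbitrary permutations, mirroring the limitation noted above.

```python
def contiguous_strides(shape):
    """Row-major strides: the last dimension varies fastest."""
    strides, running = [], 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides of an NCHW-shaped tensor stored physically as NHWC."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def like_output_strides(shape, input_strides):
    """Keep channels_last if the input uses it, else fall back to contiguous."""
    if len(shape) == 4 and tuple(input_strides) == channels_last_strides(shape):
        return channels_last_strides(shape)
    return contiguous_strides(shape)


shape = (2, 3, 4, 5)
print(contiguous_strides(shape), channels_last_strides(shape))
```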

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3
2023-10-02 23:53:55 +00:00
chilli
13681382d5 Add heuristic for when evict_first should be set (and some other minor things) (#108841)
Example of when the `evict_first` heuristic helps.
```
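import torch
from torch._inductor.utils import do_bench

@torch.compile
def f(a, b):
    # Pointwise multiply followed by a reduction over the last dimension
    return (a * b).sum(dim=-1)

N = 512
# Permuted (non-contiguous) inputs, so the two operands are read with different strides
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
print(do_bench(lambda: f(*inps)))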
```

This generates code like this: http://ix.io/4HFs

```
Original: 3.8 ms
This PR: 3.54 ms
Always evict_first: 5.4 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-10-01 17:06:12 +00:00