Commit Graph

292 Commits

Author SHA1 Message Date
PyTorch MergeBot
98c329b19e Revert "[core ATen IR] Add decompositions for max, min, var_mean (#110906)"
This reverts commit 9606cda64e.

Reverted https://github.com/pytorch/pytorch/pull/110906 on behalf of https://github.com/SS-JIA due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/110906#issuecomment-1757490740))
2023-10-11 11:41:21 +00:00
SS-JIA
9606cda64e [core ATen IR] Add decompositions for max, min, var_mean (#110906)
## Context

Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:

```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```
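
The tuple-returning pattern can be sketched in plain Python (illustrative only, not the actual ATen registrations; `correction=1` mirrors `torch.var`'s default):

```python
# Pure-Python sketch of the pattern: each tuple-returning reduction is
# expressed as a pair of simpler component reductions.

def amax(xs):
    return max(xs)

def argmax(xs):
    # index of the (first) maximum element
    return max(range(len(xs)), key=lambda i: xs[i])

def mean(xs):
    return sum(xs) / len(xs)

def var(xs, correction=1):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - correction)

def decomposed_max(xs):
    return amax(xs), argmax(xs)

def decomposed_var_mean(xs):
    return var(xs), mean(xs)
```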

For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead, as was done previously for other `refs` implementations. cc: @peterbell10 @lezcano

Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
2023-10-11 00:06:24 +00:00
Kazuaki Ishizaki
fde28fdc8c Fix typo under torch/_decomp directory (#110821)
This PR fixes typo of comments in files under `torch/_decomp` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110821
Approved by: https://github.com/Skylion007
2023-10-08 20:33:49 +00:00
Stephen Jia
c2e7a0d689 [core IR] Add decomps for aten.sum and aten.squeeze variants (#110645)
Summary:
## Context

Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims` respectively. Add decompositions for the other, non-generic variants of these operators to express them using the most generic variant.
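
The idea can be sketched in plain Python for a 2-D nested list (illustrative only; the real decomps are ATen registrations):

```python
# Illustrative sketch: the non-generic full reduction is rewritten in
# terms of the generic dim-list variant by passing every dimension.

def sum_dim_int_list(t, dims):
    # reduce a 2-D nested list over the given dims (subset of {0, 1})
    if sorted(dims) == [0, 1]:
        return sum(sum(row) for row in t)
    if dims == [0]:
        return [sum(col) for col in zip(*t)]
    if dims == [1]:
        return [sum(row) for row in t]
    return t

def sum_default(t):
    # non-generic variant decomposed into the generic one: reduce all dims
    return sum_dim_int_list(t, [0, 1])
```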

Note that to register these decomps, the reference implementations under `_refs` had to be removed from the registered decompositions. cc: @lezcano @peterbell10

Test Plan: Github CI + Meta Internal CI

Differential Revision: D49965952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
2023-10-07 04:21:51 +00:00
cdzhan
7cc0020a80 [decomp] Fix different return type in threshold_backward vs. eager (#110689)
The return type differed due to type promotion with a floating-point scalar in decompositions.py.

Fixes part of #100838
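
The dtype pitfall can be illustrated with a pure-Python analogue of the `threshold_backward` decomposition (a sketch, not the actual decomp code): a float literal for the masked-out value "promotes" integer gradients.

```python
# Sketch: grad passes through where input > threshold, else zero.

def threshold_backward_buggy(grad, x, threshold):
    # 0.0 is a float scalar: integer grads come back as floats
    return [g if xi > threshold else 0.0 for g, xi in zip(grad, x)]

def threshold_backward_fixed(grad, x, threshold):
    # reuse the grad's own element type so the "zero" does not promote
    zero = type(grad[0])(0)
    return [g if xi > threshold else zero for g, xi in zip(grad, x)]
```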

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110689
Approved by: https://github.com/ezyang
2023-10-06 20:59:58 +00:00
chilli
ceb773b68d Fix #110680 (requires_grad typo in decomp) (#110687)
Fixes https://github.com/pytorch/pytorch/issues/110680
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110687
Approved by: https://github.com/voznesenskym, https://github.com/lezcano
ghstack dependencies: #110501, #110504, #110591, #110668
2023-10-06 10:36:01 +00:00
Jerry Zhang
f2a1b93549 Back out "[quant] Support integer implementations for adaptive_avg_pool2d (#104226)" (#110316)
Summary:
Original commit changeset: acdb5b34e3aa

Original Phabricator Diff: D47321689

Test Plan: opinfo tests in CI

Differential Revision: D49789403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110316
Approved by: https://github.com/kimishpatel
2023-10-03 16:59:23 +00:00
Peter Bell
be3b16daad [decomp] Fix baddbmm decomposition (#109714)
The decomposition is currently registered without the pw_cast_for_opmath
decorator, due to the ordering of decorators being meaningful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714
Approved by: https://github.com/lezcano
2023-09-28 21:23:44 +00:00
SS-JIA
5df8aca994 [core IR] Add a core decomposition for floor_divide (#110046)
## Context

Introduce a core decomposition for `aten.floor_divide` into other `aten` ops, and add it to the core ATen decomposition table.

This replaces the decomposition of `floor_divide` that was used by Inductor. I noticed there was a note on that decomposition

```
# TorchInductor-only decomposition. It should not be taken to core.
# See https://github.com/pytorch/torchdynamo/pull/1120
```

but couldn't discern the reason why this is the case. cc: @lezcano
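
One common way to express floor division in terms of truncating division, illustrative of the kind of decomposition described above (a sketch, not the actual ATen lowering):

```python
import math

def floor_divide(a, b):
    q = math.trunc(a / b)            # truncating division
    r = a - q * b                    # remainder under truncation
    # correct toward -inf when the remainder is nonzero and signs differ
    if r != 0 and (a < 0) != (b < 0):
        q -= 1
    return q
```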

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110046
Approved by: https://github.com/peterbell10
2023-09-26 08:39:21 +00:00
Mwiza Kunda
5c4b5baf21 Fix python decomps for OpOverloadPackets and add tests (#107707)
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but have aten op overloads that do. Additionally, Python decompositions may register `OpOverloadPacket`s, so decompositions need to be tested to ensure all `OpOverload`s still function for the `Meta` key (e.g., if a Python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the Python function needs to support receiving out arguments)

- Add out parameter wrappers to python decomps for aten ops that have out overloads
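
The "out parameter wrapper" idea can be sketched as a small decorator (hypothetical names; the real wrappers live in `torch/_decomp` and copy into tensors rather than lists):

```python
# Sketch: given a functional decomposition, produce a variant that also
# accepts an out= argument, so the same Python decomp can serve the
# out-overload.

def add_out_support(fn):
    def wrapper(*args, out=None, **kwargs):
        result = fn(*args, **kwargs)
        if out is None:
            return result
        out[:] = result              # copy into the caller-provided buffer
        return out
    return wrapper

@add_out_support
def double_all(xs):
    return [2 * x for x in xs]
```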

CC. @ezyang @albanD @lezcano

Fixes #107713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
2023-09-25 20:53:30 +00:00
SS-JIA
7de669f2f9 [core IR] Remove trunc decomp and add trunc to core (#109902)
Following up from [this comment](https://github.com/pytorch/pytorch/pull/109319#discussion_r1330803226). Remove the decomposition for `trunc`, and add it as a core operator.

Going forward, provide similar treatment for operators that map cleanly to hardware instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109902
Approved by: https://github.com/peterbell10
2023-09-25 18:18:06 +00:00
Jijie Wei
334ead04a9 Back out "[decomp] Fix baddbmm decomposition (#109714)" (#109855)
Summary:
Original commit changeset: 95c462a380c9

Original Phabricator Diff: D49484954

This diff caused a test failure for the deterministic NE test; see https://www.internalfb.com/sandcastle/job/18014399565419856/

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test -- --exact 'aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test - aps_models.ads.icvr.tests.icvr_fm_e2e_deterministic_ne_test.ICVR_FM_E2EDeterministicNeTest: test_e2e_deterministic_icvr_fm_pt2_fsdp_multi_gpus'

https://www.internalfb.com/intern/testinfra/testrun/16888498605839953

Differential Revision: D49527271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109855
Approved by: https://github.com/yanboliang
2023-09-22 22:01:38 +00:00
Mwiza Kunda
8dedc9dd9b Add meta tests for layer/group/batch norm backward (#109591)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109591
Approved by: https://github.com/ezyang
2023-09-21 18:58:51 +00:00
Peter Bell
6f0cf5a837 [decomp] Decompose unsafe_split{,_with_sizes} into safe variants (#109668)
The "safety" aspect refers to the output not being registered as aliasing the
input, but after AOTAutograd I don't think this distinction matters. However,
we shouldn't use the same decomposition as the safe variant in case the backend
doesn't want to decompose split.
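
As a rough analogy in plain Python (lists have no aliasing metadata, so the "unsafe" distinction disappears, which is exactly the point after AOTAutograd):

```python
# Sketch: the unsafe variant decomposes into a call to the safe variant.

def split(xs, split_size):
    # safe variant: chunk into pieces of at most split_size
    return [xs[i:i + split_size] for i in range(0, len(xs), split_size)]

def unsafe_split(xs, split_size):
    # decomposed into the safe variant
    return split(xs, split_size)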

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109668
Approved by: https://github.com/lezcano
ghstack dependencies: #109667
2023-09-20 18:45:56 +00:00
Peter Bell
36a8105f54 [decomp] Fix baddbmm decomposition (#109714)
The decomposition is currently registered without the pw_cast_for_opmath
decorator, due to the ordering of decorators being meaningful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714
Approved by: https://github.com/lezcano
2023-09-20 18:40:21 +00:00
Salil Desai
40b2c796dc [Decomposition] baddbmm (#108534)
Summary:
Move the decomposition of baddbmm out of _inductor/decomposition.py and include it in core_aten_decompositions

ff38c0e2f9/torch/_inductor/decomposition.py (L203)
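
The semantics being moved can be sketched in plain Python on nested lists (illustrative only; the actual decomp operates on tensors): `baddbmm(input, b1, b2, beta, alpha) = beta * input + alpha * bmm(b1, b2)`.

```python
# Sketch: baddbmm as a batched matmul plus a scaled addition.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def bmm(b1, b2):
    # batched matrix multiply over lists of 2-D matrices
    return [matmul(m1, m2) for m1, m2 in zip(b1, b2)]

def baddbmm(inp, b1, b2, beta=1.0, alpha=1.0):
    prod = bmm(b1, b2)
    return [[[beta * inp[n][i][j] + alpha * prod[n][i][j]
              for j in range(len(inp[n][i]))]
             for i in range(len(inp[n]))]
            for n in range(len(inp))]
```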

Test Plan: Phabricator + OSS Tests

Differential Revision: D48871741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108534
Approved by: https://github.com/SherlockNoMad
2023-09-20 12:49:32 +00:00
Salil Desai
2e721aab98 [Decomposition] Trunc (#109319)
Summary:
Add Decomp for Trunc and add it to core_aten_decompositions
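
One plausible way to express trunc via floor/ceil, sketched in plain Python (not necessarily the exact decomposition rule added here):

```python
import math

def trunc(x):
    # round toward zero: floor for non-negative values, ceil for negatives
    return math.floor(x) if x >= 0 else math.ceil(x)
```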

Differential Revision: D49042033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109319
Approved by: https://github.com/SherlockNoMad
2023-09-19 13:30:13 +00:00
Jez Ng
db48bc80d9 Check index size during decomp of index_add (#108826)
This partially fixes the `test_index_add_correctness` test (#108181)
when run under inductor: it causes an exception to be raised [here][1]
as expected.

The test as a whole still cannot be made to pass under inductor because
the [last assert][2] still fails, likely due to #108798.

[1]: dec2b267d4/test/test_torch.py (L6049)
[2]: dec2b267d4/test/test_torch.py (L6051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108826
Approved by: https://github.com/eellison
2023-09-13 13:06:26 +00:00
Edward Z. Yang
9f37aec964 Add torch._check_is_size (#108685)
Check the comments for what it does. The key distinction is that if
you feed it an unbacked SymInt, we will also apply a >= 2 assumption
at compile time.
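
A hypothetical pure-Python analogue of the runtime side of this check (simplified; the real `torch._check_is_size` additionally feeds the >= 2 assumption into the symbolic-shape machinery at compile time):

```python
# Sketch: assert that a value is a valid size, i.e. non-negative.

def check_is_size(value):
    if value < 0:
        raise RuntimeError(f"expected a size (>= 0), got {value}")
    return value
```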

This will get exercised when I reland
https://github.com/pytorch/pytorch/pull/107788

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108685
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-09-07 12:48:39 +00:00
Huy Do
5a4fe05a15 Revert "Force synced KJT to trace unbacked SymInt (#107788)" (#108684)
This reverts commit 3b92ef814d. The revert bot did not work on https://github.com/pytorch/pytorch/pull/107788 for unclear reasons, so the commit is reverted manually instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108684
Approved by: https://github.com/ezyang
2023-09-06 19:15:45 +00:00
Kimish Patel
ebed490c2f [sdpa decomp] change sdpa decomp to be consistent with flash attention (#108608)
Summary: See the comment in the code for the reasoning behind the change

Test Plan:
buck2 test executorch/examples/export/test:test_export --
test_vit_export_to_executorch

Differential Revision: D48992180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108608
Approved by: https://github.com/larryliu0820
2023-09-06 15:34:03 +00:00
Edward Z. Yang
3b92ef814d Force synced KJT to trace unbacked SymInt (#107788)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107788
Approved by: https://github.com/voznesenskym
2023-09-06 03:18:26 +00:00
Kimish Patel
cc50e654d4 [aten decomp] Update sdpa decom (#108371)
Summary:
The earlier decomp was routing the _flash* variant to the _math variant, and this
resulted in failures during torch.export, for a reason that I
couldn't trace.

However, it seems that we should really have a decomp for
scaled_dot_product_attention, instead of
scaled_dot_product_flash_attention. Right?

This diff adds that. Plus, it adds a test to check that a model exported
via two-stage export has decomposed the op. This test needs improvement
to figure out what the core aten opset is and check for anything that is
not inside it.

Test Plan:
test_model_exports_to_core_aten

Differential Revision: [D48917461](https://our.internmc.facebook.com/intern/diff/D48917461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108371
Approved by: https://github.com/larryliu0820
2023-09-03 15:17:08 +00:00
lezcano
239ee76177 Add refs/decomps for dot/vdot (#108194)
Follow-up on https://github.com/pytorch/pytorch/issues/108127#issuecomment-1698142427
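
The reference semantics are simple enough to sketch in plain Python (illustrative; the actual refs operate on tensors): `dot` is a plain sum of products, while `vdot` conjugates its first argument, which matters for complex inputs.

```python
# Sketch of dot/vdot reference semantics on 1-D sequences.

def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def vdot(xs, ys):
    # conjugate the first argument (relevant for complex numbers)
    return sum(x.conjugate() * y for x, y in zip(xs, ys))
```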

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108194
Approved by: https://github.com/peterbell10
ghstack dependencies: #108188
2023-08-31 15:30:23 +00:00
rzou
0e4752bafc Allow registering decomps for HigherOrderOp; add decomp for out_dtype (#108080)
We allow registering decomps for HigherOrderOp via the existing decomp
mechanisms:
- I refactored those APIs to accept torch._ops.OperatorBase, which is the base
  class for torch.ops.HigherOrderOperator and torch.ops.OpOverload
- HigherOrderOps must directly call maybe_handle_decomp in their
  ProxyTorchDispatchMode handling in order to resolve decompositions. We
  can change this in the future so that they do not need to do this.

Next, we add an inductor decomp for out_dtype. This decomp shouldn't be
generally available because we want to preserve out_dtype to the backend
for other use cases (i.e. executorch).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108080
Approved by: https://github.com/HDCharles
2023-08-31 03:15:38 +00:00
Mengwei Liu
0fb1c05c5a [pytorch] Add decomp rule for scaled_dot_product_attention (#108180)
`scaled_dot_product_attention` used to be decomposed in pre-autograd, given that it calls `_scaled_dot_product_attention_math`, which only has a `CompositeImplicitAutograd` kernel. As a result it's decomposed into ops with finer granularity.

However recent PRs (#103826 #105131) added new logic in `scaled_dot_product_attention` and now it calls `_scaled_dot_product_flash_attention` which contains a CPU kernel. This results in `_scaled_dot_product_flash_attention` showing up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still being decomposed the same way as before, i.e., going through `_scaled_dot_product_attention_math`. Notice that this decomp rule should be excluded by inductor.

Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad
2023-08-30 15:52:08 +00:00
vfdev-5
0cfc5899f9 [inductor] Improved grid_sampler_2d decomposition for cuda (#104710)
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to https://github.com/pytorch/pytorch/issues/104296

Perfs:
- speed-up on cuda (~x5) and cpu (~x2) for bicubic mode

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         38.010 (+-0.118)        |          51.466 (+-1.257)          |             47.867 (+-0.124)            |     0.930 (+-0.000)      |           33.654 (+-0.411)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         35.532 (+-0.236)        |          52.189 (+-0.093)          |             58.979 (+-0.206)            |     1.130 (+-0.000)      |           32.543 (+-0.198)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         38.187 (+-0.112)        |          47.892 (+-0.117)          |             45.833 (+-0.081)            |     0.957 (+-0.000)      |           33.752 (+-0.116)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         36.708 (+-0.244)        |          51.680 (+-0.104)          |             58.360 (+-0.108)            |     1.129 (+-0.000)      |           32.576 (+-0.751)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         24.201 (+-0.088)        |          27.451 (+-0.059)          |             27.937 (+-0.081)            |     1.018 (+-0.000)      |           24.367 (+-0.074)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         19.266 (+-0.105)        |          26.070 (+-0.085)          |             26.092 (+-0.054)            |     1.001 (+-0.000)      |           20.144 (+-0.064)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         24.293 (+-0.125)        |          26.085 (+-0.064)          |             26.575 (+-0.061)            |     1.019 (+-0.000)      |           24.515 (+-0.095)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         19.440 (+-0.075)        |          25.252 (+-0.059)          |             25.259 (+-0.051)            |     1.000 (+-0.000)      |           19.770 (+-0.070)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        114.900 (+-0.508)        |         113.416 (+-1.271)          |            248.679 (+-1.431)            |     2.193 (+-0.000)      |          114.609 (+-0.515)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        115.973 (+-0.555)        |         124.711 (+-1.596)          |            282.187 (+-2.418)            |     2.263 (+-0.000)      |          115.368 (+-0.652)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        111.730 (+-0.562)        |         110.914 (+-0.865)          |            253.899 (+-2.226)            |     2.289 (+-0.000)      |          111.285 (+-1.226)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.859 (+-0.487)        |         131.696 (+-1.298)          |            294.124 (+-1.963)            |     2.233 (+-0.000)      |          110.910 (+-0.969)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |        228.811 (+-0.037)        |          92.990 (+-0.446)          |             92.648 (+-0.286)            |     0.996 (+-0.000)      |          228.274 (+-0.067)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |        222.107 (+-0.076)        |          93.247 (+-0.387)          |             92.528 (+-0.423)            |     0.992 (+-0.000)      |          221.922 (+-0.297)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        235.654 (+-0.055)        |          75.781 (+-0.566)          |            115.865 (+-0.419)            |     1.529 (+-0.000)      |          236.032 (+-0.111)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        226.752 (+-0.088)        |          76.312 (+-0.328)          |            116.468 (+-0.477)            |     1.526 (+-0.000)      |          226.950 (+-0.027)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        225.540 (+-0.013)        |          75.638 (+-0.341)          |             72.621 (+-0.292)            |     0.960 (+-0.000)      |          225.937 (+-0.017)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        217.425 (+-0.024)        |          75.484 (+-0.545)          |             73.518 (+-0.296)            |     0.974 (+-0.000)      |          217.793 (+-0.008)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        231.474 (+-0.020)        |          75.972 (+-0.339)          |             73.030 (+-0.387)            |     0.961 (+-0.000)      |          231.991 (+-0.184)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        223.408 (+-0.016)        |          75.622 (+-0.279)          |             73.542 (+-0.336)            |     0.973 (+-0.000)      |          223.893 (+-0.021)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        319.382 (+-0.023)        |         149.060 (+-0.190)          |            772.116 (+-0.266)            |     5.180 (+-0.000)      |          320.549 (+-0.387)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        319.987 (+-0.134)        |         154.443 (+-0.014)          |            797.651 (+-0.232)            |     5.165 (+-0.000)      |          320.665 (+-0.397)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        326.138 (+-0.439)        |         149.092 (+-0.036)          |            772.508 (+-0.259)            |     5.181 (+-0.000)      |          325.751 (+-0.398)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        326.024 (+-0.118)        |         154.452 (+-0.209)          |            797.756 (+-0.229)            |     5.165 (+-0.000)      |          326.870 (+-0.372)

Times are in microseconds (us).

```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230828-134459-affine-grid-sampler-PR-vs-Nightly-speedup.md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104710
Approved by: https://github.com/lezcano
2023-08-29 05:54:24 +00:00
Sam Larsen
20f3808aa2 Implement decomposition for aten.tensor_split.tensor_indices_or_sections (#107251)
Summary: Before this change, the tensor_indices_or_sections variant of aten.tensor_split causes a `RuntimeError: The tensor has a non-zero number of elements` due to that operation needing to introspect data. Decomposing into one of the other two tensor_split variants fixes the problem.
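
The decomposition idea can be sketched in plain Python (illustrative; the real decomp rewrites the tensor-of-indices variant into one of the other two variants so no data-dependent introspection is needed at trace time):

```python
# Sketch: splitting at a list of indices, the form the tensor-of-indices
# variant is decomposed into.

def tensor_split_indices(xs, indices):
    bounds = [0] + list(indices) + [len(xs)]
    return [xs[a:b] for a, b in zip(bounds, bounds[1:])]
```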

Test Plan:
Enabled tensor_split tests in test/inductor/test_torchinductor_opinfo.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107251
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-08-28 17:01:23 +00:00
Vishwa Raj Singh
35de780aa6 Fix Inplace tensor update on transpose (#104689)
Fixes https://github.com/pytorch/pytorch/issues/103650

- To align with the HPU device backend architecture, ensure all non-view ops return contiguous fake tensor outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104689
Approved by: https://github.com/ezyang
2023-08-24 16:58:50 +00:00
Andrew Or
64d5851b1f make python decomp for native_batch_norm CompositeImplicitAutograd, remove native_batch_norm from core aten opset (#107791)
Summary:
(From Brian Hirsh)

Description copied from what I put in a comment in this PR: https://github.com/pytorch/pytorch/pull/106329

So, the slightly-contentious idea behind this PR is that lower in the stack, I updated torch._decomps.get_decomps() to check not only the decomp table to see if a given op has a decomposition available, but to also check the dispatcher for any decomps registered to the CompositeImplicitAutograd key (link: https://github.com/pytorch/pytorch/pull/105865/files#diff-7008e894af47c01ee6b8eb94996363bd6c5a43a061a2c13a472a2f8a9242ad43R190)

There's one problem though: we don't actually make any hard guarantees about whether a given kernel in the dispatcher does or does not point to a decomposition. We do rely pretty heavily, however, on the fact that everything registered to the CompositeImplicitAutograd key is in fact a decomposition into other ops.

QAT would like this API to faithfully return "the set of all decomps that would have run if we had traced through the dispatcher". However, native_batch_norm is an example of an op that has a pre-autograd decomp registered to it (through op.py_impl()), but the decomp is registered directly to the Autograd key instead of being registered to the CompositeImplicitAutograd key.

If we want to provide a guarantee to QAT that they can programmatically access all decomps that would have run during tracing, then we need to make sure that every decomp we register to the Autograd key is also registered to the CompositeImplicitAutograd key.

This might sound kind of painful (since it requires auditing), but I think in practice this basically only applies to native_batch_norm.

Test Plan: python test/test_decomp.py

Differential Revision: D48607575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107791
Approved by: https://github.com/jerryzh168, https://github.com/SherlockNoMad
2023-08-24 15:19:07 +00:00
Sherlock Huang
ee4b99cc3a Decomp for aten.dropout (#106274)
When exporting dropout with a CPU tensor, we get the following graph module
```
    class GraphModule(torch.nn.Module):
        def forward(self, arg0_1: f32[512, 10]):
            empty_memory_format: f32[512, 10] = torch.ops.aten.empty.memory_format([512, 10], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False, memory_format = torch.contiguous_format)
            bernoulli_p: f32[512, 10] = torch.ops.aten.bernoulli.p(empty_memory_format, 0.9);  empty_memory_format = None
            div_scalar: f32[512, 10] = torch.ops.aten.div.Scalar(bernoulli_p, 0.9);  bernoulli_p = None
            mul_tensor: f32[512, 10] = torch.ops.aten.mul.Tensor(arg0_1, div_scalar);  arg0_1 = div_scalar = None
            return (mul_tensor,)
```

In addition, if we export with eval() mode, we will have an empty graph.

However, when exporting with a CUDA tensor, we get
```
    class GraphModule(torch.nn.Module):
        def forward(self, arg0_1: f32[512, 10]):
            native_dropout_default = torch.ops.aten.native_dropout.default(arg0_1, 0.1, True);  arg0_1 = None
            getitem: f32[512, 10] = native_dropout_default[0];  native_dropout_default = None
            return (getitem,)
```
and exporting under eval() mode will still leave a dropout node in the graph.

This PR makes exporting with a CPU tensor also produce aten.native_dropout.
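
The CPU decomposition visible in the first graph (mask from bernoulli(1 - p), scale by 1 / (1 - p), multiply) can be sketched in plain Python, illustrative only:

```python
import random

# Sketch of the dropout decomposition: sample a keep-mask, rescale by
# the keep probability, and multiply elementwise.

def dropout(xs, p, rng=random.random):
    keep = 1.0 - p
    mask = [1.0 if rng() < keep else 0.0 for _ in xs]   # bernoulli(1 - p)
    return [x * (m / keep) for x, m in zip(xs, mask)]
```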

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106274
Approved by: https://github.com/ezyang
2023-08-23 21:12:37 +00:00
Edward Z. Yang
5673c0874c Use expect_true to make split with unbacked sizes work. (#106788)
This pattern shows up in torchrec KeyedJaggedTensor.  Most
of the change in this PR is mechanical: whenever we failed
an unbacked symint test due to just error checking, replace the
conditional with something that calls expect_true (e.g.,
torch._check or TORCH_SYM_CHECK).

Some of the changes are a bit more nuanced, I've commented on the PR
accordingly.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106788
Approved by: https://github.com/lezcano
ghstack dependencies: #106720
2023-08-15 20:31:30 +00:00
lezcano
2c5f96deac [Inductor] Make softshrink composite implicit (#107052)
The backward is pretty much equivalent to the one we had written
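
For reference, softshrink's forward can be sketched in plain Python (a sketch of the standard definition, not the actual ref); once the op is composite implicit, autograd derives the backward (±1 inside the pass-through regions, 0 elsewhere) from this forward automatically.

```python
# Sketch: softshrink shrinks values toward zero by lambd and zeroes the
# band [-lambd, lambd].

def softshrink(x, lambd=0.5):
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0
```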

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107052
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038, #107039, #107051
2023-08-14 21:01:50 +00:00
lezcano
3b1254e800 Make hardshrink's decomp composite implicit (#107039)
The generated code is the same
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107039
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038
2023-08-14 21:01:50 +00:00
Sam Larsen
e165938853 Implement decomposition for aten.rrelu_with_noise (#106812)
Test Plan:
* Primarily, added new test in test/test_decomp.py
* Updated existing tests, e.g., to NOT expect failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106812
Approved by: https://github.com/eellison
2023-08-11 19:18:29 +00:00
Stephen Jia
8c8477e55a Add _unsafe_index decomp (#106814)
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.

Also add it to the list of core decompositions.

Test Plan: contbuild and OSS CI (similar to D40075277)

Differential Revision: D48163393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
2023-08-10 23:23:37 +00:00
vfdev-5
35a1913370 [inductor] Added affine_grid_generator decomposition (#104709)
Description:
- Added affine_grid_generator decomposition

Related to https://github.com/pytorch/pytorch/issues/104296

Fixes https://github.com/pytorch/pytorch/issues/105565

Perfs:
- speed-up on cuda with bilinear and nearest modes

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git3ed904e) PR-afgg" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cpu ------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           7.467 (+-0.036)            |             11.905 (+-0.276)            |             13.391 (+-0.051)            |     1.125 (+-0.000)      |           7.343 (+-0.036)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           7.722 (+-0.168)            |             14.371 (+-0.035)            |             15.899 (+-0.038)            |     1.106 (+-0.000)      |           7.870 (+-0.043)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |           7.710 (+-0.051)            |             11.354 (+-0.053)            |             13.376 (+-0.045)            |     1.178 (+-0.000)      |           7.698 (+-0.061)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |           7.870 (+-0.050)            |             13.744 (+-0.237)            |             15.206 (+-0.102)            |     1.106 (+-0.000)      |           7.912 (+-0.039)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           4.738 (+-0.015)            |             4.508 (+-0.005)             |             6.566 (+-0.027)             |     1.456 (+-0.000)      |           4.630 (+-0.022)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           4.391 (+-0.010)            |             4.860 (+-0.390)             |             6.438 (+-0.047)             |     1.325 (+-0.000)      |           4.458 (+-0.010)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |           4.279 (+-0.008)            |             4.127 (+-0.010)             |             6.598 (+-0.709)             |     1.599 (+-0.000)      |           5.064 (+-0.025)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |           4.537 (+-0.010)            |             4.593 (+-0.006)             |             6.365 (+-0.104)             |     1.386 (+-0.000)      |           4.480 (+-0.011)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           26.411 (+-0.066)           |             62.275 (+-0.436)            |             64.486 (+-0.353)            |     1.035 (+-0.000)      |           26.210 (+-0.110)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           26.457 (+-0.096)           |             72.887 (+-0.247)            |             74.207 (+-0.337)            |     1.018 (+-0.000)      |           25.995 (+-0.120)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |           26.457 (+-0.086)           |             64.110 (+-0.233)            |             66.340 (+-0.406)            |     1.035 (+-0.000)      |           26.145 (+-0.085)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |           26.536 (+-0.094)           |             73.742 (+-0.483)            |             71.946 (+-0.460)            |     0.976 (+-0.000)      |           26.457 (+-0.166)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cuda -----------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           91.971 (+-0.253)           |             90.570 (+-0.193)            |            137.206 (+-0.214)            |     1.515 (+-0.000)      |           84.280 (+-0.241)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           91.893 (+-0.361)           |             89.866 (+-0.170)            |            136.678 (+-0.471)            |     1.521 (+-0.000)      |           84.573 (+-0.214)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |          116.967 (+-0.481)           |            110.468 (+-0.326)            |            223.770 (+-0.334)            |     2.026 (+-0.000)      |          108.098 (+-0.392)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |          117.563 (+-0.546)           |            111.438 (+-0.212)            |            223.101 (+-0.350)            |     2.002 (+-0.000)      |          108.225 (+-0.395)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           80.706 (+-0.289)           |             70.525 (+-0.204)            |            143.697 (+-0.311)            |     2.038 (+-0.000)      |           74.485 (+-0.258)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           80.955 (+-0.208)           |             69.986 (+-0.250)            |            143.658 (+-0.244)            |     2.053 (+-0.000)      |           74.163 (+-0.238)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |          117.576 (+-0.435)           |             71.179 (+-0.412)            |            178.515 (+-0.539)            |     2.508 (+-0.000)      |          108.394 (+-0.473)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |          117.441 (+-0.205)           |             70.313 (+-0.170)            |            178.664 (+-0.555)            |     2.541 (+-0.000)      |          108.098 (+-0.416)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           92.962 (+-0.509)           |            1740.964 (+-0.597)           |            1785.401 (+-0.369)           |     1.026 (+-0.000)      |           92.638 (+-0.539)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           92.928 (+-0.493)           |            1401.146 (+-0.732)           |            1453.229 (+-0.628)           |     1.037 (+-0.000)      |           92.458 (+-0.428)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |          118.152 (+-0.442)           |            1740.644 (+-0.480)           |            1793.475 (+-0.458)           |     1.030 (+-0.000)      |          107.962 (+-0.548)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |          118.182 (+-0.425)           |            1400.621 (+-0.624)           |            1461.796 (+-0.630)           |     1.044 (+-0.000)      |          107.894 (+-0.994)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230801-220216-affine-grid-sampler-PR-afgg-vs-Nightly-speedup.md), [script](https://github.com/vfdev-5/pth-inductor-dev/blob/master/perf_affine_grid_sampler.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104709
Approved by: https://github.com/lezcano
2023-08-10 09:52:48 +00:00
Andy Rock
aa1b2f16c5 fix upsample_nearest decompositions for uint8 tensors (#106675)
Fixes #106674.

This PR aligns the implementation of `_compute_upsample_nearest_indices` with `UpSampleKernel.cpp`: 68cb854d73/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L1388-L1393)
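The aligned index computation can be sketched in Python as follows (illustrative only; the function name mirrors the decomposition, but the exact clamping and scale handling here are assumptions, not the actual decomposition code):

```python
def compute_upsample_nearest_indices(input_size: int, output_size: int) -> list:
    # Illustrative sketch: for "nearest" mode, each output index maps to
    # floor(dst * scale) with scale = input_size / output_size, clamped to
    # the valid range -- mirroring the loop in UpSampleKernel.cpp.
    scale = input_size / output_size
    return [min(int(dst * scale), input_size - 1) for dst in range(output_size)]
```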
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106675
Approved by: https://github.com/albanD
2023-08-07 01:52:41 +00:00
Kshiteej K
a899333ffc fix: nll_loss batch rule with negative ignore_idx (#106118)
We use the Python decompositions instead of writing our own batching rules.

Fixes https://github.com/pytorch/pytorch/issues/105736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106118
Approved by: https://github.com/lezcano, https://github.com/zou3519
2023-08-04 07:43:02 +00:00
chunyuan
cb6c3cbc91 inductor: enable weight prepack for LSTM (#103071)
- Enabled LSTM weight prepack in inductor.
- Added an mkldnn decomposition for LSTM that does not change across different `seq_lens`. With the previous decomposition, in the dynamic-shapes use case where `seq_lens` changes, the graph would be different.
- Extended several inductor utility functions to support `List[Tensor]` as input. Previously those functions only supported `Tensor` input.

**Update 2023-07-26:**
- https://github.com/pytorch/pytorch/pull/103851 has moved CPU weight packing to be after AOTAutograd. Updated this PR to follow the same approach (mainly in 3b207f7f1c (diff-6dffed1ade0ba3e887f9a4eafa3bfcec267ab2365b8adcb91bd391f49b3fd2e3)).
LSTM is decomposed in `aten.mkldnn_rnn_layer` by layer and by direction. The weight prepack is done at the `mkldnn_rnn_layer` level.
- Added a fix in the RNN `__getstate__` function for the case where we need to recompile an `LSTM` module.
When compiling the module, the weights tensors which are the `named_parameters` of the module are converted to `functional_tensor` here:
76fb72e24a/torch/nn/utils/stateless.py (L125-L128)
The forward function of LSTM will be called:
76fb72e24a/torch/_functorch/aot_autograd.py (L3379-L3381)
In the forward function, the `_flat_weights` are updated to be the same as the weights, thus becoming `functional_tensor`:
76fb72e24a/torch/nn/modules/rnn.py (L775-L778)
The weights tensors are converted back to the original tensors (which are not `functional_tensor` anymore) before exiting the `_reparametrize_module` context here:
76fb72e24a/torch/nn/utils/stateless.py (L130-L142)
But since `_flat_weights` is not in the `named_parameters` of the module, it's still `functional_tensor` ([link of the parameters that will be converted to functional and reverted back](76fb72e24a/torch/_functorch/aot_autograd.py (L3695-L3698))).
At this moment, if we need to recompile the model, `deepcopy` will be called:
76fb72e24a/torch/_dynamo/utils.py (L915-L917)
It will then report `UnImplemented`, since we still have a `functional_tensor` (`_flat_weights`), and will trigger a graph break, which is not what we expect:
76fb72e24a/torch/_subclasses/meta_utils.py (L514)
Added a fix in `__getstate__` to update `_flat_weights` whenever the weights have changed, to fix this issue. The fix is covered by the `test_lstm_packed` UT.
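The shape of that fix can be sketched with a hypothetical stand-in class (this is not the actual `torch.nn.RNNBase` code; only the rebuild-before-serialize idea is taken from the description above):

```python
class TinyRNN:
    """Hypothetical stand-in for torch.nn.RNNBase, only to illustrate the fix."""

    def __init__(self):
        self.weight_ih = [1.0]
        self._flat_weights_names = ["weight_ih"]
        self._flat_weights = ["stale"]  # may hold stale functional tensors

    def __getstate__(self):
        state = self.__dict__.copy()
        # Rebuild _flat_weights from the module's current attributes so the
        # serialized state never carries stale (functional) tensors.
        state["_flat_weights"] = [getattr(self, n) for n in self._flat_weights_names]
        return state
```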

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103071
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-28 13:54:32 +00:00
lezcano
36ae359655 Update matmul decomp to match eager (#105850)
The decomposition was not updated after https://github.com/pytorch/pytorch/pull/95261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105850
Approved by: https://github.com/Chillee
2023-07-26 09:24:51 +00:00
Nikita Karetnikov
45e4706aff [pt2] add decomps for multilabel_margin_loss_forward ops (#105302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105302
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master
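An illustrative before/after of this kind of cleanup (hypothetical example, not code from the PR):

```python
d = {"a": 1, "b": 2}

# Before: the dictionary value is bound but never used in the loop body,
# the pattern this cleanup targets.
keys = []
for k, v in d.items():
    keys.append(k)

# After: iterate over the keys directly.
keys_fixed = list(d)
```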

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Yanbo Liang
8daed86e4e [Inductor] aten.dist decomposition (#105586)
Fixes #105557
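The decomposition presumably amounts to expressing `aten.dist` through ops Inductor already handles: `dist(a, b, p) == vector_norm(a - b, p)`. A quick numerical illustration:

```python
import torch

# dist(a, b, p) is the p-norm of the elementwise difference.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([1.0, 0.0, 0.0])
lhs = torch.dist(a, b, 2)
rhs = torch.linalg.vector_norm(a - b, 2)
```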

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105586
Approved by: https://github.com/desertfire, https://github.com/Chillee
2023-07-20 06:42:44 +00:00
Justin Chu
8a688277a2 [BE] Enable ruff's UP rules and autoformat dynamo / functorch and refs (#105432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105432
Approved by: https://github.com/ezyang
2023-07-19 13:48:44 +00:00
QSHLGZ
07108ff1e8 Fix typos under _decomp directory (#105210)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105210
Approved by: https://github.com/ezyang, https://github.com/Neilblaze
2023-07-17 11:41:30 +00:00
Nikita Karetnikov
7e72126487 [pt2] add decomps for multi_margin_loss ops (#104578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104578
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-07-14 21:16:09 +00:00
Adnan Akhundov
4911b80b8e [inductor] addmm + ReLU / GELU fusion pass (#104132)
Summary:

Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only when `max_autotune_gemm=False` and the activation is fusible.
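The unfused subgraph that the pass matches can be sketched as (illustrative, not the actual pattern-matcher registration):

```python
import torch

def addmm_relu(bias, m1, m2):
    # The unfused subgraph: bias + m1 @ m2 followed by ReLU. Conceptually the
    # pass rewrites this into one aten._addmm_activation(..., use_gelu=False).
    return torch.relu(torch.addmm(bias, m1, m2))
```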

Test Plan:

$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v

(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
  warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok

----------------------------------------------------------------------
Ran 1 test in 13.415s

OK

Reviewers: @eellison


Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
2023-07-10 16:44:14 +00:00
Jerry Zhang
1a661639f7 [quant] Support integer implementations for adaptive_avg_pool2d (#104226)
Summary:
This is needed for representing quantized models in the pt2 export quantization flow

Test Plan:
tested by opinfo, python test/test_ops.py
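One way an integer average pool can be expressed without floating point, as a hypothetical sketch for a fixed 4x4 -> 2x2 case (not the actual decomposition code):

```python
import torch

def int_adaptive_avg_pool2d_2x2(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch: an integer average pool written as an integer sum
    # followed by a rounding integer division, avoiding floating point.
    n, c, h, w = x.shape
    sums = x.to(torch.int32).reshape(n, c, h // 2, 2, w // 2, 2).sum(dim=(3, 5))
    return ((sums + 2) // 4).to(x.dtype)  # divide by kernel area 4, round half up
```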


Pull Request resolved: https://github.com/pytorch/pytorch/pull/104226
Approved by: https://github.com/jgong5, https://github.com/andrewor14
2023-07-07 19:36:31 +00:00
XiaobingSuper
d3589c9456 reduce computation of batch_norm when weight or bias is none (#104616)
In the batch_norm decomposition, if weight or bias is None, we can skip some computations for better performance.
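The idea can be sketched as follows (illustrative, not the actual decomposition code):

```python
import torch

def normalize(x, mean, var, weight=None, bias=None, eps=1e-5):
    # Skip the multiply when weight is None and the add when bias is None;
    # with weight=1 and bias=0 both paths give the same result.
    out = (x - mean) * torch.rsqrt(var + eps)
    if weight is not None:
        out = out * weight
    if bias is not None:
        out = out + bias
    return out
```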
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104616
Approved by: https://github.com/lezcano, https://github.com/desertfire, https://github.com/jgong5
2023-07-06 00:47:41 +00:00