Commit Graph

535 Commits

Author SHA1 Message Date
SS-JIA
7de669f2f9 [core IR] Remove trunc decomp and add trunc to core (#109902)
Following up from [this comment](https://github.com/pytorch/pytorch/pull/109319#discussion_r1330803226). Remove the decomposition for `trunc`, and add it as a core operator.

Going forward, provide similar treatment for operators that map cleanly to hardware instructions.
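
For context, a minimal sketch of the kind of pointwise decomposition being removed here (illustrative only; the decomposition PyTorch actually shipped may have differed in details such as dtype handling):

```python
import torch

# Hypothetical sketch of a trunc decomposition in terms of floor/ceil.
# trunc rounds toward zero: floor for non-negative values, ceil for negatives.
def trunc_decomp(x: torch.Tensor) -> torch.Tensor:
    return torch.where(x >= 0, torch.floor(x), torch.ceil(x))
```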

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109902
Approved by: https://github.com/peterbell10
2023-09-25 18:18:06 +00:00
Jijie Wei
334ead04a9 Back out "[decomp] Fix baddbmm decomposition (#109714)" (#109855)
Summary:
Original commit changeset: 95c462a380c9

Original Phabricator Diff: D49484954

This diff causes a test failure for the deterministic NE test; see: https://www.internalfb.com/sandcastle/job/18014399565419856/

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test -- --exact 'aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test - aps_models.ads.icvr.tests.icvr_fm_e2e_deterministic_ne_test.ICVR_FM_E2EDeterministicNeTest: test_e2e_deterministic_icvr_fm_pt2_fsdp_multi_gpus'

https://www.internalfb.com/intern/testinfra/testrun/16888498605839953

Differential Revision: D49527271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109855
Approved by: https://github.com/yanboliang
2023-09-22 22:01:38 +00:00
Mwiza Kunda
8dedc9dd9b Add meta tests for layer/group/batch norm backward (#109591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109591
Approved by: https://github.com/ezyang
2023-09-21 18:58:51 +00:00
Mwiza Kunda
6b7b9c796e Fix registering jit decompositions for jvp for out wrapped decomps (#109367)
Python decompositions wrapped by `out_wrapper` need to be unwrapped before compiling with TorchScript since:
- `out_wrapper` extends the decomposition's signature with an `out` parameter; however, this `out` parameter is not present in the source code of the original decomposition, so the resulting `ScriptFunction` will not have an `out` parameter
- `out_wrapper` lives in the `torch._prims_common.wrappers` module, so its `globals()` are different from the globals of the decomposition being wrapped. This may cause symbol resolution to fail in the TorchScript compiler, since it compiles the unwrapped decomp's source code rather than the wrapper

The Python decomposition for `aten.trace` is handled here as an example; other decompositions are to be fixed in https://github.com/pytorch/pytorch/pull/107707
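
A minimal sketch of the unwrapping idea, assuming the wrapper follows the `functools.wraps` convention of exposing the original function via `__wrapped__` (an assumption for illustration, not a statement about `out_wrapper`'s internals):

```python
import inspect

import torch

# Unwrap a decorator-wrapped decomposition before scripting it, so TorchScript
# compiles the original source (without the injected `out` parameter) and
# resolves symbols against the decomposition's own globals.
def script_decomp(decomp):
    unwrapped = inspect.unwrap(decomp)  # follows the __wrapped__ chain, if any
    return torch.jit.script(unwrapped)
```
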
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109367
Approved by: https://github.com/lezcano
2023-09-21 16:36:51 +00:00
Peter Bell
6f0cf5a837 [decomp] Decompose unsafe_split{,_with_sizes} into safe variants (#109668)
The "safety" aspect refers to the output not being registered as aliasing the
input, but after AOTAutograd I don't think this distinction matters. However,
we shouldn't use the same decomposition as the safe variant in case the backend
doesn't want to decompose split.
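
A hedged sketch of what routing the unsafe variants to the safe ops can look like (illustrative, not the exact registrations in this PR):

```python
import torch
from torch import Tensor

# After AOTAutograd the aliasing distinction no longer matters, so the
# "unsafe" variants can simply forward to the safe ops.
def unsafe_split_decomp(self: Tensor, split_size: int, dim: int = 0):
    return torch.split(self, split_size, dim)

def unsafe_split_with_sizes_decomp(self: Tensor, split_sizes, dim: int = 0):
    return self.split_with_sizes(split_sizes, dim)
```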

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109668
Approved by: https://github.com/lezcano
ghstack dependencies: #109667
2023-09-20 18:45:56 +00:00
Peter Bell
9e629dd73c [decomp] Add all std and std_mean overloads to core decompositions (#109667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109667
Approved by: https://github.com/lezcano
2023-09-20 18:45:56 +00:00
Peter Bell
36a8105f54 [decomp] Fix baddbmm decomposition (#109714)
The decomposition is currently registered without the pw_cast_for_opmath
decorator, due to the ordering of decorators being meaningful.
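
For reference, the essence of the baddbmm decomposition (a hedged sketch; the real decomp additionally needs `pw_cast_for_opmath` applied in the right order so reduced-precision inputs are upcast for the computation):

```python
import torch

def baddbmm_decomp(self, batch1, batch2, *, beta=1, alpha=1):
    result = alpha * torch.bmm(batch1, batch2)
    # When beta == 0, `self` is ignored entirely (so NaN/Inf in it do not propagate).
    if beta == 0:
        return result
    return beta * self + result
```
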
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714
Approved by: https://github.com/lezcano
2023-09-20 18:40:21 +00:00
Salil Desai
40b2c796dc [Decomposition] baddbmm (#108534)
Summary:
Move the decomposition of baddbmm from _inductor/decomposition.py and include it in core_aten_decompositions.

ff38c0e2f9/torch/_inductor/decomposition.py (L203)

Test Plan: Phabricator + OSS Tests

Differential Revision: D48871741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108534
Approved by: https://github.com/SherlockNoMad
2023-09-20 12:49:32 +00:00
Salil Desai
d0cc623192 [Decomposition] _unsafe_view (#108713)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[9d5eabd7b213d1a356d4e7bb400355d574ea924b]/fbcode/caffe2/torch/_decomp/decompositions.py?lines=3091

Differential Revision: D48619079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108713
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-19 13:37:35 +00:00
Salil Desai
2e721aab98 [Decomposition] Trunc (#109319)
Summary:
Add Decomp for Trunc and add it to core_aten_decompositions

Differential Revision: D49042033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109319
Approved by: https://github.com/SherlockNoMad
2023-09-19 13:30:13 +00:00
Salil Desai
ae66d0b3bf [Decomposition] clamp_max (#108718)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[abda43a5a268e83fef6d62b49531a390ce915ad2]/fbcode/caffe2/torch/_refs/__init__.py?lines=1855

Differential Revision: D48880026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108718
Approved by: https://github.com/SherlockNoMad
2023-09-19 13:25:35 +00:00
Salil Desai
fc47ba2794 [Decomposition] clamp_min (#108717)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[abda43a5a268e83fef6d62b49531a390ce915ad2]/fbcode/caffe2/torch/_refs/__init__.py?lines=1846

Differential Revision: D48880080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108717
Approved by: https://github.com/SherlockNoMad
2023-09-18 12:43:58 +00:00
Salil Desai
a6d4cca7c0 [Decomposition] unsafe_split.Tensor (#108544)
Summary:
Include decomp in core_aten_decompositions

Decomp already exists

https://www.internalfb.com/code/fbsource/[03ff511cad587fc27ed8fd6a54b87845246e8e0c]/fbcode/caffe2/torch/_decomp/decompositions.py?lines=1209

Test Plan: OSS + Phabricator Tests

Differential Revision: D48940445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108544
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-18 12:43:07 +00:00
Salil Desai
af93b29c5e [Decomposition] std.correction (#108733)
Summary:
Include decomp in core_aten_decompositions

Decomp:
https://www.internalfb.com/code/fbsource/[e69bf00ff87a55c9a30bd7905881661ff05fa211]/fbcode/caffe2/torch/_refs/__init__.py?lines=2398

Differential Revision: D48940402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108733
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-18 11:38:23 +00:00
Jez Ng
db48bc80d9 Check index size during decomp of index_add (#108826)
This partially fixes the `test_index_add_correctness` test (#108181)
when run under inductor: it causes an exception to be raised [here][1]
as expected.

The test as a whole still cannot be made to pass under inductor because
the [last assert][2] still fails, likely due to #108798.

[1]: dec2b267d4/test/test_torch.py (L6049)
[2]: dec2b267d4/test/test_torch.py (L6051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108826
Approved by: https://github.com/eellison
2023-09-13 13:06:26 +00:00
Ken Jin
c458fa0d35 Decompose/add reference for view_as_complex (#108005)
Aten source: d4a99631dd/aten/src/ATen/native/ComplexHelper.h (L78)

Documentation reference:
https://pytorch.org/docs/stable/generated/torch.view_as_complex.html

Note: this adds a new primitive `view_of_dtype`, which is trivially implemented, as its meta function is already implemented elsewhere.

Finally, this is not registered as a decomposition (yet), because TorchInductor does not yet support complex types. It should be added once we do.
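
A value-level sketch of the reference semantics (hedged: the actual ref is a true view built on the new `view_of_dtype` prim, whereas `torch.complex` below copies):

```python
import torch

def view_as_complex_sketch(x: torch.Tensor) -> torch.Tensor:
    # The last dimension of size 2 is interpreted as (real, imag) pairs.
    torch._check(x.shape[-1] == 2, lambda: "last dimension must have size 2")
    return torch.complex(x[..., 0], x[..., 1])
```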

Closes https://github.com/pytorch/pytorch/issues/108020 as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108005
Approved by: https://github.com/peterbell10, https://github.com/ezyang
2023-09-07 23:49:20 +00:00
Edward Z. Yang
9f37aec964 Add torch._check_is_size (#108685)
Check the comments for what it does. The key distinction is that if
you feed it an unbacked SymInt, we will also apply a >= 2 assumption
at compile time.
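
A hedged usage sketch of the new helper (the `take_first` function is made up for illustration):

```python
import torch

def take_first(x: torch.Tensor, n: int) -> torch.Tensor:
    # Asserts n is a valid size; for an unbacked SymInt the compiler also
    # gets to assume n >= 2, per the note above.
    torch._check_is_size(n)
    return x[:n]
```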

This will get exercised when I reland
https://github.com/pytorch/pytorch/pull/107788

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108685
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-09-07 12:48:39 +00:00
Sam Larsen
27fe45eaf6 [inductor][easy] Enable Mypy Checking for torch/_inductor/decomposition.py (#108682)
Summary: Looks like one simple type mismatch between `get_decompositions()` and `remove_decompositions()`

Test Plan: `lintrunner torch/_inductor/decomposition.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108682
Approved by: https://github.com/eellison
2023-09-07 00:48:55 +00:00
Huy Do
5a4fe05a15 Revert "Force synced KJT to trace unbacked SymInt (#107788)" (#108684)
This reverts commit 3b92ef814d. Since the bot could not revert it, let's revert it manually instead.

(Not sure why the bot doesn't work on https://github.com/pytorch/pytorch/pull/107788)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108684
Approved by: https://github.com/ezyang
2023-09-06 19:15:45 +00:00
Kimish Patel
ebed490c2f [sdpa decomp] change sdpa decomp to be consistent with flash attention (#108608)
Summary: See the comment in the code for the reasons for the change

Test Plan:
buck2 test executorch/examples/export/test:test_export --
test_vit_export_to_executorch

Differential Revision: D48992180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108608
Approved by: https://github.com/larryliu0820
2023-09-06 15:34:03 +00:00
Edward Z. Yang
3b92ef814d Force synced KJT to trace unbacked SymInt (#107788)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107788
Approved by: https://github.com/voznesenskym
2023-09-06 03:18:26 +00:00
Kimish Patel
cc50e654d4 [aten decomp] Update sdpa decomp (#108371)
Summary:
Earlier, the decomp was routing the _flash* variant to the _math variant, and this
was resulting in failures during torch.export, for a reason that I
couldn't trace.

However, it seems that we should really have a decomp for
scaled_dot_product_attention instead of
scaled_dot_product_flash_attention. Right?

This diff adds that. Plus, it adds a test to check whether the model exported
via two-stage export has decomposed the op. This test needs improvement
to figure out what the core aten opset is and check for anything that is
not inside it.

Test Plan:
test_model_exports_to_core_aten

Differential Revision: [D48917461](https://our.internmc.facebook.com/intern/diff/D48917461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108371
Approved by: https://github.com/larryliu0820
2023-09-03 15:17:08 +00:00
lezcano
239ee76177 Add refs/decomps for dot/vdot (#108194)
Follow-up on https://github.com/pytorch/pytorch/issues/108127#issuecomment-1698142427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108194
Approved by: https://github.com/peterbell10
ghstack dependencies: #108188
2023-08-31 15:30:23 +00:00
rzou
0e4752bafc Allow registering decomps for HigherOrderOp; add decomp for out_dtype (#108080)
We allow registering decomps for HigherOrderOp via the existing decomp
mechanisms:
- I refactored those APIs to accept torch._ops.OperatorBase, which is the base
  class for torch.ops.HigherOrderOperator and torch.ops.OpOverload
- HigherOrderOps must directly call maybe_handle_decomp in their
  ProxyTorchDispatchMode handling in order to resolve decompositions. We
  can change this in the future so that they do not need to do this.

Next, we add an inductor decomp for out_dtype. This decomp shouldn't be
generally available because we want to preserve out_dtype to the backend
for other use cases (i.e. executorch).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108080
Approved by: https://github.com/HDCharles
2023-08-31 03:15:38 +00:00
chilli
39130c7433 Add reinplacing pass for scatters + incremental fake tensor updating (#106192)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106192
Approved by: https://github.com/jansel, https://github.com/eellison
2023-08-30 20:41:37 +00:00
Mengwei Liu
0fb1c05c5a [pytorch] Add decomp rule for scaled_dot_product_attention (#108180)
`scaled_dot_product_attention` used to be decomposed in pre-autograd, given that it calls `_scaled_dot_product_attention_math` and `_scaled_dot_product_attention_math` only has a `CompositeImplicitAutograd` kernel. As a result it's decomposed into ops with finer granularity.

However recent PRs (#103826 #105131) added new logic in `scaled_dot_product_attention` and now it calls `_scaled_dot_product_flash_attention` which contains a CPU kernel. This results in `_scaled_dot_product_flash_attention` showing up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still being decomposed the same way as before, i.e., going through `_scaled_dot_product_attention_math`. Notice that this decomp rule should be excluded by inductor.
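
A hedged sketch of the shape of such a decomposition, routing the public op through the math variant so export sees fine-grained ops (argument handling is simplified relative to the real registration):

```python
import torch
from torch import Tensor

def sdpa_decomp(query: Tensor, key: Tensor, value: Tensor,
                attn_mask=None, dropout_p: float = 0.0, is_causal: bool = False) -> Tensor:
    out, _ = torch.ops.aten._scaled_dot_product_attention_math(
        query, key, value, attn_mask, dropout_p, is_causal)
    return out
```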

Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad
2023-08-30 15:52:08 +00:00
vfdev-5
0cfc5899f9 [inductor] Improved grid_sampler_2d decomposition for cuda (#104710)
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to https://github.com/pytorch/pytorch/issues/104296

Perfs:
- speed-up on cuda (~x5) and cpu (~x2) for bicubic mode

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         38.010 (+-0.118)        |          51.466 (+-1.257)          |             47.867 (+-0.124)            |     0.930 (+-0.000)      |           33.654 (+-0.411)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         35.532 (+-0.236)        |          52.189 (+-0.093)          |             58.979 (+-0.206)            |     1.130 (+-0.000)      |           32.543 (+-0.198)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         38.187 (+-0.112)        |          47.892 (+-0.117)          |             45.833 (+-0.081)            |     0.957 (+-0.000)      |           33.752 (+-0.116)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         36.708 (+-0.244)        |          51.680 (+-0.104)          |             58.360 (+-0.108)            |     1.129 (+-0.000)      |           32.576 (+-0.751)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         24.201 (+-0.088)        |          27.451 (+-0.059)          |             27.937 (+-0.081)            |     1.018 (+-0.000)      |           24.367 (+-0.074)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         19.266 (+-0.105)        |          26.070 (+-0.085)          |             26.092 (+-0.054)            |     1.001 (+-0.000)      |           20.144 (+-0.064)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         24.293 (+-0.125)        |          26.085 (+-0.064)          |             26.575 (+-0.061)            |     1.019 (+-0.000)      |           24.515 (+-0.095)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         19.440 (+-0.075)        |          25.252 (+-0.059)          |             25.259 (+-0.051)            |     1.000 (+-0.000)      |           19.770 (+-0.070)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        114.900 (+-0.508)        |         113.416 (+-1.271)          |            248.679 (+-1.431)            |     2.193 (+-0.000)      |          114.609 (+-0.515)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        115.973 (+-0.555)        |         124.711 (+-1.596)          |            282.187 (+-2.418)            |     2.263 (+-0.000)      |          115.368 (+-0.652)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        111.730 (+-0.562)        |         110.914 (+-0.865)          |            253.899 (+-2.226)            |     2.289 (+-0.000)      |          111.285 (+-1.226)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.859 (+-0.487)        |         131.696 (+-1.298)          |            294.124 (+-1.963)            |     2.233 (+-0.000)      |          110.910 (+-0.969)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |        228.811 (+-0.037)        |          92.990 (+-0.446)          |             92.648 (+-0.286)            |     0.996 (+-0.000)      |          228.274 (+-0.067)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |        222.107 (+-0.076)        |          93.247 (+-0.387)          |             92.528 (+-0.423)            |     0.992 (+-0.000)      |          221.922 (+-0.297)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        235.654 (+-0.055)        |          75.781 (+-0.566)          |            115.865 (+-0.419)            |     1.529 (+-0.000)      |          236.032 (+-0.111)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        226.752 (+-0.088)        |          76.312 (+-0.328)          |            116.468 (+-0.477)            |     1.526 (+-0.000)      |          226.950 (+-0.027)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        225.540 (+-0.013)        |          75.638 (+-0.341)          |             72.621 (+-0.292)            |     0.960 (+-0.000)      |          225.937 (+-0.017)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        217.425 (+-0.024)        |          75.484 (+-0.545)          |             73.518 (+-0.296)            |     0.974 (+-0.000)      |          217.793 (+-0.008)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        231.474 (+-0.020)        |          75.972 (+-0.339)          |             73.030 (+-0.387)            |     0.961 (+-0.000)      |          231.991 (+-0.184)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        223.408 (+-0.016)        |          75.622 (+-0.279)          |             73.542 (+-0.336)            |     0.973 (+-0.000)      |          223.893 (+-0.021)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        319.382 (+-0.023)        |         149.060 (+-0.190)          |            772.116 (+-0.266)            |     5.180 (+-0.000)      |          320.549 (+-0.387)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        319.987 (+-0.134)        |         154.443 (+-0.014)          |            797.651 (+-0.232)            |     5.165 (+-0.000)      |          320.665 (+-0.397)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        326.138 (+-0.439)        |         149.092 (+-0.036)          |            772.508 (+-0.259)            |     5.181 (+-0.000)      |          325.751 (+-0.398)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        326.024 (+-0.118)        |         154.452 (+-0.209)          |            797.756 (+-0.229)            |     5.165 (+-0.000)      |          326.870 (+-0.372)

Times are in microseconds (us).

```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230828-134459-affine-grid-sampler-PR-vs-Nightly-speedup.md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104710
Approved by: https://github.com/lezcano
2023-08-29 05:54:24 +00:00
Sam Larsen
20f3808aa2 Implement decomposition for aten.tensor_split.tensor_indices_or_sections (#107251)
Summary: Before this change, the tensor_indices_or_sections variant of aten.tensor_split causes a `RuntimeError: The tensor has a non-zero number of elements` due to that operation needing to introspect data. Decomposing into one of the other two tensor_split variants fixes the problem.
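
A hedged sketch of the redirect (the `.item()`/`.tolist()` extraction here is an assumption about how the split points get pulled out of the tensor argument):

```python
import torch

def tensor_split_decomp(x, tensor_indices_or_sections, dim=0):
    if tensor_indices_or_sections.dim() == 0:
        # 0-d tensor holds a section count -> int overload
        return torch.tensor_split(x, tensor_indices_or_sections.item(), dim)
    # 1-d tensor holds split indices -> list overload
    return torch.tensor_split(x, tensor_indices_or_sections.tolist(), dim)
```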

Test Plan:
Enabled tensor_split tests in test/inductor/test_torchinductor_opinfo.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107251
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-08-28 17:01:23 +00:00
ssjia
86f9fec3ac Avoid decomposing _unsafe_index in Inductor (#107882)
`_unsafe_index` was previously added to the core ATen decomp table in https://github.com/pytorch/pytorch/pull/106814, but this has performance ramifications for Inductor. Therefore, this diff removes it from the decomposition table used by Inductor.

Differential Revision: [D48649210](https://our.internmc.facebook.com/intern/diff/D48649210/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107882
Approved by: https://github.com/SherlockNoMad
2023-08-25 04:51:53 +00:00
Vishwa Raj Singh
35de780aa6 Fix Inplace tensor update on transpose (#104689)
Fixes https://github.com/pytorch/pytorch/issues/103650

- To align with the HPU device backend architecture, ensure all non-view ops return contiguous fake tensor outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104689
Approved by: https://github.com/ezyang
2023-08-24 16:58:50 +00:00
Andrew Or
64d5851b1f make python decomp for native_batch_norm CompositeImplicitAutograd, remove native_batch_norm from core aten opset (#107791)
Summary:
(From Brian Hirsh)

Description copied from what I put in a comment in this PR: https://github.com/pytorch/pytorch/pull/106329

So, the slightly-contentious idea behind this PR is that lower in the stack, I updated torch._decomps.get_decomps() to check not only the decomp table to see if a given op has a decomposition available, but to also check the dispatcher for any decomps registered to the CompositeImplicitAutograd key (link: https://github.com/pytorch/pytorch/pull/105865/files#diff-7008e894af47c01ee6b8eb94996363bd6c5a43a061a2c13a472a2f8a9242ad43R190)

There's one problem though: we don't actually make any hard guarantees that a given key in the dispatcher does or does not point to a decomposition. We do rely pretty heavily, however, on the fact that everything registered to the CompositeImplicitAutograd key is in fact a decomposition into other ops.

QAT would like this API to faithfully return "the set of all decomps that would have run if we had traced through the dispatcher". However, native_batch_norm is an example of an op that has a pre-autograd decomp registered to it (through op.py_impl()), but the decomp is registered directly to the Autograd key instead of being registered to the CompositeImplicitAutograd key.

If we want to provide a guarantee to QAT that they can programmatically access all decomps that would have run during tracing, then we need to make sure that every decomp we register to the Autograd key is also registered to the CompositeImplicitAutograd key.

This might sound kind of painful (since it requires auditing), but I think in practice this basically only applies to native_batch_norm.

Test Plan: python test/test_decomp.py

Differential Revision: D48607575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107791
Approved by: https://github.com/jerryzh168, https://github.com/SherlockNoMad
2023-08-24 15:19:07 +00:00
Sherlock Huang
ee4b99cc3a Decomp for aten.dropout (#106274)
When exporting dropout with a CPU tensor, we get the following graph module:
```
    class GraphModule(torch.nn.Module):
        def forward(self, arg0_1: f32[512, 10]):
            empty_memory_format: f32[512, 10] = torch.ops.aten.empty.memory_format([512, 10], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False, memory_format = torch.contiguous_format)
            bernoulli_p: f32[512, 10] = torch.ops.aten.bernoulli.p(empty_memory_format, 0.9);  empty_memory_format = None
            div_scalar: f32[512, 10] = torch.ops.aten.div.Scalar(bernoulli_p, 0.9);  bernoulli_p = None
            mul_tensor: f32[512, 10] = torch.ops.aten.mul.Tensor(arg0_1, div_scalar);  arg0_1 = div_scalar = None
            return (mul_tensor,)
```

In addition, if we export in eval() mode, we get an empty graph.

However, when exporting with a CUDA tensor, we get:
```
    class GraphModule(torch.nn.Module):
        def forward(self, arg0_1: f32[512, 10]):
            native_dropout_default = torch.ops.aten.native_dropout.default(arg0_1, 0.1, True);  arg0_1 = None
            getitem: f32[512, 10] = native_dropout_default[0];  native_dropout_default = None
            return (getitem,)
```
and exporting under eval() mode still leaves a dropout node in the graph.

This PR makes exporting with a CPU tensor also produce aten.native_dropout.
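
For reference, an illustrative decomposition of dropout in terms of a Bernoulli mask, matching the shape of the CPU-exported graph above (not the exact decomp registered by this PR, which targets aten.native_dropout):

```python
import torch

def dropout_decomp(x: torch.Tensor, p: float, train: bool) -> torch.Tensor:
    if not train or p == 0.0:
        return x.clone()
    keep_prob = 1.0 - p                               # e.g. 0.9 for p = 0.1
    mask = torch.empty_like(x).bernoulli_(keep_prob)  # 1 with prob keep_prob
    return x * mask / keep_prob                       # ignores the p == 1 edge case
```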

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106274
Approved by: https://github.com/ezyang
2023-08-23 21:12:37 +00:00
Edward Z. Yang
5673c0874c Use expect_true to make split with unbacked sizes work. (#106788)
This pattern shows up in torchrec KeyedJaggedTensor.  Most
of the change in this PR is mechanical: whenever we failed
an unbacked symint test due to just error checking, replace the
conditional with something that calls expect_true (e.g.,
torch._check or TORCH_SYM_CHECK).
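
A hedged sketch of the replacement pattern (the function below is made up for illustration):

```python
import torch

def split_first(x: torch.Tensor, n: int):
    # Instead of `if n > x.size(0): raise ...`, which would guard on an
    # unbacked SymInt, defer the check to runtime via torch._check.
    torch._check(n <= x.size(0), lambda: "split point exceeds tensor length")
    return x[:n], x[n:]
```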

Some of the changes are a bit more nuanced; I've commented on the PR
accordingly.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106788
Approved by: https://github.com/lezcano
ghstack dependencies: #106720
2023-08-15 20:31:30 +00:00
lezcano
2c5f96deac [Inductor] Make softshrink composite implicit (#107052)
The backward is pretty much equivalent to the one we had written

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107052
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038, #107039, #107051
2023-08-14 21:01:50 +00:00
lezcano
3b1254e800 Make hardshrink's decomp composite implicit (#107039)
The generated code is the same
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107039
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038
2023-08-14 21:01:50 +00:00
Sam Larsen
e165938853 Implement decomposition for aten.rrelu_with_noise (#106812)
Test Plan:
* Primarily, added new test in test/test_decomp.py
* Updated existing tests, e.g., to NOT expect failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106812
Approved by: https://github.com/eellison
2023-08-11 19:18:29 +00:00
Stephen Jia
8c8477e55a Add _unsafe_index decomp (#106814)
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.

Also add it to the list of core decompositions.
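
A minimal sketch of the redirect (illustrative; the registered decomposition may differ in how it is declared):

```python
import torch

# _unsafe_index is index without bounds checking, so the decomposition can
# simply forward to aten.index.
def unsafe_index_decomp(x, indices):
    return torch.ops.aten.index.Tensor(x, indices)
```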

Test Plan: contbuild and OSS CI (similar to D40075277)

Differential Revision: D48163393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
2023-08-10 23:23:37 +00:00
vfdev-5
35a1913370 [inductor] Added affine_grid_generator decomposition (#104709)
Description:
- Added affine_grid_generator decomposition

Related to https://github.com/pytorch/pytorch/issues/104296

Fixes https://github.com/pytorch/pytorch/issues/105565

Perfs:
- speed-up on cuda with bilinear and nearest modes

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git3ed904e) PR-afgg" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cpu ------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           7.467 (+-0.036)            |             11.905 (+-0.276)            |             13.391 (+-0.051)            |     1.125 (+-0.000)      |           7.343 (+-0.036)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           7.722 (+-0.168)            |             14.371 (+-0.035)            |             15.899 (+-0.038)            |     1.106 (+-0.000)      |           7.870 (+-0.043)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |           7.710 (+-0.051)            |             11.354 (+-0.053)            |             13.376 (+-0.045)            |     1.178 (+-0.000)      |           7.698 (+-0.061)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |           7.870 (+-0.050)            |             13.744 (+-0.237)            |             15.206 (+-0.102)            |     1.106 (+-0.000)      |           7.912 (+-0.039)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           4.738 (+-0.015)            |             4.508 (+-0.005)             |             6.566 (+-0.027)             |     1.456 (+-0.000)      |           4.630 (+-0.022)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           4.391 (+-0.010)            |             4.860 (+-0.390)             |             6.438 (+-0.047)             |     1.325 (+-0.000)      |           4.458 (+-0.010)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |           4.279 (+-0.008)            |             4.127 (+-0.010)             |             6.598 (+-0.709)             |     1.599 (+-0.000)      |           5.064 (+-0.025)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |           4.537 (+-0.010)            |             4.593 (+-0.006)             |             6.365 (+-0.104)             |     1.386 (+-0.000)      |           4.480 (+-0.011)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           26.411 (+-0.066)           |             62.275 (+-0.436)            |             64.486 (+-0.353)            |     1.035 (+-0.000)      |           26.210 (+-0.110)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           26.457 (+-0.096)           |             72.887 (+-0.247)            |             74.207 (+-0.337)            |     1.018 (+-0.000)      |           25.995 (+-0.120)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |           26.457 (+-0.086)           |             64.110 (+-0.233)            |             66.340 (+-0.406)            |     1.035 (+-0.000)      |           26.145 (+-0.085)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |           26.536 (+-0.094)           |             73.742 (+-0.483)            |             71.946 (+-0.460)            |     0.976 (+-0.000)      |           26.457 (+-0.166)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cuda -----------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           91.971 (+-0.253)           |             90.570 (+-0.193)            |            137.206 (+-0.214)            |     1.515 (+-0.000)      |           84.280 (+-0.241)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           91.893 (+-0.361)           |             89.866 (+-0.170)            |            136.678 (+-0.471)            |     1.521 (+-0.000)      |           84.573 (+-0.214)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |          116.967 (+-0.481)           |            110.468 (+-0.326)            |            223.770 (+-0.334)            |     2.026 (+-0.000)      |          108.098 (+-0.392)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |          117.563 (+-0.546)           |            111.438 (+-0.212)            |            223.101 (+-0.350)            |     2.002 (+-0.000)      |          108.225 (+-0.395)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           80.706 (+-0.289)           |             70.525 (+-0.204)            |            143.697 (+-0.311)            |     2.038 (+-0.000)      |           74.485 (+-0.258)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           80.955 (+-0.208)           |             69.986 (+-0.250)            |            143.658 (+-0.244)            |     2.053 (+-0.000)      |           74.163 (+-0.238)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |          117.576 (+-0.435)           |             71.179 (+-0.412)            |            178.515 (+-0.539)            |     2.508 (+-0.000)      |          108.394 (+-0.473)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |          117.441 (+-0.205)           |             70.313 (+-0.170)            |            178.664 (+-0.555)            |     2.541 (+-0.000)      |          108.098 (+-0.416)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           92.962 (+-0.509)           |            1740.964 (+-0.597)           |            1785.401 (+-0.369)           |     1.026 (+-0.000)      |           92.638 (+-0.539)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           92.928 (+-0.493)           |            1401.146 (+-0.732)           |            1453.229 (+-0.628)           |     1.037 (+-0.000)      |           92.458 (+-0.428)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |          118.152 (+-0.442)           |            1740.644 (+-0.480)           |            1793.475 (+-0.458)           |     1.030 (+-0.000)      |          107.962 (+-0.548)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |          118.182 (+-0.425)           |            1400.621 (+-0.624)           |            1461.796 (+-0.630)           |     1.044 (+-0.000)      |          107.894 (+-0.994)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230801-220216-affine-grid-sampler-PR-afgg-vs-Nightly-speedup.md), [script](https://github.com/vfdev-5/pth-inductor-dev/blob/master/perf_affine_grid_sampler.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104709
Approved by: https://github.com/lezcano
2023-08-10 09:52:48 +00:00
Andy Rock
aa1b2f16c5 fix upsample_nearest decompositions for uint8 tensors (#106675)
Fixes #106674.

This PR aligns the implementation of `_compute_upsample_nearest_indices` with `UpSampleKernel.cpp`: 68cb854d73/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L1388-L1393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106675
Approved by: https://github.com/albanD
2023-08-07 01:52:41 +00:00
Kshiteej K
a899333ffc fix: nll_loss batch rule with negative ignore_idx (#106118)
We use the Python decompositions instead of writing our own batching rules.

Fixes https://github.com/pytorch/pytorch/issues/105736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106118
Approved by: https://github.com/lezcano, https://github.com/zou3519
2023-08-04 07:43:02 +00:00
chunyuan
cb6c3cbc91 inductor: enable weight prepack for LSTM (#103071)
- Enabled LSTM weight prepack in inductor.
- Added a mkldnn decomposition for lstm which won't change for different `seq_lens`. With the previous decomposition, for dynamic shapes use case where `seq_lens` changes, the graph will be different.
- Extended several inductor utility functions to support `List[Tensor]` as input. Previously those functions only supported `Tensor` input.

**Update 2023-07-26:**
- https://github.com/pytorch/pytorch/pull/103851 has moved CPU weight packing to be after AOTAutograd. Fixed the support in this PR to follow the same way (mainly in 3b207f7f1c (diff-6dffed1ade0ba3e887f9a4eafa3bfcec267ab2365b8adcb91bd391f49b3fd2e3)).
LSTM is decomposed in `aten.mkldnn_rnn_layer` by layer and by direction. The weight prepack is done at the `mkldnn_rnn_layer` level.
- Add a fix in rnn `__get_state__` function in case we need to recompile an `LSTM` module.
When compiling the module, the weights tensors which are the `named_parameters` of the module are converted to `functional_tensor` here:
76fb72e24a/torch/nn/utils/stateless.py (L125-L128)
The forward function of LSTM will be called:
76fb72e24a/torch/_functorch/aot_autograd.py (L3379-L3381)
In the forward function, the `_flat_weights` are updated to be the same as the weights, thus becoming `functional_tensor`:
76fb72e24a/torch/nn/modules/rnn.py (L775-L778)
The weights tensors are converted back to the original tensors (which are not `functional_tensor` anymore) before exiting the `_reparametrize_module` context here:
76fb72e24a/torch/nn/utils/stateless.py (L130-L142)
But since `_flat_weights` is not in the `named_parameters` of the module, it's still `functional_tensor` ([link of the parameters that will be converted to functional and reverted back](76fb72e24a/torch/_functorch/aot_autograd.py (L3695-L3698))).
At this moment, if we need to recompile the model, `deepcopy` will be called:
76fb72e24a/torch/_dynamo/utils.py (L915-L917)
And it will report `UnImplemented` since we have a `functional_tensor` (`_flat_weights`) and will trigger a graph break, which is not what we expect:
76fb72e24a/torch/_subclasses/meta_utils.py (L514)
Added a fix in `__get_state__` to update `_flat_weights` if the weights have ever changed, to fix this issue. The fix is covered in the `test_lstm_packed` UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103071
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-28 13:54:32 +00:00
lezcano
36ae359655 Update matmul decomp to match eager (#105850)
The decomposition was not updated after https://github.com/pytorch/pytorch/pull/95261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105850
Approved by: https://github.com/Chillee
2023-07-26 09:24:51 +00:00
Nikita Karetnikov
45e4706aff [pt2] add decomps for multilabel_margin_loss_forward ops (#105302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105302
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
angelayi
fed8d3608d Update core aten decomp table (#105673)
Updated the decomposition table based on the existing [Core ATen IR](https://pytorch.org/docs/stable/ir.html) list, and moved rest of decompositions to inductor's decomposition table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105673
Approved by: https://github.com/SherlockNoMad
2023-07-21 02:45:37 +00:00
Yanbo Liang
8daed86e4e [Inductor] aten.dist decomposition (#105586)
Fixes #105557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105586
Approved by: https://github.com/desertfire, https://github.com/Chillee
2023-07-20 06:42:44 +00:00
Justin Chu
8a688277a2 [BE] Enable ruff's UP rules and autoformat dynamo / functorch and refs (#105432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105432
Approved by: https://github.com/ezyang
2023-07-19 13:48:44 +00:00
QSHLGZ
07108ff1e8 Fix typos under _decomp directory (#105210)
Fix typos under _decomp directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105210
Approved by: https://github.com/ezyang, https://github.com/Neilblaze
2023-07-17 11:41:30 +00:00
Peter Bell
9adfaf8807 [inductor] Add lowering for aten.unfold (#105165)
The decomposition for unfold uses `as_strided`, which forces the input to be
realized. Instead, this implements it as a `GenericView` with reindexing,
which removes the need to realize, though it does call `mark_reuse` in case
the input computation is expensive and the windows overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105165
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-16 13:09:23 +00:00
William Wen
5cd861fcf7 Add empty/empty_like to core aten decomps (#105158)
Fixes https://github.com/pytorch/pytorch/issues/104871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105158
Approved by: https://github.com/SherlockNoMad
2023-07-15 18:48:55 +00:00
Nikita Karetnikov
7e72126487 [pt2] add decomps for multi_margin_loss ops (#104578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104578
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-07-14 21:16:09 +00:00
Adnan Akhundov
4911b80b8e [inductor] addmm + ReLU / GELU fusion pass (#104132)
Summary:

Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only when `max_autotune_gemm=False` and the activation is fusible.

Test Plan:

$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v

(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
  warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok

----------------------------------------------------------------------
Ran 1 test in 13.415s

OK

Reviewers: @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
2023-07-10 16:44:14 +00:00
Jerry Zhang
1a661639f7 [quant] Support integer implementations for adaptive_avg_pool2d (#104226)
Summary:
This is needed for representing quantized model in pt2 export quantization flow

Test Plan:
tested by opinfo, python test/test_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104226
Approved by: https://github.com/jgong5, https://github.com/andrewor14
2023-07-07 19:36:31 +00:00
XiaobingSuper
d3589c9456 reduce computation of batch_norm when weight or bias is none (#104616)
For the batch_norm decomposition, if weight or bias is None, we can skip some computation for better performance.
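
An illustrative fragment of the optimization (a sketch, not the decomposition source): skip the affine multiply/add entirely when weight or bias is None instead of materializing ones/zeros tensors.

```python
def apply_affine(normalized, weight, bias):
    out = normalized
    if weight is not None:
        out = out * weight   # skipped entirely when weight is None
    if bias is not None:
        out = out + bias     # skipped entirely when bias is None
    return out
```
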
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104616
Approved by: https://github.com/lezcano, https://github.com/desertfire, https://github.com/jgong5
2023-07-06 00:47:41 +00:00
Peter Bell
5c580a9846 [decomp] Add test tracking core ATen operators (#104262)
This adds an expect-test that finds the set of core ATen operators by
subtracting the operators with decomposition in core_aten_decompositions from the
set of all operators that have decompositions and could be decomposed.

This is useful because if you add a new decomposition but forget to add it to
the list of core decompositions, it will appear in the PR diff.

Also, by going through this list I have identified some operators where the
functional variant is decomposed, but not the inplace variant which must be an
oversight.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104262
Approved by: https://github.com/lezcano
2023-07-04 16:41:44 +00:00
David Berard
0b62aca726 Don't decompose aten.bucketize (#104396)
torch.bucketize takes a tensor of values, and a "boundaries" tensor, which is a sorted list of values that represent buckets. It returns the bucket that each value lies in. E.g. if values = [1, 5, 3, 6] and boundaries=[0, 2, 4, 6, 8], the output will be [1, 3, 2, 4].

The current decomposition of this op doesn't work well with dynamic shapes. It performs a binary search, which bakes the number of iterations of the binary search into the graph and requires recompiling (I don't completely understand why/where this happens). I can't think of a good way to write a decomposition for this op that will work with dynamic shapes.
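
A rough sketch (hedged, not the removed decomposition verbatim) of why this is hard under dynamic shapes: a vectorized binary search whose iteration count is derived from `boundaries.numel()`, baking that size into the compiled graph.

```python
import torch

def bucketize_sketch(values: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    n = boundaries.numel()
    lo = torch.zeros(values.shape, dtype=torch.int64, device=values.device)
    hi = torch.full(values.shape, n, dtype=torch.int64, device=values.device)
    for _ in range(n.bit_length()):          # shape-dependent loop length
        active = lo < hi
        mid = (lo + hi) // 2
        mid_val = boundaries[mid.clamp(max=n - 1)]
        go_right = active & (mid_val <= values)
        lo = torch.where(go_right, mid + 1, lo)
        hi = torch.where(active & ~go_right, mid, hi)
    return lo  # counts boundaries <= value, matching the example above
```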

Use case: this op is very similar to some operations needed by jagged tensors. As a first step, I want to add a lowering for aten.bucketize and make use of opinfos. #104007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104396
Approved by: https://github.com/Chillee
2023-06-30 05:05:08 +00:00
Peter Bell
8b418f197c [decomp] Add decomposition for torch.renorm (#103858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103858
Approved by: https://github.com/ezyang, https://github.com/nkaretnikov
2023-06-21 20:57:43 +00:00
Peter Bell
591981c5e2 [inductor] Lower diagonal, diagonal_copy and diagonal_scatter (#103755)
Currently these are decomposed into `as_strided`, which forces a buffer to be
realized. Instead, this lowers them into a native inductor view node and so
doesn't require any buffers to be realized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103755
Approved by: https://github.com/jansel
2023-06-21 20:16:24 +00:00
Peter Bell
a61096fb94 [decomp] Decompose logaddexp2 (#103765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103765
Approved by: https://github.com/Chillee
2023-06-21 20:16:24 +00:00
Peter Bell
61cd605813 [decomp] Don't call .item() in aten.fill.Tensor decomp (#103880)
Currently calling the fill.Tensor overload under `torch.compile` results in a
`DataDependentOutputException` due to the `.item()` call. This instead does a
device-device copy which can then be inlined into subsequent inductor kernels as
you would expect, e.g.

```python
def fn(a):
    result = torch.deg2rad(a).sin()
    return torch.empty((128, 128), device=a.device).fill_(result)
```

generates the single kernel
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16384
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset  + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (0))
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
    tmp2 = 0.017453292519943295
    tmp3 = tmp1 * tmp2
    tmp4 = tl.sin(tmp3)
    tl.store(out_ptr0 + (x0), tmp4, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103880
Approved by: https://github.com/Chillee
2023-06-21 18:45:04 +00:00
Kurt Mohler
ee83c646bb Replace _prims_common.check with torch._check* (#103240)
This relands most of the changes from #102219 which were backed out by #103128. However, instead of removing `_prims_common.check`, it adds a warning and a comment mentioning that it will be removed in the future and `torch._check*` should be used instead. As mentioned in https://github.com/pytorch/pytorch/pull/103128#pullrequestreview-1466414415, `_prims_common.check` cannot yet be removed because of some internal usage

Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103240
Approved by: https://github.com/albanD
2023-06-21 00:46:17 +00:00
PyTorch MergeBot
7b6dc72ffa Revert "[decomp] Decompose logaddexp2 (#103765)"
This reverts commit bab21d20eb.

Reverted https://github.com/pytorch/pytorch/pull/103765 on behalf of https://github.com/ezyang due to looks like land race ([comment](https://github.com/pytorch/pytorch/pull/103765#issuecomment-1599030496))
2023-06-20 15:35:02 +00:00
Peter Bell
bab21d20eb [decomp] Decompose logaddexp2 (#103765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103765
Approved by: https://github.com/Chillee
2023-06-20 09:24:21 +00:00
Ivan Zaitsev
821493715c Back out "Remove check from _prims_common, replace with torch._check* (#102219)", Back out "Forwatd fix for D46427687" (#103128)
Test Plan: revertitparrot

Reviewed By: malfet

Differential Revision: D46506433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103128
Approved by: https://github.com/malfet
2023-06-07 01:41:41 +00:00
Kurt Mohler
a84bb2709a Remove check from _prims_common, replace with torch._check* (#102219)
Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102219
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-06-03 02:23:21 +00:00
PyTorch MergeBot
a7efa0ce35 Revert "Remove check from _prims_common, replace with torch._check* (#102219)"
This reverts commit fb79d43649.

Reverted https://github.com/pytorch/pytorch/pull/102219 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/5158949959/jobs/9293466925 ([comment](https://github.com/pytorch/pytorch/pull/102219#issuecomment-1574245414))
2023-06-02 20:00:48 +00:00
Kurt Mohler
fb79d43649 Remove check from _prims_common, replace with torch._check* (#102219)
Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102219
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-06-02 19:13:45 +00:00
Aleksandar Samardžić
51e0f9e858 Add missing decompositons/lowerings for logical/bitwise operators (#102566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102566
Approved by: https://github.com/lezcano, https://github.com/alexsio27444, https://github.com/jgong5
2023-06-02 14:27:17 +00:00
Nikita Karetnikov
c3ea8cc58b [pt2] convert out params in register_meta (#101344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101344
Approved by: https://github.com/lezcano
2023-05-27 18:38:52 +00:00
Animesh Jain
c2093de5d9 [partitioner] fix for rng ops (#102123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102123
Approved by: https://github.com/Chillee
2023-05-25 00:35:07 +00:00
Peter Bell
ce42010722 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-24 22:17:32 +00:00
vfdev-5
e3d97b6213 [inductor] Added smooth_l1_loss refs (#102077)
Added `smooth_l1_loss` to refs + tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102077
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-24 15:07:08 +00:00
Matthew Hoffman
29da75cc55 Enable mypy allow redefinition (#102046)
Related #101528

I tried to enable this in another PR but it uncovered a bunch of type errors: https://github.com/pytorch/pytorch/actions/runs/4999748262/jobs/8956555243?pr=101528#step:10:1305

The goal of this PR is to fix these errors.

---

This PR enables [allow_redefinition = True](https://mypy.readthedocs.io/en/stable/config_file.html#confval-allow_redefinition) in `mypy.ini`, which allows for a common pattern:

> Allows variables to be redefined with an arbitrary type, as long as the redefinition is in the same block and nesting level as the original definition.

`allow_redefinition` allows mypy to be more flexible by allowing reassignment to an existing variable with a different type... for instance (from the linked PR):

4a1e9230ba/torch/nn/parallel/data_parallel.py (L213)

A `Sequence[Union[int, torch.device]]` is narrowed to `Sequence[int]` thru reassignment to the same variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102046
Approved by: https://github.com/ezyang
2023-05-24 07:05:30 +00:00
PyTorch MergeBot
5147fe4969 Revert "[inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)"
This reverts commit b9721bd705.

Reverted https://github.com/pytorch/pytorch/pull/101812 on behalf of https://github.com/osalpekar due to Causing test_nn_cuda tests to crash during runtime. More details at [D46093942](https://www.internalfb.com/diff/D46093942) ([comment](https://github.com/pytorch/pytorch/pull/101812#issuecomment-1560238085))
2023-05-23 23:06:21 +00:00
Peter Bell
b9721bd705 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-22 20:39:18 +00:00
Jason Ansel
0c6f409cda [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-20 03:43:33 +00:00
lezcano
1930428d89 Minor improvement on the decomposition of upsample_bilinear (#101682)
This is how it's done in core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101682
Approved by: https://github.com/ngimel
2023-05-18 16:51:51 +00:00
Peter Bell
66e398951a [inductor/decomp] Add aten._unsafe_index to disable range checks (#101602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101602
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-17 23:36:24 +00:00
PyTorch MergeBot
5f07c589b0 Revert "[inductor] Refactor RNG operators (#100064)"
This reverts commit 3bbf0683a1.

Reverted https://github.com/pytorch/pytorch/pull/100064 on behalf of https://github.com/izaitsevfb due to breaks inductor tests, see D45936056 ([comment](https://github.com/pytorch/pytorch/pull/100064#issuecomment-1552093728))
2023-05-17 21:16:41 +00:00
Jason Ansel
3bbf0683a1 [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-17 01:29:31 +00:00
Thibaut Durand
01da732691 Fix type annotation of torch.split (#100655)
The type annotation indicates `list` but the returned type is `tuple`
```python
>>> import torch
>>> type(torch.arange(10).split(4))
<class 'tuple'>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100655
Approved by: https://github.com/kit1980
2023-05-16 21:35:41 +00:00
Jiong Gong
788ff0623b [decomp] fix decomp of batch_norm when weight/bias is not flattened (#101059)
Fix https://github.com/pytorch/pytorch/issues/100970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101059
Approved by: https://github.com/ezyang
2023-05-16 00:00:34 +00:00
Animesh Jain
e1021ec535 [decomp] Bad accuracy for elu_backward (#100284)
Accuracy is tested by the full model at https://github.com/pytorch/pytorch/issues/100061
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100284
Approved by: https://github.com/ngimel
2023-04-29 04:21:20 +00:00
Bin Bao
b66d7007d8 Add aten.smooth_l1_loss_backward to core_aten_decompositions (#100267)
Summary: https://github.com/pytorch/pytorch/pull/100242 didn't cover all
test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100267
Approved by: https://github.com/jansel
2023-04-28 19:32:17 +00:00
yhl48
07c02b9e92 Add vmap support for smooth_l1_loss_backward (#99429)
Follow-up of #98357
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99429
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-04-28 10:58:07 +00:00
Animesh Jain
a8ad0dc333 [philox_rand] Add decomps (#100206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100206
Approved by: https://github.com/ngimel
2023-04-28 02:20:13 +00:00
Angela Yi
d06b93b0c7 Decompose arange.default to arange.start_step (#99739)
The aten op arange.default is not in the core aten IR, and should decompose into the arange.start_step op.
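A minimal sketch of what such a decomposition amounts to (hypothetical function name; the real registration lives in the decomposition tables):

```python
import torch

def arange_default_decomp(end, *, dtype=None, device=None):
    # aten.arange.default(end) is arange.start_step with start=0, step=1
    return torch.ops.aten.arange.start_step(0, end, 1, dtype=dtype, device=device)

assert torch.equal(torch.arange(5), arange_default_decomp(5))
```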
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99739
Approved by: https://github.com/SherlockNoMad
2023-04-27 19:06:36 +00:00
Animesh Jain
539363a873 [inductor] Lowering of rngprims philox_rand (#99289)
An example graph with Dynamic shapes on

`arg0_1` is seed, `arg1_1` is base offset.
~~~
  ===== Forward graph 0 =====
 <eval_with_key>.5 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: f32[s0]):
        # File: /scratch/anijain/work/pytorch/test/inductor/test_torchinductor.py:4605, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0] = torch.ops.aten.mul.Tensor(getitem, arg3_1);  getitem = arg3_1 = None

        # File: /scratch/anijain/work/pytorch/test/inductor/test_torchinductor.py:4606, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        add_5: i64[] = torch.ops.aten.add.Tensor(add_4, 3);  add_4 = None
        div: i64[] = torch.ops.aten.div.Tensor_mode(add_5, 4, rounding_mode = 'floor');  add_5 = None
        mul_2: i64[] = torch.ops.aten.mul.Tensor(div, 4);  div = None
        return (mul_1, mul_2)

~~~

Note that the output `mul_2` is basically the total `numel` of the random ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99289
Approved by: https://github.com/jansel
2023-04-26 01:22:41 +00:00
Animesh Jain
6bc4651193 [philox_rand] Dynamic shape support (#99290)
Extends the functionalization of RNG work to dynamic shapes. An example of the generated graph looks like this:

~~~

[2023-04-24 21:41:37,446] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
 ===== Forward graph 1 =====
 <eval_with_key>.7 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: Sym(s1), arg4_1: f32[s0, s1]):
        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:46, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0, s1] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem, arg4_1);  getitem = arg4_1 = None

        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:47, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg3_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0, s1] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        return (mul_1, add_4)

 ~~~

Each rand op is accompanied by its offset calculation op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99290
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-25 22:40:28 +00:00
XiaobingSuper
41069f2faa inductor: align inductor behavior with eager mode for split_with_sizes (#99702)
Fix https://github.com/pytorch/pytorch/issues/99686. In eager mode, if the given sizes do not meet the requirements, an error is reported, but inductor can still run. We should align inductor's behavior with eager mode; after this PR the behavior will be:

```
Traceback (most recent call last):
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1267, in run_node
    return node.target(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/functional.py", line 189, in split
    return tensor.split(split_size_or_sections, dim)
  File "/home/xiaobing/pytorch-offical/torch/_tensor.py", line 804, in split
    return torch._VF.split_with_sizes(self, split_size, dim)
  File "/home/xiaobing/pytorch-offical/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1095, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1259, in dispatch
    return decomposition_table[func](*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_decomp/decompositions.py", line 1102, in split_with_sizes
    raise ValueError(
ValueError: Split sizes don't add up to the tensor's size in the given dimension

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1215, in get_fake_value
    return wrap_fake_exception(
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 835, in wrap_fake_exception
    return fn()
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1216, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1279, in run_node
    raise RuntimeError(
RuntimeError: Failed running call_function <function split at 0x7f45b8402ee0>(*(FakeTensor(..., size=(1, 5)), [2, 1, 1]), **{'dim': 1}):
Split sizes don't add up to the tensor's size in the given dimension
(scroll up for backtrace)

The above exception was the direct cause of the following exception:
```
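
For reference, a minimal repro consistent with the traceback above (shapes taken from the FakeTensor in the log; the wrapper function is made up):

```python
import torch

def fn(x):
    # 2 + 1 + 1 = 4 does not add up to size 5 along dim 1
    return torch.split(x, [2, 1, 1], dim=1)

x = torch.randn(1, 5)
try:
    fn(x)  # eager mode raises; after this PR torch.compile(fn)(x) fails the same way
except Exception as e:
    print(type(e).__name__, e)
```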

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99702
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/jansel
2023-04-25 01:13:52 +00:00
Will Constable
63690afc6c Make CI error on inductor fallback when decomp is available (#99473)
Fixes #99446

Remove the warning, as that annoyed end-users who don't know what to do about it.

Instead, try to hold the line by preventing any decomp from being added without making
the corresponding change to inductor's fallbacks.

Note: we probably still need to better document how to update inductor's decomps,
for now it's pretty much "go ask the inductor team for advice"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99473
Approved by: https://github.com/ezyang, https://github.com/ngimel, https://github.com/jansel
2023-04-21 05:47:28 +00:00
PyTorch MergeBot
5cb788a9a5 Revert "[cuda rng] Making offset calculation independent of device properties (#98988)"
This reverts commit 26f318574f.

Reverted https://github.com/pytorch/pytorch/pull/98988 on behalf of https://github.com/anijain2305 due to Diagnosing if sebotnet has flakiness
2023-04-19 17:23:40 +00:00
Animesh Jain
26f318574f [cuda rng] Making offset calculation independent of device properties (#98988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98988
Approved by: https://github.com/ngimel
2023-04-19 01:35:44 +00:00
Animesh Jain
fdbc8625a1 Functionalization of torch.rand/rand_like ops (#97377)
This PR introduces the functionalization of RNG ops. Key points are

* Introduces a new `philox_rand` prim operator that accepts seed, offset.
* Adds decompositions for random operators that use these philox_rand prims
* Adds a PhiloxStateTracker to track the offset for each occurrence of rand ops
* Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset>
* Monkeypatches set_rng_state and get_rng_state while AOT Autograd tracing to record the rng state behavior
* Raises an assertion for CPU because CPU does not use Philox RNG.

Not dealt with in this PR
* dropout op - offset calculation is different
* other distributions like normal, poisson etc
* Inductor support
* Cudagraph support
* Dynamic shape support

An example
~~~

class Custom(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        a = torch.rand_like(x) * x
        a = torch.rand_like(x) * a
        return a

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * torch.rand_like(grad_out) * torch.cos(x)

====== Forward graph 0 ======
def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0)
    philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32);  add = None
    mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1);  philox_rand = None
    add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4);  fwd_base_offset_1 = None
    philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32);  fwd_seed_1 = add_1 = None
    mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul);  philox_rand_1 = mul = None
    return [mul_1, primals_1]

====== Backward graph 0 ======
def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0);  bwd_base_offset_1 = None
    philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32);  bwd_seed_1 = add_2 = None
    mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2);  tangents_1 = philox_rand_2 = None
    cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1);  primals_1 = None
    mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos);  mul_2 = cos = None
    return [mul_3]

~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97377
Approved by: https://github.com/ezyang
2023-04-16 09:55:56 +00:00
Peter Bell
7b91bd2a7b [primTorch] Add count_nonzero (#98995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98995
Approved by: https://github.com/lezcano
2023-04-13 22:08:19 +00:00
Peter Bell
7d74dca780 [primTorch] Add rad2deg and deg2rad (#98994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98994
Approved by: https://github.com/lezcano
2023-04-13 22:08:19 +00:00
Nikita Karetnikov
ff825de442 [primTorch] add ref for cumprod (#98670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98670
Approved by: https://github.com/ezyang
2023-04-09 15:22:28 +00:00
albanD
0210481dcb Fix _like meta registrations (#98160)
The meta implementation for these _like functions is wrong whenever device != "meta" (it doesn't fill the memory!).
zeros_like is special due to sparse and is fixed directly by always filling it with zeros.
Every other one has a CompositeExplicit implementation, so I went with removing their meta registrations and tweaking code to avoid infinite recursions.
I could do the same as for zeros_like (and add the proper filling for each), but that would duplicate the C++ logic and make the meta registrations non-trivial. I can do that instead if you prefer it to removal.

test_meta works fine with these fixes, relying on CI to see if other tests are breaking as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98160
Approved by: https://github.com/ezyang
2023-04-06 18:44:34 +00:00
Kiersten Stokes
cea13ad9fa Improve size mismatch error messaging referencing mat/vec sizes (#96863)
Fixes #94841

This fixes the error messages in the following files, the same as those referenced in the linked issue. I was not able to find any additional examples, but am happy to add commits for any that I may have missed!

```
aten/src/ATen/native/Blas.cpp:     "size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
torch/_decomp/decompositions.py:        lambda: f"size mismatch, got {self.size(0)}x{self.size(1)},{vec.size(0)}",
```

Example output for `Blas.cpp` before:
```
size mismatch, got 3, 3x4,1
```

The new error messages have the following format:

```
aten/src/ATen/native/Blas.cpp:     "size mismatch, got bias (", self.size(0), "), matrix (", mat.size(0), "x", mat.size(1), "), vector (", vec.size(0), ")");
torch/_decomp/decompositions.py:        lambda: f"size mismatch, got matrix ({self.size(0)}x{self.size(1)}), vector ({vec.size(0)})",
```

Example output for `Blas.cpp` after:
```
size mismatch, got bias (3), matrix (3x4), vector (1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96863
Approved by: https://github.com/albanD
2023-03-17 21:07:48 +00:00
Rohan Gupta
b01d6f2cdb addmv decomp #2 (#96264)
Fixes #94617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96264
Approved by: https://github.com/ngimel, https://github.com/ezyang
2023-03-16 23:09:45 +00:00
Christian Puhrsch
0a53c9624a Back out "Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)" (#96885)
Summary:
Backing out  _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96885
Approved by: https://github.com/drisspg
2023-03-16 05:32:55 +00:00
mingfeima
6d62134f2c fix aminmax output resize issue when input is a zero dimension tensor (#96171)
Fix https://github.com/pytorch/pytorch/issues/96042

### before
```
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
__main__:1: UserWarning: An output with one or more elements was resized since it had shape [], which does not match the required output shape [1]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:24.)
torch.return_types.aminmax(
min=tensor([1]),
max=tensor([1]))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
```
### after
```
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))

```

Marked the following test as expected_fail:
`test_vmap.py TestVmapOperatorsOpInfoCPU.test_op_has_batch_rule_aminmax_cpu_float32`

Given an input of shape (2), the loop output has shape (2) but the batched vmap output has shape (2, 1), which mismatch.
The loop path computes twice on a tensor of shape (): without this patch, each output has shape (1) and they are stacked into (2, 1); with this patch, each output has shape () and they are stacked into (2).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96171
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/zou3519
2023-03-15 22:44:13 +00:00
BowenBao
60a68477a6 Bump black version to 23.1.0 (#96578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578
Approved by: https://github.com/ezyang
2023-03-15 06:27:59 +00:00
Jason Ansel
5dd52e250f [inductor] Add some simple decomps (#96039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96039
Approved by: https://github.com/ngimel
2023-03-05 17:07:56 +00:00
Natalia Gimelshein
43e71cddb0 [inductor] use triu ref instead of lowering (#96040)
Fixes #95958
Generated code is functionally identical between the ref and the lowering, with only minor differences

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96040
Approved by: https://github.com/jansel
2023-03-05 07:24:34 +00:00
Jason Ansel
5da6da659a [inductor] Enable some decomps (#96038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96038
Approved by: https://github.com/ngimel
2023-03-05 02:03:35 +00:00
Natalia Gimelshein
3a7fd20108 fix nll loss decomposition to properly ignore ignore_index (#95833)
Fixes #95794
This is a hotfix for the decomposition only (which is what inductor currently uses); the reference still accesses invalid indices. Perhaps `_nll_loss_nd` and this decomp should be unified. cc @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire @lezcano
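
For intuition, a minimal sketch of the masking approach (assuming the 2-D (N, C) case, reduction='mean', and no class weights; not the actual decomp code):

```python
import torch

def nll_loss_mean_sketch(log_probs, target, ignore_index=-100):
    mask = target != ignore_index
    # clamp ignored targets to a valid class so gather never reads an invalid index
    safe_target = torch.where(mask, target, target.new_zeros(()))
    picked = -log_probs.gather(1, safe_target.unsqueeze(1)).squeeze(1)
    picked = picked * mask  # zero out contributions from ignored positions
    return picked.sum() / mask.sum()
```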

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95833
Approved by: https://github.com/lezcano, https://github.com/Chillee
2023-03-02 08:37:56 +00:00
Brian Hirsh
ddd6b53d80 fix embedding_backward_dense decomp with broadcasting (#95499)
Fixes https://github.com/pytorch/pytorch/issues/95182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95499
Approved by: https://github.com/ezyang, https://github.com/ngimel
2023-02-28 00:24:40 +00:00
Christian Puhrsch
1fe2a9d122 Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)
Add _int_mm primitive that binds cuBLAS int8@int8 -> int32 matmul and that translates to Triton-based mm templates under max autotune. This is a very useful first step towards better supporting quantization on the GPU. This is not a user-facing API, but an internal primitive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94339
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-27 20:27:25 +00:00
Yanan Cao (PyTorch)
039b4c8809 Add meta function for _upsample_bilinear2d_aa (#94982)
Differential Revision: D43353000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94982
Approved by: https://github.com/ezyang
2023-02-19 07:11:20 +00:00
Brian Hirsh
68600fc7c6 avoid extra copies in batchnorm inference by introducing a new op, _native_batch_norm_legit_no_training (#94946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94946
Approved by: https://github.com/ezyang
2023-02-16 11:41:20 +00:00
Fabio Rocha
1dbaa5c290 Use decompositions for some fallbacks introduced in #94039 (#94206)
In some cases, implements required inductor primitives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94206
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-02-14 09:31:30 +00:00
Peter Bell
e22e323bea [decomp] Use var_mean in native_batch_norm decomposition (#94140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94140
Approved by: https://github.com/ngimel
2023-02-10 15:19:46 +00:00
Horace He
e844120b2f Fix embedding_dense_backward to not cast indices to floats (#94572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94572
Approved by: https://github.com/ngimel
2023-02-10 12:44:03 +00:00
lezcano
fe0e28ab87 [decompositions] GRU decomposition with and without packed sequence (#91466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91466
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
lezcano
5a7c1b7894 [decompositions] LSTM with packed input (#91465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91465
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
lezcano
bef61225c3 [decompositions] add decomposition for RNN with packed sequence (#91281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91281
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
lezcano
e5f6e1f660 [decompositions] add LSTM decomp (#91124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91124
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
lezcano
20d01d2dc9 [expanded weights] add RNN support via decomp (#91807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91807
Approved by: https://github.com/albanD
2023-02-08 14:16:30 +00:00
lezcano
c2a92687e0 [decompositions] add RNN decomp and testing (#91123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91123
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
Michael Lazos
d16c2c36ad Add another missing decomp (#94113)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94113
Approved by: https://github.com/jansel
2023-02-07 21:32:56 +00:00
Natalia Gimelshein
7bba87ed06 add rsub decomposition with alpha (#94144)
Fixes #93376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94144
Approved by: https://github.com/desertfire
2023-02-07 17:21:13 +00:00
Natalia Gimelshein
ea4cda5268 fix inductor clamp decomp to correctly type promote and avoid wrapping scalars (#94157)

Fixes #93784, #93225
Ideally, the clamp decomp should live in refs or _decomp, but that would reverse our current decomposition flow of `clamp_min` -> `clamp` -> lowering, so to keep changes to a minimum, I'm leaving it in inductor for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94157
Approved by: https://github.com/ezyang
2023-02-06 05:36:19 +00:00
Natalia Gimelshein
8ecda19607 fix upsampling decompositions to have integer output sizes (#94123)
This allows unet to be compiled with symbolic shapes (but it still fails accuracy, lol).
Output sizes are always integers; there's no need to pretend they are ever float. Recomputing scale factors already used nominally float sizes converted to int, so we might as well do it from the start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94123
Approved by: https://github.com/ezyang
2023-02-05 04:56:07 +00:00
Peter Bell
77acb556e6 [primTorch] Rewrite nan_to_num ref in terms of aten functions (#93952)
This de-duplicates `_refs.nan_to_num` with the inductor decomposition
and simplifies it to not reimplement `isnan`, `isposinf` and `isneginf`.
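
Roughly, the ref can then be written directly on top of those aten functions; a sketch for floating-point inputs (not the exact code from the PR):

```python
import torch

def nan_to_num_sketch(a, nan=0.0, posinf=None, neginf=None):
    # default replacement values mirror torch.nan_to_num: dtype max/min for +/-inf
    if posinf is None:
        posinf = torch.finfo(a.dtype).max
    if neginf is None:
        neginf = torch.finfo(a.dtype).min
    out = torch.where(torch.isnan(a), torch.full_like(a, nan), a)
    out = torch.where(torch.isposinf(a), torch.full_like(a, posinf), out)
    out = torch.where(torch.isneginf(a), torch.full_like(a, neginf), out)
    return out
```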

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93952
Approved by: https://github.com/lezcano
2023-02-03 13:51:37 +00:00
Peter Bell
72385bbd03 [primTorch] Rewrite is{,pos,neg}inf refs in terms of aten functions (#93951)
`isposinf` and `isneginf` currently fallback in inductor. Here, I
enable the existing decompositions to work with inductor.

`isinf` can also be written with aten functions, however I don't add
it to inductor's decompositions because `isinf` is lowered to
`tl.libdevice.isinf` in triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93951
Approved by: https://github.com/lezcano
2023-02-03 13:51:37 +00:00
XiaobingSuper
db87396474 inductor: align the decomposition output stride with the non-decomposition path for torch.lerp (#93336)
As the title says, we need to align the decomposition output stride with the non-decomposition path for torch.lerp, and also enable its lowering path in inductor.

After this PR for the following case:

```

def fn(i0, i1):
    # i0: (10, 3, 10)
    # i1: (3, 10, 10)
    x1 = i0.transpose(-2, -3)
    #y = torch.lerp(x1, x1, 70000)
    z = torch.lerp(i1, x1, 70000)
    return z

x0 = torch.rand(10, 3, 10)
x1 = torch.rand(3, 10, 10)
ret_eager = fn(x0, x1)
print('==== Eager mode OK! ====')
compiled = torch.compile(fn, fullgraph=True)
ret_compiled = compiled(x0, x1)
print('==== compile mode OK! ====')
ret_compiled = compiled(x0, x1)
print(torch.equal(ret_eager, ret_compiled))
print(ret_eager.stride()==ret_compiled.stride())
```

the inductor output code will look like this (CPU):

```

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<3; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<10; i1+=1)
            {
                for(long i2=0; i2<0; i2+=1)
                {
                    auto tmp7 = at::vec::Vectorized<float>::loadu(in_ptr0 + (10*i0) + (16*i2) + (30*i1));
                    auto tmp8 = at::vec::Vectorized<float>::loadu(in_ptr1 + (10*i1) + (16*i2) + (100*i0));
                    auto tmp0 = at::vec::Vectorized<float>(static_cast<float>(70000.0));
                    auto tmp1 = tmp0.abs();
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.5));
                    auto tmp3 = tmp1 >= tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1));
                    auto tmp5 = tmp0 - tmp4;
                    auto tmp6 = decltype(tmp5)::blendv(tmp0, tmp5, tmp3);
                    auto tmp9 = tmp7 - tmp8;
                    auto tmp10 = tmp6 * tmp9;
                    auto tmp11 = decltype(tmp7)::blendv(tmp8, tmp7, tmp3);
                    auto tmp12 = tmp10 + tmp11;
                    tmp12.store(out_ptr0 + (10*i1) + (16*i2) + (100*i0));
                }
                #pragma omp simd simdlen(8)
                for(long i2=0; i2<10; i2+=1)
                {
                    auto tmp7 = in_ptr0[i2 + (10*i0) + (30*i1)];
                    auto tmp8 = in_ptr1[i2 + (10*i1) + (100*i0)];
                    auto tmp0 = static_cast<float>(70000.0);
                    auto tmp1 = std::abs(tmp0);
                    auto tmp2 = static_cast<float>(0.5);
                    auto tmp3 = tmp1 >= tmp2;
                    auto tmp4 = static_cast<float>(1);
                    auto tmp5 = tmp0 - tmp4;
                    auto tmp6 = tmp3 ? tmp5 : tmp0;
                    auto tmp9 = tmp7 - tmp8;
                    auto tmp10 = tmp6 * tmp9;
                    auto tmp11 = tmp3 ? tmp7 : tmp8;
                    auto tmp12 = tmp10 + tmp11;
                    out_ptr0[i2 + (10*i1) + (100*i0)] = tmp12;
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    buf1 = empty_strided((3, 10, 10), (100, 10, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(arg1_1.data_ptr()), c_void_p(buf1.data_ptr()))
    del arg0_1
    del arg1_1
    return (buf1, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((10, 3, 10), (30, 10, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((3, 10, 10), (100, 10, 1), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1, arg1_1]))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93336
Approved by: https://github.com/jansel
2023-02-02 07:40:28 +00:00
Sherlock Huang
6a7d6cc30d Introduce core_aten_decompositions (#93131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93131
Approved by: https://github.com/ngimel
2023-02-01 08:35:46 +00:00
Joel Schlosser
e5fd7e6d8f Fix to use upsample_bicubic2d.vec decomp for dynamic shape support (#92854)
The `crossvit_9_240` model now works with dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92854
Approved by: https://github.com/ezyang
2023-01-25 05:08:02 +00:00
PyTorch MergeBot
01f1097770 Revert "Fix to use upsample_bicubic2d.vec decomp for dynamic shape support (#92854)"
This reverts commit d49187bf88.

Reverted https://github.com/pytorch/pytorch/pull/92854 on behalf of https://github.com/malfet due to Resulted in 50+% flaky failures in dynamo, reverting
2023-01-25 00:10:14 +00:00
Joel Schlosser
d49187bf88 Fix to use upsample_bicubic2d.vec decomp for dynamic shape support (#92854)
The `crossvit_9_240` model now works with dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92854
Approved by: https://github.com/ezyang
2023-01-24 21:36:17 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
8f3600b966 [RELAND] Add metadata coverage for unsafe_split and unsafe_split_with_sizes (#92802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92802
Approved by: https://github.com/soumith
2023-01-23 10:57:10 +00:00
PyTorch MergeBot
0d9de46d9c Revert "Add meta kernel coverage for aten.unsafe_split, aten.unsafe_chunk (#92608)"
This reverts commit 36e1f7bc2b.

Reverted https://github.com/pytorch/pytorch/pull/92608 on behalf of https://github.com/ezyang due to test_aot_autograd_symbolic_exhaustive_unsafe_split_cpu_float32 (main.TestEagerFusionOpInfoCPU) is now xpass
2023-01-22 13:57:31 +00:00
Tugsbayasgalan Manlaibaatar
36e1f7bc2b Add meta kernel coverage for aten.unsafe_split, aten.unsafe_chunk (#92608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92608
Approved by: https://github.com/ngimel
2023-01-22 07:12:29 +00:00
Peter Bell
dd760c98f8 [decomp] Use new squeeze.dims overload in decompositions (#91602)
This removes the now-redundant `_squeeze_multiple` helpers and instead decomposes into a single call to `aten::squeeze.dims` which also has the effect of reducing the lowered graph size in inductor.
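In other words, instead of emitting one squeeze node per dimension, the decomposition emits a single op. A rough sketch of the difference (assuming a PyTorch build that has the `squeeze.dims` overload):

```python
import torch

x = torch.randn(1, 3, 1, 4)

# old style: loop over dims, one squeeze node each (squeeze higher dims first)
y_old = x
for d in sorted((0, 2), reverse=True):
    y_old = torch.ops.aten.squeeze.dim(y_old, d)

# new style: a single call to the squeeze.dims overload
y_new = torch.ops.aten.squeeze.dims(x, [0, 2])

assert y_old.shape == y_new.shape == (3, 4)
```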
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91602
Approved by: https://github.com/ngimel
2023-01-20 18:08:18 +00:00
PyTorch MergeBot
2891cecd8d Revert "Add meta kernel coverage for aten.unsafe_split, aten.unsafe_chunk (#92608)"
This reverts commit 4386f317b9.

Reverted https://github.com/pytorch/pytorch/pull/92608 on behalf of https://github.com/ZainRizvi due to test_aot_autograd_symbolic_exhaustive_unsafe_split_cpu_float32 (__main__.TestEagerFusionOpInfoCPU) is failing consistently since this PR was merged
2023-01-20 17:17:35 +00:00
Tugsbayasgalan Manlaibaatar
4386f317b9 Add meta kernel coverage for aten.unsafe_split, aten.unsafe_chunk (#92608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92608
Approved by: https://github.com/ngimel
2023-01-20 12:39:56 +00:00
lezcano
8b861544f9 Remove lowering and decompositions of zero_, zero, zeros_like... in favour of their references (#92071)
The generated triton code is identical.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92071
Approved by: https://github.com/ngimel
2023-01-18 23:22:36 +00:00
Peter Bell
8770a7ed6f Decompose more inplace ops (#90967)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90967
Approved by: https://github.com/anijain2305
2023-01-18 21:07:47 +00:00
Peter Bell
4058dedf21 Replace log(1 + x) with log1p(x) (#92114)
`log1p` offers better precision near zero since `(1 + x) - 1` truncates any
values less than the float epsilon to zero. For `soft_margin_loss` this also
requires one fewer kernel invocation which for numel=1e7 gives me a 1.2x speedup
on CUDA and a 1.1x speedup on CPU.
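
A quick numerical illustration of the precision point (value chosen below the float32 epsilon):

```python
import torch

x = torch.tensor(1e-10, dtype=torch.float32)
print(torch.log(1 + x))  # tensor(0.)         -- the 1e-10 is lost when forming 1 + x
print(torch.log1p(x))    # tensor(1.0000e-10) -- the small value survives
```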

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92114
Approved by: https://github.com/ngimel, https://github.com/lezcano
2023-01-18 10:43:56 +00:00
lezcano
da58f9eb8f Rewrite out-of-place decompositions in terms of out-of-place ops (#92003)
Fixes https://github.com/pytorch/torchdynamo/issues/1863

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92003
Approved by: https://github.com/ngimel
2023-01-17 16:53:27 +00:00
vfdev-5
5f55335c2e Fixed output memory format mismatch for bicubic2d (#90470)
Description:

- output memory format is matching input for bicubic2d

Problem: output tensor's memory format does not match input format for bicubic2d

```python
import torch

i = torch.rand(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
assert i.is_contiguous(memory_format=torch.channels_last)
o = torch.nn.functional.interpolate(i, size=(4, 4), mode="bicubic")
assert o.is_contiguous(memory_format=torch.channels_last), f"Should be channels last but given channels first ({o.is_contiguous(memory_format=torch.contiguous_format)})"

> AssertionError: Should be channels last but given channels first (True)
```

Related PR fixing bilinear ops: https://github.com/pytorch/pytorch/pull/53535 (cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @bdhirsh )

Discovered together with @NicolasHug while working on https://github.com/pytorch/pytorch/tree/interpolate_uint8_images_linear_cpu_support_dev

- Updated code to match grad input / output memory formats
- temporary tensor creation matches memory format in `separable_upsample_generic_Nd_kernel_impl`
- Updated tests
- Added missing forward AD support for bicubic with antialiasing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90470
Approved by: https://github.com/NicolasHug, https://github.com/lezcano
2023-01-12 19:52:28 +00:00
min-jean-cho
af242eedfb [Inductor] Added aten.uniform_ decomp (#90869)
Fixes #90815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90869
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano, https://github.com/ngimel, https://github.com/albanD
2023-01-11 23:23:42 +00:00
David Berard
d7dc1c2fd5 Support zero dimensions in softmax decompositions (#91322)
The eager implementation of softmax supports computation along zero dimensions, but many of the other implementations did not, including:
* decompositions & refs (this was causing dynamo failures)
* forward AD for logsumexp
* MPS log_softmax_backward

This PR handles the `input.numel() == 0` cases separately to avoid running `amax()`, which fails for zero dimensions, and updates opinfos.

example of "computation along zero dimensions":

```python
# example of softmax along a zero-sized dimension
import torch

t = torch.rand((4, 0, 0))
print("~")
print(torch.nn.functional.softmax(t, dim=-1))  # this passes
print("~")
torch._refs.softmax(t, dim=-1)  # this fails
print("~")
```
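
A minimal sketch of the special-casing described above (not the exact ref code): when `numel() == 0`, skip the `amax` reduction entirely.

```python
import torch

def softmax_sketch(x, dim):
    if x.numel() == 0:
        # exp of an empty tensor is fine; amax over a zero-sized dim is not
        unnormalized = torch.exp(x)
    else:
        x_max = torch.amax(x, dim, keepdim=True)
        unnormalized = torch.exp(x - x_max)
    return unnormalized / torch.sum(unnormalized, dim, keepdim=True)

t = torch.rand((4, 0, 0))
print(softmax_sketch(t, dim=-1).shape)  # torch.Size([4, 0, 0])
```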
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91322
Approved by: https://github.com/lezcano
2023-01-11 09:35:43 +00:00
XiaobingSuper
3790b50505 inductor: fix .to(memory_format) issue which doesn't generate the right stride (#91948)
Motivation: for **.to(memory_format)**, the inductor doesn't generate the right stride; see the following example:
```
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        x = x.to(memory_format=torch.contiguous_format)
        return x
```

The generated code doesn't do the memory format change and ends up with a wrong stride **(802816, 1, 14336, 256)**, which is not a contiguous stride.

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    return (arg0_1, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((128, 256, 56, 56), (802816, 1, 14336, 256), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))
```

After this PR, the generated code will perform the memory format change:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<128; i0+=1)
            {
                #pragma GCC ivdep
                for(long i1=0; i1<256; i1+=1)
                {
                    #pragma GCC ivdep
                    for(long i2=0; i2<3136; i2+=1)
                    {
                        auto tmp0 = in_ptr0[i1 + (256*i2) + (802816*i0)];
                        out_ptr0[i2 + (3136*i1) + (802816*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf1 = empty_strided((128, 256, 56, 56), (802816, 3136, 56, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf1.data_ptr()))
    del arg0_1
    return (buf1, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((128, 256, 56, 56), (802816, 1, 14336, 256), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91948
Approved by: https://github.com/ngimel
2023-01-11 08:23:26 +00:00
min-jean-cho
364f526b9c [Inductor] assert generator for random, dropout (#91833)
See comment https://github.com/pytorch/pytorch/pull/90869#discussion_r1063731541 , https://github.com/pytorch/pytorch/pull/91673#discussion_r1061099337.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91833
Approved by: https://github.com/jansel
2023-01-11 03:24:10 +00:00
PyTorch MergeBot
43050b8301 Revert "[Inductor] Added aten.uniform_ decomp (#90869)"
This reverts commit c55293d640.

Reverted https://github.com/pytorch/pytorch/pull/90869 on behalf of https://github.com/huydhn due to Crossref error cannot just simply be ignored because it would break trunk for every commits after this, i.e. fd0030fe74.  The failure would need to be handled gracefully, i.e. adding an XFAIL for example
2023-01-11 01:18:11 +00:00
min-jean-cho
c55293d640 [Inductor] Added aten.uniform_ decomp (#90869)
Fixes #90815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90869
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano, https://github.com/ngimel, https://github.com/albanD
2023-01-10 23:05:01 +00:00
Nikita Karetnikov
00e5f3a9c5 [primTorch] Move logsumexp decomp to refs (#91860)
Fixes #91843.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91860
Approved by: https://github.com/lezcano
2023-01-09 17:00:43 +00:00
Natalia Gimelshein
2c00064113 remove unnecessary decomps (#91828)
in favor of refs. Generated triton code is the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91828
Approved by: https://github.com/lezcano, https://github.com/soumith
2023-01-07 20:37:12 +00:00
PyTorch MergeBot
c73147f741 Revert "[decomp] Use new squeeze.dims overload in decompositions (#91602)"
This reverts commit 9262ffc692.

Reverted https://github.com/pytorch/pytorch/pull/91602 on behalf of https://github.com/clee2000 due to stacked pr was reverted, this is dependent
2023-01-05 20:39:52 +00:00
Peter Bell
9262ffc692 [decomp] Use new squeeze.dims overload in decompositions (#91602)
This removes the now-redundant `_squeeze_multiple` helpers and instead decomposes into a single call to `aten::squeeze.dims` which also has the effect of reducing the lowered graph size in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91602
Approved by: https://github.com/ngimel
2023-01-05 17:59:32 +00:00
lezcano
484dd40022 Implement PReLU in a compositional way (#91238)
The PReLU implementation was all over the place. This led to a number
of bugs like https://github.com/pytorch/pytorch/issues/68760.  We fix it by:
- Keeping the weird broadcasting logic it has as a CompositeImplicit kernel that calls into a second kernel
- This second kernel is just a good-ol' pointwise kernel (see the sketch below).
- We implement the derivative for the pointwise kernel via TI as well for speed.
- We implement the second derivative for the pointwise kernel and the forward AD derivatives compositionally

This fixes a number of issues:
- We don't perform copies any more when the inputs are not contiguous
- The derivatives are now correct
- We fix vmap and many other functorch-related issues.
- CPU and CUDA now share the relevant broadcasting logic
- The implementation is about 1/3 the length.
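
As a sketch of the split described above, the pointwise kernel (once the weight has already been broadcast to the input's shape) is essentially:

```python
import torch

def prelu_pointwise_sketch(x, weight):
    # f(x) = x where x >= 0, weight * x elsewhere
    return torch.where(x >= 0, x, weight * x)

x = torch.randn(2, 3, 4)
w = torch.full_like(x, 0.25)  # weight already broadcast by the composite wrapper
assert torch.allclose(prelu_pointwise_sketch(x, w),
                      torch.nn.functional.prelu(x, torch.tensor(0.25)))
```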

Fixes https://github.com/pytorch/pytorch/issues/68760
Fixes https://github.com/pytorch/pytorch/issues/89895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91238
Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser, https://github.com/albanD
2022-12-30 10:42:30 +00:00
Joel Schlosser
8b55b86dbd Move sym_int and sym_float alongside SymInt / SymFloat in base torch package (#91317)
This PR moves the definitions for:
* `sym_int`
* `sym_ceil` (used only for `sym_int`)
* `sym_floor` (used only for `sym_int`)
* `sym_float`

from `torch/fx/experimental/symbolic_shapes.py` to `torch/__init__.py`, where `SymInt` and `SymFloat` are already defined.

This removes the need for several in-line imports, and enables proper JIT script gating for #91318. I'm very open to doing this in a better way!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91317
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2022-12-28 16:08:16 +00:00
Joel Schlosser
1c40ec46ff Decomps and meta registrations for upsample_nearest 1D / 2D / 3D (#91260)
Adds decompositions and meta registrations for the 1D, 2D, and 3D implementations of `upsample_nearest`. All related OpInfo-based tests for AOTAutograd now pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91260
Approved by: https://github.com/ezyang
2022-12-28 16:03:25 +00:00
Nikita Shulga
fd3a7264ae [MPS] Add group_norm[fwd+backward] and mean_var (take 2) (#91190)
Use Prims to implement group_norm, group_norm_backward and mean_var

Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages in order to make them importable from `torch/backend/mps/__init__.py`, since the `torch.ops` alias, defined in 15af4b1cee/torch/__init__.py (L1095), is executed last during the init process.

Add `__all__` to `torch/backends/mps/__init__.py` as well as alias all imports as private

Add `TestNNMPS.test_group_norm_backward` that validates no NaNs are generated during the backward pass

Fixes https://github.com/pytorch/pytorch/issues/88331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
2022-12-22 08:54:37 +00:00
PyTorch MergeBot
645eda0a00 Revert "[MPS] Add group_norm[fwd+backward] and mean_var (#91190)"
This reverts commit 371716eb36.

Reverted https://github.com/pytorch/pytorch/pull/91190 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names because of underscore _ops
2022-12-21 19:37:43 +00:00
Nikita Shulga
371716eb36 [MPS] Add group_norm[fwd+backward] and mean_var (#91190)
Use Prims to implement group_norm, group_norm_backward and mean_var

Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages in order to make them importable from `torch/backend/mps/__init__.py`, since the `torch.ops` alias, defined in 15af4b1cee/torch/__init__.py (L1095), is executed last during the init process.

Depends on https://github.com/pytorch/pytorch/pull/91203

Fixes https://github.com/pytorch/pytorch/issues/88331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
2022-12-21 17:33:27 +00:00
Nikita Shulga
46f64117db [BE] Use aten global var (#91188)
s/torch.ops.aten/aten/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91188
Approved by: https://github.com/ngimel
2022-12-21 02:28:51 +00:00
Peter Bell
e670c261c5 Decompose fill, zero, and zeros_like (#90968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90968
Approved by: https://github.com/ngimel
2022-12-21 00:59:50 +00:00
Natalia Gimelshein
e689c50922 Don't recompute var in bn decomp (#90984)
Fixes https://github.com/pytorch/torchdynamo/issues/1988
Repeated `var` computation is not CSE'd for some reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90984
Approved by: https://github.com/Chillee
2022-12-16 21:38:49 +00:00
Brian Hirsh
7a683eaeb8 aot_autograd: add assert for functional-only graph (#88816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88816
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-12-16 21:04:36 +00:00
soulitzer
98a9235dce Fix prelu ref when a.ndim < 2 (#89809)
Fixes https://github.com/pytorch/pytorch/issues/89560

Previously the test case for "input is 1-D or scalar + weight is not scalar" did not exist; adding it introduced some failures:
- forward AD (fixed in this PR)
- vmap (filed https://github.com/pytorch/pytorch/issues/89895)
- ref/meta (fixed in this PR, though this also regresses nvFuser support)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89809
Approved by: https://github.com/ngimel
2022-12-12 23:55:31 +00:00
Bin Bao
282dfe8ba4 [inductor][Reland] Use decomposition for _to_copy (#90494)
Summary: also contains a fix for https://github.com/pytorch/pytorch/issues/89633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90494
Approved by: https://github.com/ngimel
2022-12-09 16:51:50 +00:00
PyTorch MergeBot
e89685b0b5 Revert "[inductor] Use decomposition for _to_copy (#90314)"
This reverts commit 3fdb5f2dda.

Reverted https://github.com/pytorch/pytorch/pull/90314 on behalf of https://github.com/desertfire due to regresses performance on hf_Bert
2022-12-08 18:29:06 +00:00
Bin Bao
3fdb5f2dda [inductor] Use decomposition for _to_copy (#90314)
Summary: also contains a fix for https://github.com/pytorch/pytorch/issues/89633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90314
Approved by: https://github.com/ngimel
2022-12-08 15:25:44 +00:00
Peter Bell
e6a7278753 Give std/var correction overloads proper defaults (#56398)
The correction overloads' defaults were left off for forward compatibility reasons, but this FC window expired well over a year ago at this point.

Differential Revision: [D29625593](https://our.internmc.facebook.com/intern/diff/D29625593)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56398
Approved by: https://github.com/mruberry
2022-12-07 15:15:00 +00:00
Yanbo Liang
25f39c1bce Fix uniform ref implementation (#90094)
Fixes https://github.com/pytorch/torchdynamo/issues/1954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90094
Approved by: https://github.com/ngimel
2022-12-06 21:28:17 +00:00
Animesh Jain
c1950620c5 [decomp] Fix native_batch_norm_backward dtype of dweight and dbias (#89740)
Discovered while debugging an accuracy issue for Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89740
Approved by: https://github.com/soumith, https://github.com/ngimel
2022-11-29 03:15:20 +00:00
Brian Hirsh
e20ec44544 fixes for inductor <> batch norm (#89603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89603
Approved by: https://github.com/albanD
2022-11-29 02:16:52 +00:00
Jane Xu
8695f0cced Rectify native_batch_norm schema by splitting it into two legit schemas (#88697)
Using the same repro from the issue (but with BatchNorm2D)

Rectifies native_batch_norm schema by splitting the schema into 2:
1. one will have NON-optional alias-able running_mean and running_var inputs
2. the other will just not have those parameters at all (no_stats variation)

**Calling for name suggestions!**

## test plan
I've added tests in test_functionalization.py as well as an entry in common_method_invocations.py for `native_batch_norm_legit`
CI should pass.

## next steps
Because of bc/fc reasons, we reroute native_batch_norm to call our new schemas ONLY through the python dispatcher, but in 2 weeks or so, we should make `native_batch_norm_legit` the official batch_norm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88697
Approved by: https://github.com/albanD
2022-11-23 23:23:17 +00:00
Elias Ellison
a8d6b82167 Fix norm decomp when dtype is passed in (#89508)
Fix for https://github.com/pytorch/torchdynamo/issues/1889. The wrapper was doing a downcast even when the dtype was explicitly passed in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89508
Approved by: https://github.com/anijain2305
2022-11-23 20:49:09 +00:00
Elias Ellison
72110d7833 Fix Upsample Decomp Striding For Small Channels (#89528)
Fix for https://github.com/pytorch/torchdynamo/issues/623.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89528
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2022-11-23 20:47:39 +00:00
lezcano
154e58c032 Add most in-place references/decompositions (#88117)
We add most in-place references in a generic way. We also implement a wrapper to handle the annoying interface that `nn.functional` nonlinearities have.

We fix along the way a couple decompositions for some non-linearities by
extending the arguments that the references have.
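
Conceptually, each in-place variant can be derived from its out-of-place reference; a hedged sketch of the generic idea (not the actual helper in `torch._refs`):

```python
import torch

def make_inplace(out_of_place_fn):
    # derive foo_ from foo: compute out of place, then copy the result back
    # into the first argument so the input tensor is mutated in place
    def inplace_fn(a, *args, **kwargs):
        return a.copy_(out_of_place_fn(a, *args, **kwargs))
    return inplace_fn

relu_ = make_inplace(torch.relu)
x = torch.randn(4)
relu_(x)
assert (x >= 0).all()
```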
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88117
Approved by: https://github.com/mruberry
2022-11-18 14:59:46 +00:00
lezcano
3320915303 Fix decomp for embedding_backward and simplify the decomposition of embedding_dense and embedding_dense_backward (#87204)
See the title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87204
Approved by: https://github.com/Chillee
2022-11-16 17:46:54 +00:00
Sherlock Huang
5faa2792fa Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88761
Approved by: https://github.com/ezyang
2022-11-15 13:34:45 +00:00
PyTorch MergeBot
eea506aee1 Revert "Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)"
This reverts commit 9eabcc370f.

Reverted https://github.com/pytorch/pytorch/pull/88761 on behalf of https://github.com/suo due to much broken 9eabcc370f
2022-11-14 01:58:47 +00:00
Sherlock Huang
9eabcc370f Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88761
Approved by: https://github.com/ezyang
2022-11-13 21:30:53 +00:00
Horace He
37c5b42fa6 Fix matmul decomp to use reshape instead of contiguous().view() (#88832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88832
Approved by: https://github.com/bertmaher, https://github.com/ngimel
2022-11-12 00:15:42 +00:00
Ryan Spring
534ae6ae47 [primTorch] Implement group norm reference (#87054)
Add group norm reference
Split from #81191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87054
Approved by: https://github.com/mruberry
2022-11-11 01:08:20 +00:00
Sherlock Huang
c00c34fb69 Fix meta for aten.upsample_bilinear2d.vec (#88158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88158
Approved by: https://github.com/ngimel
2022-11-02 16:58:29 +00:00
Sherlock Huang
de1f641f11 Fix meta function for aten.addmm (#88068)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88068
Approved by: https://github.com/albanD
2022-11-01 17:05:48 +00:00
lezcano
fd27246c16 Fix decomposition for std (#87181)
The previous implementation was lacking a few features and incurred a pretty large error.

cc @ezyang @mruberry @ngimel @Lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87181
Approved by: https://github.com/ngimel, https://github.com/peterbell10
2022-10-28 00:50:29 +00:00
Sherlock Huang
eb99c1efce Prefer python meta function over c++ meta function (#87426)
This is a policy update for meta registration. **We now prefer the Python meta implementation over the C++ meta function.** This flips the previous policy, which preferred the C++ meta function over the Python one when both existed.

Here's the meta registration process:
1. `register_meta` and `register_decomposition` place the Python meta/decomp functions into the `global_decomp_table`. However, they do NOT register them with the dispatcher.
2. After `global_decomp_table` is populated, we compile an `active_meta_table`. For a given op, we pick the most specific decomp function from `global_decomp_table` in the preference order Meta > PostAutograd > PreAutograd.
3. We unconditionally register all of them with the Python dispatcher, and also register them with the C++ dispatcher, unless one of the following 3 cases applies:
- 1. the op is CompositeImplicitAutograd and should rely on the decomposed ops' meta functions
- 2. the op is a view op, as MetaTensor doesn't support aliased storage
- 3. the op is in the blocklist (due to UT failures; we will burn down this list op by op)

Over the long run, we wish to implement all meta functions in Python. With this PR, 321 op_overloads will have their cpp meta overridden by a python meta. There are still 400 op_overloads using a cpp meta. The exact list can be found here https://gist.github.com/SherlockNoMad/d20bb736178df8eebd3b054c8bb7cdc5
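For readers unfamiliar with meta functions, a hedged illustration of what a Python meta implementation does (addmm is used purely as an example; this is not the registered kernel):

```python
import torch

def addmm_meta_sketch(bias, mat1, mat2):
    # A meta function only produces an output with the right shape, dtype and
    # device; it never touches real data.
    return mat1.new_empty((mat1.shape[0], mat2.shape[1]))

a = torch.empty(3, 4, device="meta")
b = torch.empty(4, 5, device="meta")
bias = torch.empty(5, device="meta")
print(addmm_meta_sketch(bias, a, b).shape)  # torch.Size([3, 5])
```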

cc @ngimel @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87426
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-10-25 16:49:02 +00:00
Ryan Spring
9bb4926de0 Add xlogy and xlog1py references (#77712)
* Add reference implementations for `xlogy` and `xlog1py` (a simplified sketch follows below)
* Replace the `_wrap_scalar` helper function with the `scalar_tensor` prim
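A simplified sketch of the `xlogy` reference (type promotion and scalar handling omitted, so this is an illustration rather than the in-tree reference):

```python
import torch

def xlogy_sketch(x, y):
    # xlogy(x, y) = x * log(y), defined to be 0 wherever x == 0,
    # even if y is 0 or NaN.
    out = x * torch.log(y)
    return torch.where(x == 0, torch.zeros_like(out), out)

x = torch.tensor([0.0, 1.0, 2.0])
y = torch.tensor([float("nan"), 1.0, 4.0])
print(torch.allclose(xlogy_sketch(x, y), torch.xlogy(x, y)))  # True
```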
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77712
Approved by: https://github.com/mruberry
2022-10-22 17:59:25 +00:00
Edward Z. Yang
d73d4aa7de Audit for error prone isinstance int/float and add lint (#87345)
We recently fixed a bug on symbolic-shapes branch where
an isinstance(x, int) test failed when passed a SymIntNode.
To prevent this, I've added a lint for all the codepaths
where we may pass SymInt/SymFloat directly to reject
direct isinstance int/float tests, and instead use one of
the aliases.  The lint rule explains the options.  I then
go and fix all of them.
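An illustrative example of the failure mode (the exact alias names introduced by the lint are not reproduced here; the tuple check below is an assumption standing in for them):

```python
import torch

def is_int_like(x):
    # A plain isinstance(x, int) check rejects symbolic integers; checking
    # against both int and torch.SymInt is the kind of pattern the lint
    # steers code towards.
    return isinstance(x, (int, torch.SymInt))

print(isinstance(3, int), is_int_like(3))  # True True
# A SymInt produced during symbolic tracing fails the bare isinstance(x, int)
# test but passes the combined check.
```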

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87345
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2022-10-21 15:55:24 +00:00
Sherlock Huang
f7da9db9c1 Unify decomp registries into global_decomposition_table (#86857)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86857
Approved by: https://github.com/ezyang
2022-10-20 21:29:05 +00:00
Sherlock Huang
ef045695e0 Fix decomp for huber_loss_backward (#86955)
Fixes https://github.com/pytorch/pytorch/issues/86846

aten.huber_loss_backward calls aten.huber_loss_backward.out in its CompositeExplicitAutograd kernel.
The decomp was mistakenly registered for both aten.huber_loss_backward.default and aten.huber_loss_backward.out.
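A registration sketch of the fix (the body is a simplified rendition, not the exact in-tree decomposition, and a private registry is used so the snippet does not touch the global table):

```python
import torch
from torch._decomp import register_decomposition

aten = torch.ops.aten
my_table = {}  # private registry, so we don't clash with the global table

# Register the Python decomp for the .default overload only; the .out overload
# is what the CompositeExplicitAutograd kernel redispatches to, so it must not
# be decomposed with the same function.
@register_decomposition(aten.huber_loss_backward.default, registry=my_table)
def huber_loss_backward(grad_output, self, target, reduction, delta):
    norm = 1.0 / self.numel() if reduction == 1 else 1.0  # 1 == mean reduction
    x = self - target
    return torch.where(
        x < -delta,
        -norm * grad_output * delta,
        torch.where(x > delta, norm * grad_output * delta, norm * x * grad_output),
    )

print(aten.huber_loss_backward.default in my_table)  # True
print(aten.huber_loss_backward.out in my_table)      # False
```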

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86955
Approved by: https://github.com/Chillee
2022-10-14 18:53:02 +00:00
Nikita Karetnikov
4460e40db4 [primTorch] Add a ref for addcmul (#86731)
Based on:
https://github.com/pytorch/pytorch/pull/79827
https://github.com/pytorch/pytorch/pull/72949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86731
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-14 14:26:23 +00:00
Brian Hirsh
e17732b234 [test] add cross-ref tests for python meta kernels (#86228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86228
Approved by: https://github.com/albanD
2022-10-13 14:14:26 +00:00
Elias Ellison
d3f7c34cb3 Enable aten-aten decomps (#85921)
Invokes aten-aten decomps with a re-entrant FakeMode. These decomps are being used in other places, so it's good to unify the path that fake tensor takes / get additional testing, etc. There is also an instance where we returned different devices for cpu/cuda, which this fixes ([batch_norm](https://github.com/pytorch/pytorch/blob/master/torch/_decomp/decompositions.py#L1374))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85921
Approved by: https://github.com/ezyang
2022-10-08 05:12:42 +00:00
PyTorch MergeBot
7ec12a559c Revert "Enable aten-aten decomps (#85921)"
This reverts commit 62e4f51efd.

Reverted https://github.com/pytorch/pytorch/pull/85921 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. I think it breaks a dynamo test in trunk 62e4f51efd
2022-10-08 01:59:54 +00:00
Elias Ellison
62e4f51efd Enable aten-aten decomps (#85921)
Invokes aten-aten decomps with a re-entrant FakeMode. These decomps are being used in other places, so it's good to unify the path that fake tensor takes / get additional testing, etc. There is also an instance where we returned different devices for cpu/cuda, which this fixes ([batch_norm](https://github.com/pytorch/pytorch/blob/master/torch/_decomp/decompositions.py#L1374))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85921
Approved by: https://github.com/ezyang
2022-10-07 21:04:39 +00:00
lezcano
28a0b3fb18 Fix col2im and im2col decompositions (#86426)
I threw in some tests for good measure.

Fixes https://github.com/pytorch/pytorch/issues/86332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86426
Approved by: https://github.com/ngimel
2022-10-07 08:14:06 +00:00
Elias Ellison
9ceadcadb2 Fix unfold backward decomp aliasing for 0 dim input (#86428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86428
Approved by: https://github.com/ngimel, https://github.com/ezyang
2022-10-07 03:55:31 +00:00
lezcano
b67e022833 Fix ref / decomposition index_add (#86266)
The decomposition of `index_add` was using `slice(None)`, when it should
use just `None`.

The reference for index_add was also wrong, as `x[idx] += t` does not
use atomic add, so it does not work when several `idx`s point to the
same location.
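A small demonstration of the accumulation point (shapes are illustrative):

```python
import torch

x = torch.zeros(3)
idx = torch.tensor([0, 0, 0])
t = torch.ones(3)

y = x.clone()
y[idx] += t                          # last write wins: y[0] == 1.0
z = x.clone().index_add_(0, idx, t)  # accumulates:     z[0] == 3.0
print(y, z)
```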

This PR adds extra reference inputs to help test for this.

Fixes https://github.com/pytorch/torchdynamo/issues/1356
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86266
Approved by: https://github.com/ngimel
2022-10-05 19:59:15 +00:00
lezcano
c609768896 Add refs for torch.unfold and a decomposition for its backward. (#85629)
It's not clear to me what the difference is between `unfold` and `unfold_copy`, as the latter is codegen'd.

I also took this chance to clean up the implementation of unfold and its reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85629
Approved by: https://github.com/mruberry
2022-10-05 12:15:49 +00:00
Edward Z. Yang
d07b85393a SymInt fixes from symbolic-shapes branch (#86242)
symintify a few inplace meta functions

symintify resize_(), nbytes(), functionalization input mutations

meta funcs for avg_pool2d_backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86242
Approved by: https://github.com/Chillee
2022-10-05 04:52:02 +00:00
Peter Bell
b317736c39 Fix default correction value in std/var decompositions (#85839)
`torch.std` and `torch.var` default to the unbiased estimator, i.e.
`correction=1`. The wrong default only went unnoticed because the default on this
overload is not exercised by the tests.
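Concretely, the unbiased default means the sum of squared deviations is divided by N - 1:

```python
import torch

x = torch.randn(10)
# torch.var's default (correction=1) matches the hand-rolled unbiased estimator.
unbiased = x.sub(x.mean()).square().sum() / (x.numel() - 1)
print(torch.allclose(torch.var(x), unbiased))  # True
```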
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85839
Approved by: https://github.com/ezyang
2022-10-04 23:23:39 +00:00
Horace He
82d9592f1b Batch of symintifications to allow more models to pass in inference (#86104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86104
Approved by: https://github.com/ezyang
2022-10-04 04:01:58 +00:00
Edward Z. Yang
f3d7ab5438 Unconditionally register Python decomps to Meta key in Python Dispatcher (#85750)
This makes them available for Python Dispatcher to service them when
symbolic shapes are involved.  This is needed because under certain
conditions, functionalization will directly call the Meta kernel for a
function in order to produce a properly sized output wrapper tensor
for a view operation. This direct call bypasses the normal decomposition
table mechanism.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85750
Approved by: https://github.com/wconstab
2022-10-03 22:49:25 +00:00
Horace He
37013bb443 Added _unsafe_view decomp (#86103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86103
Approved by: https://github.com/ezyang
2022-10-03 20:38:31 +00:00
lezcano
07ce0b435b Remove backward for im2col and col2im (#85542)
`im2col` is a linear map, and `col2im` is its adjoint. As such, the
adjoint to `col2im` is `im2col` (the adjoint of the adjoint is the
original function).
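This adjoint relationship can be checked directly through autograd, using `F.unfold`/`F.fold` as the functional spellings of im2col/col2im (shapes below are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 2, 6, 6, requires_grad=True)
cols = F.unfold(x, kernel_size=3)            # im2col
g = torch.randn_like(cols)
(grad_x,) = torch.autograd.grad(cols, x, g)
# The backward of im2col is col2im applied to the incoming cotangent.
print(torch.allclose(grad_x, F.fold(g, output_size=(6, 6), kernel_size=3)))  # True
```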

There's no point having explicit derivatives in ATen for these
functions, so this PR deletes all these.

Furthermore, along the way, we fix an error for the derivative of im2col
for non-batched inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85542
Approved by: https://github.com/soulitzer, https://github.com/ngimel
2022-10-03 00:16:42 +00:00
Horace He
e6dd2965af A bunch of coverage improvements (re for models in inference snext50, BERT_pytorch, mobilenet_v3_large, pytorch_CycleGAN_and_pix2pix, dcgan, resnet18, mnasnet1_0) (#86050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86050
Approved by: https://github.com/ezyang
2022-10-02 20:46:20 +00:00
lezcano
787028cadb Implement col2im decomposition and fix im2col and add a few preconditions (#85541)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85541
Approved by: https://github.com/jansel
2022-09-30 09:31:53 +00:00
Elias Ellison
6a2b12dd65 Turn on aliasing tests for fake backwards, Fix Batch norm running mean/var decomp aliasing (#85471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85471
Approved by: https://github.com/ezyang
2022-09-28 23:06:59 +00:00
Animesh Jain
796da4df4d Return contiguous tensor from softmax decomposition (#85788)
Fixes https://github.com/pytorch/torchdynamo/issues/1135

The softmax decomp's output stride does not match the aten softmax output stride. Not sure if that's desirable; opening a PR for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85788
Approved by: https://github.com/ngimel, https://github.com/ezyang
2022-09-28 20:52:45 +00:00
Nikita Karetnikov
8dd45424ea [primTorch] Add ref for huber_loss and error inputs (#85041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85041
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-09-28 19:56:17 +00:00
Edward Z. Yang
793488cda2 Revert "Revert "Symintifying slice ops (#85196)"" (#85746)
This reverts commit 3a171dfb0c.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85746
Approved by: https://github.com/albanD
2022-09-28 04:37:35 +00:00
PyTorch MergeBot
3a171dfb0c Revert "Symintifying slice ops (#85196)"
This reverts commit 4c01c51266.

Reverted https://github.com/pytorch/pytorch/pull/85196 on behalf of https://github.com/atalman due to Break internal build Exutorch
2022-09-27 18:01:27 +00:00
Fabio Rocha
d5ce2bbed2 [primTorch] decompositions for upsample_bicubic2d (#85403)
FYI, this decomposition seems to be significantly slower than the lowering in torchinductor:

```
------------------------------------- upsample_bicubic2d -------------------------------------]
                                                              |  lowering  |  Inductor  |  Eager
32 threads: ------------------------------------------------------------------------------------
      (torch.Size([16, 4, 128, 256]),), ((512, 1024), True)   |    1.8     |   3.880    |   1.4
      (torch.Size([16, 4, 128, 256]),), ((512, 1024), False)  |    1.9     |   3.887    |   1.4
```

This seems related to the fact that in the lowering we can use int32s as the indices and in the decomp we can only use int64s (see https://github.com/pytorch/torchdynamo/issues/1293).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85403
Approved by: https://github.com/ngimel
2022-09-26 20:11:23 +00:00
Elias Ellison
bcc544e9d7 Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85417
Approved by: https://github.com/ezyang
2022-09-26 17:08:14 +00:00
Fabio Rocha
ffaff8896a Removed None arg check in test/test_decomp.py (#85402)
Not sure why this check was necessary; tests seem to run fine without it.
It was definitely skipping tests it shouldn't have, e.g. pretty much all
of the tests for `torch.nn.functional.interpolate`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85402
Approved by: https://github.com/ezyang
2022-09-24 11:37:27 +00:00
Edward Z. Yang
4c01c51266 Symintifying slice ops (#85196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85196
Approved by: https://github.com/ezyang
2022-09-23 22:01:32 +00:00
PyTorch MergeBot
d10de31cc8 Revert "Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417)"
This reverts commit 78afa0cf0c.

Reverted https://github.com/pytorch/pytorch/pull/85417 on behalf of https://github.com/clee2000 due to broke tests on trunk 78afa0cf0c
2022-09-23 17:21:43 +00:00
PyTorch MergeBot
3b195fd33e Revert "Turn on aliasing tests for fake backwards, Fix Batch norm running mean/var decomp aliasing (#85471)"
This reverts commit 1e92eb8068.

Reverted https://github.com/pytorch/pytorch/pull/85471 on behalf of https://github.com/clee2000 due to stacked prs https://github.com/pytorch/pytorch/pull/85417 and https://github.com/pytorch/pytorch/pull/85434 broke trunk, reverting this so i can revert the others
2022-09-23 17:13:35 +00:00
Elias Ellison
1e92eb8068 Turn on aliasing tests for fake backwards, Fix Batch norm running mean/var decomp aliasing (#85471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85471
Approved by: https://github.com/ezyang
2022-09-23 16:02:15 +00:00
Elias Ellison
78afa0cf0c Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85417
Approved by: https://github.com/ezyang
2022-09-23 15:50:03 +00:00
Ryan Spring
71dddec6ea Cast grad_input to half when input_dtype is half in _softmax_backward_data aten decomposition (#85497)
Fixes #85504

`_softmax_backward_data` and `_log_softmax_backward_data` cast `grad_input` to half when the `input_dtype` is half.
When running with amp without the cast, consumer ops can trigger `RuntimeError: expected scalar type Float but found Half`.
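A hedged sketch of the decomposition with the cast applied, simplified from the ATen kernels linked below (the function name and exact structure here are assumptions):

```python
import torch

def softmax_backward_sketch(grad_output, output, dim, input_dtype):
    new_grad = grad_output * output
    grad_input = new_grad - output * new_grad.sum(dim, keepdim=True)
    # Cast back to the original input dtype (e.g. half under amp) so that
    # downstream ops don't unexpectedly see a float32 tensor.
    return grad_input.to(input_dtype)

x = torch.randn(3, 4, requires_grad=True)
out = torch.softmax(x, dim=-1)
g = torch.randn_like(out)
(ref,) = torch.autograd.grad(out, x, g)
print(torch.allclose(ref, softmax_backward_sketch(g, out.detach(), -1, x.dtype), atol=1e-6))  # True
```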

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/SoftMax.cpp#L70-L83
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/SoftMax.cpp#L102-L113

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85497
Approved by: https://github.com/ngimel
2022-09-23 06:52:38 +00:00
PyTorch MergeBot
5043457a8e Revert "Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417)"
This reverts commit 9c77083965.

Reverted https://github.com/pytorch/pytorch/pull/85417 on behalf of https://github.com/clee2000 due to broke tests on trunk (and pull somehow) 9c77083965
2022-09-22 15:44:38 +00:00
Elias Ellison
9c77083965 Add FakeCrossRef tests for backwards, Fix Layer Norm Backward Decomp (#85417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85417
Approved by: https://github.com/ezyang
2022-09-22 13:03:57 +00:00
Horace He
2f4a517d67 Ported matmul compositeimplicitautograd impl into core (#85239)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85239
Approved by: https://github.com/ezyang, https://github.com/lezcano
2022-09-21 09:25:24 +00:00
lezcano
d17b144e65 Adding multigammaln ref and fix arange (#85153)
Partially based on https://github.com/pytorch/pytorch/pull/83662.

I'll help land this one, as Rob does not work in the PyTorch project
anymore

I removed the data-dependent check for the args, as data dependencies
are bad for many reasons (and it was failing when the input had NaNs).

It also registers arange as a decomposition, and fixes the naming of its
args.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85153
Approved by: https://github.com/mruberry, https://github.com/ngimel
2022-09-20 17:52:56 +00:00
lezcano
5dd9610e9d Refs and decompositions for index_{add,copy,select,fill} (#85002)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85002
Approved by: https://github.com/ngimel
2022-09-17 19:57:34 +00:00
PyTorch MergeBot
e33b464ffc Revert "Refs and decompositions for index_{add,copy,select,fill} (#85002)"
This reverts commit 2f0b3de443.

Reverted https://github.com/pytorch/pytorch/pull/85002 on behalf of https://github.com/huydhn due to Broke trunk slow tests
2022-09-17 04:26:04 +00:00
lezcano
2f0b3de443 Refs and decompositions for index_{add,copy,select,fill} (#85002)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85002
Approved by: https://github.com/ngimel
2022-09-16 23:59:35 +00:00
Sherlock Huang
29eba319b4 Use alias for nop decomp (#84727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84727
Approved by: https://github.com/Chillee
2022-09-16 18:50:56 +00:00
Natalia Gimelshein
6162a04364 fix half_to_float arg in *softmax decomp (#85120)
Fixes https://github.com/pytorch/torchdynamo/issues/1239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85120
Approved by: https://github.com/Chillee
2022-09-16 15:54:50 +00:00
soulitzer
7f88934a8f [reland 2] Call jit decomp in VariableType to improve forward AD coverage (#84976)
Reland of https://github.com/pytorch/pytorch/pull/84675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84976
Approved by: https://github.com/zou3519
2022-09-15 22:46:19 +00:00
PyTorch MergeBot
36d79143ce Revert "[reland] Call jit decomposition in VariableType to increase forward AD coverage (#84151) (#84675)"
This reverts commit bb4e96c964.

Reverted https://github.com/pytorch/pytorch/pull/84675 on behalf of https://github.com/osalpekar due to causing asan xplat link-time errors like ld.lld: error: undefined symbol: torch::jit::has_jit_decomposition(c10::FunctionSchema const&)
2022-09-13 22:54:54 +00:00
soulitzer
bb4e96c964 [reland] Call jit decomposition in VariableType to increase forward AD coverage (#84151) (#84675)
This reverts commit acb4a09628.

In addition, we also fix a memory leak in layer norm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84675
Approved by: https://github.com/zou3519
2022-09-12 20:33:14 +00:00
Horace He
1459a909b4 Added mv, mm, and binary_cross_entropy_with_logits decomps (#84451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84451
Approved by: https://github.com/ngimel
2022-09-08 17:56:18 +00:00
soulitzer
e31ad1c2d3 [reland] Move decompositions and helpers for jvp from functorch into core (#84581)
Reland of https://github.com/pytorch/pytorch/pull/84358
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84581
Approved by: https://github.com/samdow
2022-09-07 15:31:46 +00:00
Ivan Yashchuk
6363b1b358 Add nvFuser support for aten.native_batch_norm_backward (#84546)
Replacing `tensor.reshape(broadcast_mask)` with unsqueezes makes the implementation of `batch_norm_backward` more friendly for PrimTorch+nvFuser.
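For illustration, the two spellings produce the same broadcastable view of a per-channel statistic (C and the NCHW layout below are assumptions):

```python
import torch

C = 4
mean = torch.randn(C)             # per-channel statistic
broadcast_mask = [1, C, 1, 1]     # NCHW broadcast shape

a = mean.reshape(broadcast_mask)                   # reshape-based form
b = mean.unsqueeze(0).unsqueeze(-1).unsqueeze(-1)  # unsqueeze-based form
print(torch.equal(a, b), a.shape == b.shape)  # True True
```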
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84546
Approved by: https://github.com/Chillee
2022-09-06 19:56:17 +00:00
Fabio Rocha
91a5f52f51 Decomp for nn.functional.grid_sampler_2d (#84350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84350
Approved by: https://github.com/jansel, https://github.com/Lezcano
2022-09-05 21:33:26 +00:00
lezcano
3dfbf09afe Optimise the decomposition for adaptive_avg_pool2d wrt. TorchInductor (#84483)
This fixes some part of the implementation that did not work with
TorchInductor (e.g. the indices in TorchInductor need to be `int64`s,
while in PyTorch we can have `int32`s).

It also brings the performance of the kernel up to numbers similar to
those of the lowering (benchmarks below).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84483
Approved by: https://github.com/jansel
2022-09-02 22:25:09 +00:00
PyTorch MergeBot
375d6cd5b7 Revert "Move decompositions and helpers for jvp from functorch into core (#84358)"
This reverts commit a3c60a4db4.

Reverted https://github.com/pytorch/pytorch/pull/84358 on behalf of https://github.com/malfet due to Broke lint
2022-09-01 23:42:48 +00:00
soulitzer
a3c60a4db4 Move decompositions and helpers for jvp from functorch into core (#84358)
This refactor shouldn't change any behavior. At this point functorch still relies on the mechanism in DynamicLayerFront; we just moved some parts of it into core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84358
Approved by: https://github.com/samdow
2022-09-01 22:39:15 +00:00
Sherlock Huang
ef3ab31f1c Decomp for aten.im2col (#84303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84303
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-09-01 00:06:35 +00:00
Nikita Karetnikov
71ce9cd072 [primTorch] Add decomp for soft_margin_loss (#83804)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83804
Approved by: https://github.com/Lezcano, https://github.com/ngimel
2022-08-31 17:39:34 +00:00
Nikita Shulga
b8e1c54f53 [Prim] Implement group_norm_backward (#84037)
Test plan: CI, i.e. `python3 test_decomp.py -v -k test_comprehensive_nn_functional_group_norm` plus:
```
#!/usr/bin/env python3.8
import torch

func = torch.ops.aten.native_group_norm_backward.default
decomp =  torch._decomp.decomposition_table[func]
for args in (
        (torch.rand(1, 6, 3), torch.rand(1, 6, 3), torch.rand(1, 2), torch.rand(1, 2), torch.rand(6), 1, 6, 3, 2, [True, True, True]),
        (torch.rand(64, 768, 7, 7), torch.rand(64, 768, 7, 7), torch.rand(64, 1), torch.rand(64, 1), torch.rand(768), 64, 768, 49, 1, [True, True, True])):
    nrc=func(*args)
    drc=decomp(*args)
    for i in range(len(nrc)):
       print(i, torch.max(nrc[i]-drc[i]))
    print(all(torch.allclose(x, y) for (x, y) in zip(nrc, drc)))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84037
Approved by: https://github.com/Chillee, https://github.com/ngimel
2022-08-29 09:29:30 +00:00
Natalia Gimelshein
533203f5aa _to_copy decomp (#84108)
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84108
Approved by: https://github.com/Chillee
2022-08-29 02:25:02 +00:00
lezcano
9fc02f6bc5 Decomposition for adaptive_avg_pool2d (#84062)
This was already implemented as a lowering in https://github.com/pytorch/torchdynamo/pull/962. I'm putting the idea up here ~(I haven't even run this code, so it surely has *many* issues, but I reckon the general idea should hopefully be alright).~ The tests now pass and I corrected the issues that the first implementation had.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84062
Approved by: https://github.com/jansel
2022-08-29 01:38:51 +00:00
PyTorch MergeBot
33db5da4c1 Revert "[Prim] Implement group_norm_backward (#84037)"
This reverts commit bed85cce8b.

Reverted https://github.com/pytorch/pytorch/pull/84037 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-08-28 17:30:50 +00:00
PyTorch MergeBot
ff23f3ac1c Revert "_to_copy decomp (#84108)"
This reverts commit e33897cb99.

Reverted https://github.com/pytorch/pytorch/pull/84108 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-08-28 13:27:49 +00:00
Natalia Gimelshein
e33897cb99 _to_copy decomp (#84108)
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84108
Approved by: https://github.com/Chillee
2022-08-27 03:51:03 +00:00
Nikita Shulga
bed85cce8b [Prim] Implement group_norm_backward (#84037)
Test plan: CI, i.e. `python3 test_decomp.py -v -k test_comprehensive_nn_functional_group_norm` plus:
```
#!/usr/bin/env python3.8
import torch

func = torch.ops.aten.native_group_norm_backward.default
decomp =  torch._decomp.decomposition_table[func]
for args in (
        (torch.rand(1, 6, 3), torch.rand(1, 6, 3), torch.rand(1, 2), torch.rand(1, 2), torch.rand(6), 1, 6, 3, 2, [True, True, True]),
        (torch.rand(64, 768, 7, 7), torch.rand(64, 768, 7, 7), torch.rand(64, 1), torch.rand(64, 1), torch.rand(768), 64, 768, 49, 1, [True, True, True])):
    nrc=func(*args)
    drc=decomp(*args)
    for i in range(len(nrc)):
       print(i, torch.max(nrc[i]-drc[i]))
    print(all(torch.allclose(x, y) for (x, y) in zip(nrc, drc)))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84037
Approved by: https://github.com/Chillee, https://github.com/ngimel
2022-08-27 01:10:27 +00:00
Horace He
9a236c7ab4 Made some minor cleanups to decompositions (#83814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83814
Approved by: https://github.com/ngimel
2022-08-26 10:55:31 +00:00
Animesh Jain
e2f75d63d4 Decomposition - batch_norm, save_mean and save_variance always float32 (#84013)
AMP error shown here - https://github.com/pytorch/torchdynamo/issues/835

Test missing
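A hedged sketch of the intent (not the in-tree decomposition; the variance-vs-rstd detail and shapes are simplified): keep the saved statistics in float32 even when the input is half under amp.

```python
import torch

def batch_norm_stats_sketch(x):  # x assumed NCHW
    x32 = x.to(torch.float32)
    save_mean = x32.mean(dim=(0, 2, 3))
    save_var = x32.var(dim=(0, 2, 3), correction=0)
    return save_mean, save_var

x = torch.randn(2, 3, 4, 4, dtype=torch.float16)
mean, var = batch_norm_stats_sketch(x)
print(mean.dtype, var.dtype)  # torch.float32 torch.float32
```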
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84013
Approved by: https://github.com/ezyang
2022-08-25 16:09:52 +00:00
Ivan Yashchuk
473b733bae Replace .new_zeros(()) with 0.0 in torch/_decomp/decompositions (#83734)
`new_zeros` is decomposed into `prims.empty_strided`+`prims.fill`+`prims.copy_to`, and none of these is currently supported by the prims+nvFuser executor.
Replacing it with 0.0 makes these backward decompositions nvFuser friendly.
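A hedged before/after sketch using `hardsigmoid_backward` (as in the example below); this is a simplified rendition, not the exact in-tree code:

```python
import torch

def hardsigmoid_backward_before(grad, a):
    mask = (a > -3.0) & (a < 3.0)
    # new_zeros(()) traces to empty_strided + fill + copy_to prims
    return torch.where(mask, grad * (1.0 / 6.0), grad.new_zeros(()))

def hardsigmoid_backward_after(grad, a):
    mask = (a > -3.0) & (a < 3.0)
    # a plain Python 0.0 stays a scalar in the trace, which nvFuser can consume
    return torch.where(mask, grad * (1.0 / 6.0), 0.0)
```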

Example with `torch.ops.aten.hardsigmoid_backward.default`:
```py
# Before this PR
opcode         name                      target                            args                                                          kwargs
-------------  ------------------------  --------------------------------  ------------------------------------------------------------  ----------------------------------------------------------------------------------------
placeholder    a_1                       a_1                               ()                                                            {}
placeholder    g_1                       g_1                               ()                                                            {}
call_function  gt_default                nvprims.gt.default                (a_1, -3.0)                                                   {}
call_function  lt_default                nvprims.lt.default                (a_1, 3.0)                                                    {}
call_function  bitwise_and_default       nvprims.bitwise_and.default       (gt_default, lt_default)                                      {}
call_function  mul_default               nvprims.mul.default               (g_1, 0.16666666666666666)                                    {}
call_function  empty_strided             prims.empty_strided.default       ([], [])                                                      {'dtype': torch.float32, 'device': device(type='cuda', index=0), 'requires_grad': False}
call_function  fill_default              prims.fill.default                (empty_strided, 0)                                            {}
call_function  copy_to_default           prims.copy_to.default             (empty_strided, fill_default)                                 {}
call_function  broadcast_in_dim_default  nvprims.broadcast_in_dim.default  (copy_to_default, [3, 2], [])                                 {}
call_function  where_default             nvprims.where.default             (bitwise_and_default, mul_default, broadcast_in_dim_default)  {}
output         output                    output                            (where_default,)                                              {}

# After this PR
opcode         name                 target                       args                                     kwargs
-------------  -------------------  ---------------------------  ---------------------------------------  --------
placeholder    a_1                  a_1                          ()                                       {}
placeholder    g_1                  g_1                          ()                                       {}
call_function  gt_default           nvprims.gt.default           (a_1, -3.0)                              {}
call_function  lt_default           nvprims.lt.default           (a_1, 3.0)                               {}
call_function  bitwise_and_default  nvprims.bitwise_and.default  (gt_default, lt_default)                 {}
call_function  mul_default          nvprims.mul.default          (g_1, 0.16666666666666666)               {}
call_function  where_default        nvprims.where.default        (bitwise_and_default, mul_default, 0.0)  {}
output         output               output                       (where_default,)                         {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83734
Approved by: https://github.com/Chillee
2022-08-22 09:12:13 +00:00