Commit Graph

590 Commits

Aaron Gokaslan
12c4a2c297 [BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716)
Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test is actually testing the dunder method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716
Approved by: https://github.com/ezyang
2024-06-03 17:22:13 +00:00
PyTorch MergeBot
d1fad416a8 Revert "Add aten._unsafe_masked_index (#116491)"
This reverts commit f03f8bc901.

Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))
2024-06-03 15:51:50 +00:00
Isuru Fernando
f03f8bc901 Add aten._unsafe_masked_index (#116491)
Adds aten._unsafe_masked_index to generate masked indexing operations that
produce masked loads in the generated Triton code.
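For intuition, here is a minimal sketch of the intended semantics (illustrative only; the helper name and structure are assumptions, not the actual decomposition):

```python
import torch

# Gather t at indices, but only where mask is true. Masked-off lanes are
# redirected to a safe in-bounds index and then overwritten with fill, which
# is the shape a masked load takes in generated Triton code.
def unsafe_masked_index_ref(t, mask, indices, fill):
    safe = [torch.where(mask, i, torch.zeros_like(i)) for i in indices]
    return t[tuple(safe)].masked_fill(~mask, fill)

t = torch.arange(10.0)
idx = torch.tensor([0, 5, 42])           # 42 is out of bounds but masked off
mask = torch.tensor([True, True, False])
print(unsafe_masked_index_ref(t, mask, [idx], 0.0))  # tensor([0., 5., 0.])
```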

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-03 14:44:03 +00:00
Peter Bell
39de62845a [decomp] Fix default values missing from inplace rrelu decomposition (#126978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126978
Approved by: https://github.com/lezcano
2024-05-26 23:49:40 +00:00
Andres Lugo-Reyes
38b8b614a2 [ROCm] Implement forward AD for miopen_batch_norm (#125069)
Implements forward automatic differentiation support for miopen_batch_norm and unskips the associated unit tests. Also fixes a class of functorch-related unit tests that failed a contiguous-tensor assertion in BatchNorm_miopen.cpp; the solution is to restrict miopen_batch_norm to tensors with at least 3 dimensions. The same restriction already exists in the cudnn path, which is why the tests in question only failed on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125069
Approved by: https://github.com/jeffdaily, https://github.com/andrewor14
2024-05-14 19:09:50 +00:00
Edward Z. Yang
4731130ea8 Add a code comment about torch._check_is_size in tensor_split (#125292)
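For reference, a hedged sketch of how torch._check_is_size is typically used around a split boundary (illustrative usage only, not the comment added by this PR):

```python
import torch

def split_at(t, n):
    # Tells the symbolic tracer that n is a size (>= 0), avoiding
    # data-dependent guard errors when n is an unbacked SymInt.
    torch._check_is_size(n)
    return torch.tensor_split(t, [n])

parts = split_at(torch.arange(10), 4)
print([p.tolist() for p in parts])  # [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]
```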
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125292
Approved by: https://github.com/albanD
2024-05-02 02:25:38 +00:00
Aaron Orenstein
a8574a9719 Fix global flake8 issues (#124771)
Prior to this change, `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-26 15:35:53 +00:00
PyTorch MergeBot
1ac60484c1 Revert "Fix global flake8 issues (#124771)"
This reverts commit f01275934b.

Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
Aaron Orenstein
f01275934b Fix global flake8 issues (#124771)
Prior to this change, `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-25 14:25:00 +00:00
Peter Bell
58806d6531 [decomp] Remove dead device_hint function (#124849)
The only use of this function is in `_to_copy` but the result is never used,
so this is just dead code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124849
Approved by: https://github.com/lezcano
2024-04-25 11:25:51 +00:00
Isuru Fernando
edcd968b51 Add out wrappers to some decompositions (#115437)
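For context, a minimal sketch of the out-wrapper pattern (a hypothetical helper assuming the common compute-then-copy approach; not the actual PyTorch wrapper):

```python
import torch

# Wrap a functional decomposition so it also accepts an out= buffer:
# compute the result functionally, then resize and copy into out.
def with_out_kwarg(fn):
    def wrapper(*args, out=None, **kwargs):
        result = fn(*args, **kwargs)
        if out is None:
            return result
        out.resize_(result.shape)
        out.copy_(result)
        return out
    return wrapper

@with_out_kwarg
def add_one(x):
    return x + 1

buf = torch.empty(3)
add_one(torch.tensor([1.0, 2.0, 3.0]), out=buf)
print(buf)  # tensor([2., 3., 4.])
```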
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115437
Approved by: https://github.com/lezcano
2024-04-23 06:26:11 +00:00
vfdev-5
6330acae76 Refactored implementation for upsample_nearest decompositions (#122783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122783
Approved by: https://github.com/peterbell10
2024-04-17 23:05:40 +00:00
Edward Z. Yang
60d7fbe89a Register matmul out variant so it is used (#122979)
Fixes https://github.com/pytorch/pytorch/issues/122774
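A quick usage sketch of the path this enables (plain `torch.matmul` with an `out=` buffer):

```python
import torch

a, b = torch.randn(4, 8), torch.randn(8, 4)
out = torch.empty(4, 4)
# With the out variant registered, this writes into `out` directly instead
# of allocating a fresh result tensor.
torch.matmul(a, b, out=out)
print(torch.allclose(out, a @ b))  # True
```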

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122979
Approved by: https://github.com/Chillee, https://github.com/Skylion007
2024-04-09 22:21:37 +00:00
Andrew M. James
bde1a93bc4 Add lowering for resize, decomp for resize_as. (#122317)
This has been split off from #121354, as the inplace versions of these
methods proved to be rather tricky.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122317
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-04-03 17:47:29 +00:00
vfdev-5
38946bff51 Added DispatchKey.CompositeImplicitAutograd to all upsample_nearest*.default decompositions (#122782)
Related to https://github.com/pytorch/pytorch/pull/117632#issuecomment-2021321172
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122782
Approved by: https://github.com/ezyang
2024-03-29 13:55:25 +00:00
vfdev-5
b524a404e0 Fixed support for uint8 in upsample bicubic2d decomposition (#120411)
Supersedes https://github.com/pytorch/pytorch/pull/104248

Description:
- Fixed uint8 support in the upsample bicubic2d decomposition (on `main` the results are wrong, so we can tolerate the slowdown)
- Added the missing clamp(0, 1) for xscale and yscale
  - This causes a slowdown for f32 on CPU. The CPU node-fusion PR https://github.com/pytorch/pytorch/pull/120077 can help for upsampling cases with align_corners=True
  - The slowdown is mainly due to the added clamp op, and is partially reduced by using torch.stack in the weights computation on CPU
- Removed the lowering implementation
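
A usage sketch of the path this change affects (shapes taken from the benchmark below; uint8 bicubic under torch.compile goes through the fixed decomposition):

```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 256, (1, 3, 500, 400), dtype=torch.uint8)
upsample = torch.compile(
    lambda t: F.interpolate(t, size=(256, 256), mode="bicubic")
)
y = upsample(x)  # previously produced wrong results on main for uint8
```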

Benchmarks:
```
[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                   |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |        613.029 (+-1.590)        |         5477.608 (+-9.027)         |           3060.314 (+-12.368)           |     0.559 (+-0.000)      |          608.735 (+-6.336)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |        610.176 (+-1.428)        |        5718.503 (+-11.203)         |           3424.022 (+-12.836)           |     0.599 (+-0.000)      |          604.781 (+-6.229)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        325.001 (+-0.840)        |        6183.029 (+-10.893)         |            3275.032 (+-7.625)           |     0.530 (+-0.000)      |          325.693 (+-1.067)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        325.855 (+-1.108)        |        6391.394 (+-11.552)         |            3533.410 (+-7.666)           |     0.553 (+-0.000)      |          325.838 (+-1.457)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       2521.533 (+-14.857)       |        5025.217 (+-13.415)         |            2814.304 (+-6.742)           |     0.560 (+-0.000)      |         2520.308 (+-10.796)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       2531.204 (+-12.534)       |        5294.925 (+-11.994)         |            3147.590 (+-6.808)           |     0.594 (+-0.000)      |         2521.228 (+-11.732)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |        758.352 (+-10.362)       |        5639.912 (+-14.495)         |            3014.123 (+-8.799)           |     0.534 (+-0.000)      |          756.114 (+-4.792)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |        758.712 (+-5.781)        |         5927.541 (+-9.982)         |            3249.555 (+-7.226)           |     0.548 (+-0.000)      |          757.719 (+-5.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       1524.469 (+-12.860)       |        34321.641 (+-80.310)        |           19373.714 (+-56.351)          |     0.564 (+-0.000)      |         1518.082 (+-49.653)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       1521.746 (+-13.780)       |        35949.711 (+-81.010)        |           21782.366 (+-68.938)          |     0.606 (+-0.000)      |         1467.911 (+-15.901)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |        712.311 (+-5.361)        |        38826.510 (+-92.267)        |           20762.314 (+-59.303)          |     0.535 (+-0.000)      |          712.669 (+-4.673)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |        715.060 (+-4.757)        |        40269.353 (+-92.543)        |           22402.114 (+-81.574)          |     0.556 (+-0.000)      |          716.001 (+-8.945)

      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)       |       2331.889 (+-29.159)       |        21541.096 (+-72.346)        |           12181.194 (+-45.288)          |     0.565 (+-0.000)      |         2304.864 (+-21.351)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)      |       2333.697 (+-10.066)       |        22514.154 (+-57.798)        |           21709.449 (+-98.307)          |     0.964 (+-0.000)      |         2302.141 (+-13.041)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)           |        1198.768 (+-5.364)       |       37652.371 (+-101.644)        |           42740.413 (+-98.571)          |     1.135 (+-0.000)      |          1197.104 (+-7.225)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)          |        1196.851 (+-5.118)       |       39678.341 (+-173.750)        |           46807.738 (+-92.744)          |     1.180 (+-0.000)      |          1189.322 (+-5.681)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)     |       10020.978 (+-54.855)      |        19955.290 (+-71.891)        |           11420.521 (+-53.179)          |     0.572 (+-0.000)      |         9999.583 (+-61.230)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)    |       10066.441 (+-62.700)      |       21058.334 (+-183.414)        |           19986.577 (+-65.304)          |     0.949 (+-0.000)      |         10018.672 (+-59.188)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)         |       3171.135 (+-14.635)       |        19687.864 (+-54.320)        |           23313.699 (+-57.391)          |     1.184 (+-0.000)      |         3182.191 (+-17.686)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)        |       3181.314 (+-13.784)       |        20224.468 (+-50.827)        |          30541.963 (+-381.385)          |     1.510 (+-0.000)      |         3183.578 (+-16.203)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)       |       5879.450 (+-31.551)       |       136918.555 (+-480.320)       |          77723.568 (+-331.766)          |     0.568 (+-0.000)      |         5726.061 (+-87.517)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)      |       5882.869 (+-30.325)       |       143378.094 (+-513.842)       |         137244.074 (+-4827.730)         |     0.957 (+-0.000)      |         5727.679 (+-22.164)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)           |       2674.937 (+-45.003)       |      244829.360 (+-1930.579)       |         271283.073 (+-2243.245)         |     1.108 (+-0.000)      |         2676.054 (+-24.632)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)          |       2676.217 (+-16.601)       |      248658.668 (+-2904.952)       |         296514.520 (+-2983.281)         |     1.192 (+-0.000)      |         2682.844 (+-19.886)

      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |        1768.437 (+-6.294)       |        2934.013 (+-28.870)         |            2520.649 (+-6.797)           |     0.859 (+-0.000)      |          1759.292 (+-5.097)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |        1748.660 (+-5.550)       |         3271.104 (+-7.557)         |            2891.306 (+-7.632)           |     0.884 (+-0.000)      |          1746.341 (+-5.845)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |        2813.150 (+-6.656)       |         3258.973 (+-7.543)         |            2766.286 (+-6.473)           |     0.849 (+-0.000)      |          2805.077 (+-7.611)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |        2812.102 (+-8.211)       |         3568.780 (+-9.018)         |            3125.870 (+-7.324)           |     0.876 (+-0.000)      |          2834.178 (+-9.034)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |        1687.975 (+-9.527)       |         2752.085 (+-9.627)         |            2373.274 (+-7.888)           |     0.862 (+-0.000)      |          1698.782 (+-8.098)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |        1696.606 (+-8.678)       |        3056.317 (+-13.303)         |           2699.160 (+-10.638)           |     0.883 (+-0.000)      |         1684.942 (+-10.519)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |        2613.491 (+-9.769)       |        3176.493 (+-13.366)         |            2730.193 (+-9.573)           |     0.859 (+-0.000)      |          2625.085 (+-9.943)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       2614.946 (+-34.129)       |        3465.398 (+-11.165)         |           3044.396 (+-11.447)           |     0.879 (+-0.000)      |          2627.355 (+-9.608)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |       10784.549 (+-58.181)      |        18292.452 (+-59.344)        |           15909.922 (+-49.864)          |     0.870 (+-0.000)      |         10837.656 (+-51.947)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |       10786.513 (+-52.308)      |        20449.038 (+-56.204)        |           18295.997 (+-54.522)          |     0.895 (+-0.000)      |         10843.751 (+-44.781)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |       17532.699 (+-64.807)      |        20425.699 (+-80.271)        |           17517.040 (+-79.705)          |     0.858 (+-0.000)      |         17595.597 (+-61.870)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |       17530.816 (+-55.131)      |        22450.080 (+-92.899)        |           19827.828 (+-77.649)          |     0.883 (+-0.000)      |         17615.934 (+-71.716)

      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)     |       6875.484 (+-40.543)       |        11569.509 (+-62.462)        |          10053.350 (+-208.136)          |     0.869 (+-0.000)      |         6864.501 (+-46.747)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)    |       6843.126 (+-44.498)       |        12915.236 (+-60.654)        |          25335.058 (+-382.640)          |     1.962 (+-0.000)      |         6899.002 (+-46.861)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (256, 256)         |       11103.418 (+-51.318)      |        28834.389 (+-78.395)        |          37405.463 (+-581.646)          |     1.297 (+-0.000)      |         11223.012 (+-60.709)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (256, 256)        |       11092.994 (+-70.835)      |       36597.023 (+-118.988)        |           45761.267 (+-85.051)          |     1.250 (+-0.000)      |         11104.014 (+-61.288)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)   |       7106.791 (+-63.666)       |        11191.071 (+-45.402)        |           9786.037 (+-75.781)           |     0.874 (+-0.000)      |         7129.419 (+-77.674)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)  |       7146.519 (+-28.376)       |        12443.571 (+-39.425)        |           20147.067 (+-74.771)          |     1.619 (+-0.000)      |         7179.622 (+-64.847)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (200, 300)       |       10533.849 (+-44.227)      |       34814.909 (+-138.127)        |          42803.001 (+-114.326)          |     1.229 (+-0.000)      |         10644.039 (+-59.681)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (200, 300)      |       10548.910 (+-44.221)      |       42876.940 (+-146.959)        |          49711.443 (+-139.276)          |     1.159 (+-0.000)      |         10652.375 (+-44.174)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)     |      42814.521 (+-103.198)      |       73100.489 (+-435.262)        |          63587.659 (+-134.266)          |     0.870 (+-0.000)      |        43208.921 (+-195.287)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)    |      42812.373 (+-103.870)      |       81769.160 (+-373.369)        |         175159.813 (+-2028.558)         |     2.142 (+-0.000)      |         43007.691 (+-96.358)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (600, 700)         |      69955.505 (+-373.373)      |      215248.616 (+-2040.775)       |         267511.246 (+-2094.161)         |     1.243 (+-0.000)      |        70382.679 (+-594.941)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (600, 700)        |      69852.157 (+-490.076)      |      242841.484 (+-19645.513)      |         317931.678 (+-2016.498)         |     1.309 (+-0.000)      |        70074.819 (+-352.919)

Times are in microseconds (us).

[-------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                     |  Eager (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git0c61c20) PR  |  Compiled (2.4.0a0+git069270d) Nightly  |  speed-up PR vs Nightly  |  Eager (2.4.0a0+git069270d) Nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |         97.727 (+-0.018)        |          97.765 (+-0.025)          |             97.773 (+-0.027)            |     1.000 (+-0.000)      |           97.905 (+-0.040)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.615 (+-0.066)        |          97.332 (+-0.032)          |             97.950 (+-0.026)            |     1.006 (+-0.000)      |           97.690 (+-0.062)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        100.635 (+-0.033)        |         125.883 (+-0.020)          |            102.499 (+-0.116)            |     0.814 (+-0.000)      |          101.103 (+-0.027)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        100.898 (+-0.036)        |         109.717 (+-0.336)          |            102.558 (+-0.120)            |     0.935 (+-0.000)      |          101.642 (+-0.105)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)   |        462.853 (+-0.028)        |         382.475 (+-0.047)          |            382.472 (+-0.033)            |     1.000 (+-0.000)      |          462.188 (+-0.014)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)  |        462.783 (+-0.021)        |         382.806 (+-0.037)          |            382.563 (+-0.043)            |     0.999 (+-0.000)      |          462.089 (+-0.028)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (1234, 1345)       |        466.721 (+-0.022)        |         384.438 (+-0.027)          |            384.886 (+-0.037)            |     1.001 (+-0.000)      |          467.014 (+-0.025)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (1234, 1345)      |        466.993 (+-0.032)        |         384.212 (+-0.009)          |            383.946 (+-0.029)            |     0.999 (+-0.000)      |          466.575 (+-0.020)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        190.070 (+-0.082)        |         209.353 (+-1.096)          |            202.870 (+-0.888)            |     0.969 (+-0.000)      |          189.371 (+-0.164)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        190.021 (+-0.018)        |         210.504 (+-0.456)          |            201.814 (+-0.770)            |     0.959 (+-0.000)      |          189.314 (+-0.036)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        188.860 (+-0.207)        |         336.635 (+-0.023)          |            252.026 (+-0.510)            |     0.749 (+-0.000)      |          188.860 (+-0.170)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        188.725 (+-0.214)        |         276.329 (+-0.563)          |            251.439 (+-0.524)            |     0.910 (+-0.000)      |          188.776 (+-0.189)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)   |        781.879 (+-0.086)        |         836.389 (+-7.177)          |            816.483 (+-6.626)            |     0.976 (+-0.000)      |          781.362 (+-0.106)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)  |        781.824 (+-0.099)        |         840.406 (+-7.111)          |            807.530 (+-6.514)            |     0.961 (+-0.000)      |          781.307 (+-0.129)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: True, antialias: False, osize: (2345, 2456)       |        769.290 (+-0.309)        |         675.498 (+-1.537)          |            688.171 (+-4.326)            |     1.019 (+-0.000)      |          769.830 (+-0.222)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bicubic, align_corners: False, antialias: False, osize: (2345, 2456)      |        769.240 (+-0.179)        |         675.800 (+-1.113)          |            673.176 (+-1.740)            |     0.996 (+-0.000)      |          769.935 (+-0.171)

Times are in microseconds (us).

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120411
Approved by: https://github.com/lezcano
2024-03-29 13:15:25 +00:00
andrewor14
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
BowenBao
8c96b4367a Remove opmath cast for im2col decomp (#121363)
It is unclear why an opmath cast is needed for the im2col decomp, given that the decomposition mainly performs padding, slicing, indexing, and shape manipulation. There is no need to perform these operations in higher precision; doing so requires more memory and yields worse performance.

Sample script demonstrating the inserted cast before this change:

```python
import torch
from torch._decomp.decompositions import im2col

def func(x):
    return torch.nn.functional.unfold(
        x, kernel_size=[3, 1], padding=[2, 0], dilation=1, stride=1
    )

x = torch.rand(1, 1, 5, 5, dtype=torch.float16)

eo = torch._dynamo.export(
    func, aten_graph=True, decomposition_table={torch.ops.aten.im2col.default: im2col}
)(x)
eo.graph_module.print_readable()
```

```
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f16[1, 1, s0, s0]";

        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        arg0_1 = arg0

        _to_copy: "f32[1, 1, s0, s0]" = torch.ops.aten._to_copy.default(arg0_1, dtype = torch.float32)
        ...
        constant_pad_nd: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.constant_pad_nd.default(_to_copy, [0, 0, 2, 2], 0.0);  _to_copy = None
        ...
        slice_1: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.slice.Tensor(constant_pad_nd, 0, 0, 9223372036854775807);  constant_pad_nd = None
        slice_2: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 9223372036854775807);  slice_1 = None
        index: "f32[1, 1, 3, s0 + 2, 1, s0]" = torch.ops.aten.index.Tensor(slice_2, [None, None, unsqueeze_5, add_3]);  slice_2 = unsqueeze_5 = add_3 = None
        permute: "f32[1, 1, 3, 1, s0 + 2, s0]" = torch.ops.aten.permute.default(index, [0, 1, 2, 4, 3, 5]);  index = None
        ...
        view: "f32[1, 3, s0**2 + 2*s0]" = torch.ops.aten.view.default(permute, [1, 3, mul]);  permute = mul = None
        _to_copy_1: "f16[1, 3, s0**2 + 2*s0]" = torch.ops.aten._to_copy.default(view, dtype = torch.float16);  view = None
        return pytree.tree_unflatten([_to_copy_1], self._out_spec)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121363
Approved by: https://github.com/lezcano
2024-03-09 15:37:27 +00:00
Boyuan Feng
35d3adb4b0 Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
# Motivation

In the backward pass of per-parameter sharding FSDP, each rank performs a reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0th dimension and concatenates all slices along the 1st dimension. Gradient tensors are padded before concatenation when `tensor.size(0) % world_size != 0`.

### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):

Input tensors:
```
AAAA   BBB   CC
AAAA   BBB
       BBB
```

Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```

### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):

Input tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Reduce-scatter-copy-in first pads the inputs:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Then chunks and concatenates along the dim to produce the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```

The performance of reduce-scatter-copy-in is critical to per-parameter sharding FSDP. However, composing reduce-scatter-copy-in from existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.

# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:

```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```

This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and a basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with baselines composing existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.

## Requirements on input

1. If the input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, both negative and non-negative dim are supported.
2. For wrapped_dim, all tensors should have the same size for dimensions 0, ..., wrapped_dim-1. There are no requirements on the (wrapped_dim, ...) dimensions.
3. num_chunks must be positive.
4. The input tensor list must be non-empty, and each input tensor must have at least one element.
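
For illustration, a rough reference of the 2-D, dim=0 behavior from Example 1, composed from existing ops (a sketch only, not the actual ATen implementation):

```python
import torch

def chunk_cat_ref(tensors, num_chunks):
    rows = [[] for _ in range(num_chunks)]
    for t in tensors:
        pad = (-t.size(0)) % num_chunks          # zero-pad dim 0 to a multiple
        if pad:
            t = torch.cat([t, t.new_zeros(pad, *t.shape[1:])])
        for i, chunk in enumerate(t.chunk(num_chunks, dim=0)):
            rows[i].append(chunk.reshape(-1))    # flatten each slice
    return torch.stack([torch.cat(r) for r in rows])

a, b, c = torch.ones(2, 4), 2 * torch.ones(3, 3), 3 * torch.ones(1, 2)
print(chunk_cat_ref([a, b, c], num_chunks=3).shape)  # torch.Size([3, 9])
```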

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
2024-03-08 21:48:12 +00:00
andrewor14
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
PyTorch MergeBot
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
Tugsbayasgalan Manlaibaatar
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
Kurt Mohler
13a54ce279 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-03-01 05:05:28 +00:00
Jane Xu
da559c98e3 Fix isin decomp and add python meta registration (#120821)
Fixes #119792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120821
Approved by: https://github.com/malfet, https://github.com/peterbell10
2024-02-29 22:08:50 +00:00
PyTorch MergeBot
86ff31c4a0 Revert "Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)"
This reverts commit cabc09a5f2.

Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100))
2024-02-28 22:30:18 +00:00
laith sakka
d21c6eb215 Do not wrap output with input device inside _to_copy (#119868)
Fixing https://github.com/pytorch/pytorch/issues/118790

This diff reverts a small part of the code that was introduced in https://github.com/pytorch/pytorch/pull/104689

The PR above added a comment that "In case of dtype promotion, fake tensor converted into tensor",
but it's not always the case that a conversion in dtype causes a fake tensor to become a regular tensor.

When such a conversion does not happen, we get the following error:
```
Creating a new Tensor subclass FakeTensor but the raw Tensor object is already associated to
 a python object of type FakeTensor
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119868
Approved by: https://github.com/ezyang, https://github.com/thiagocrepaldi
2024-02-28 01:51:43 +00:00
Kurt Mohler
cabc09a5f2 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-02-28 00:37:33 +00:00
Isuru Fernando
435063aa89 Decomposition for upsample_linear{1d, 3d} (#114774)
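For intuition, a sketch of the source-coordinate math such a decomposition implements for the 1-D, align_corners=False case (illustrative only; the helper name is hypothetical):

```python
import torch

def upsample_linear1d_ref(x, osize):
    isize = x.shape[-1]
    scale = isize / osize
    j = torch.arange(osize, dtype=torch.float64)
    # Map each output coordinate to a source coordinate, then blend the
    # two nearest input samples by the fractional offset.
    src = ((j + 0.5) * scale - 0.5).clamp(0, isize - 1)
    lo = src.floor().to(torch.int64)
    hi = (lo + 1).clamp(max=isize - 1)
    w = (src - lo).to(x.dtype)
    return x[..., lo] * (1 - w) + x[..., hi] * w

print(upsample_linear1d_ref(torch.arange(4.0), 8))
```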
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114774
Approved by: https://github.com/lezcano, https://github.com/vfdev-5, https://github.com/peterbell10
2024-02-27 11:57:45 +00:00
Isuru Fernando
b7df3bba62 add decomposition for frexp (#119217)
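A sketch of the underlying math (one way to compute frexp from log2/floor; not necessarily the PR's approach, and inexact at floating-point edge cases):

```python
import torch

# x = mantissa * 2**exponent with |mantissa| in [0.5, 1) for nonzero x.
def frexp_ref(x):
    exponent = torch.where(
        x == 0,
        torch.zeros_like(x),
        torch.floor(torch.log2(torch.abs(x))) + 1,
    )
    mantissa = x / torch.pow(2.0, exponent)
    return mantissa, exponent.to(torch.int32)

m, e = frexp_ref(torch.tensor([0.0, 1.0, -3.5, 1024.0]))
print(m)  # tensor([ 0.0000,  0.5000, -0.8750,  0.5000])
print(e)  # tensor([ 0,  1,  2, 11], dtype=torch.int32)
```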
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119217
Approved by: https://github.com/peterbell10
ghstack dependencies: #119284, #120027
2024-02-23 21:52:42 +00:00
Angela Yi
6d82a7e9b0 Add pixel_shuffle to core aten decomps (#120092)
Summary:
https://github.com/pytorch/pytorch/pull/118239 added a decomposition
for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We
have also fixed the internal use case so that it no longer special-cases
pixel_shuffle, allowing us to revert the changes in
https://github.com/pytorch/pytorch/pull/118921.
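
For reference, pixel_shuffle's standard reshape/permute decomposition (a sketch matching the documented semantics, not necessarily the exact code from #118239):

```python
import torch

def pixel_shuffle_ref(x, upscale_factor):
    b, c, h, w = x.shape
    r = upscale_factor
    # (B, C*r*r, H, W) -> (B, C, r, r, H, W) -> (B, C, H, r, W, r)
    x = x.reshape(b, c // (r * r), r, r, h, w)
    x = x.permute(0, 1, 4, 2, 5, 3)
    return x.reshape(b, c // (r * r), h * r, w * r)

x = torch.randn(1, 8, 3, 3)
print(torch.equal(pixel_shuffle_ref(x, 2), torch.pixel_shuffle(x, 2)))  # True
```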

Test Plan: CI

Differential Revision: D53860966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120092
Approved by: https://github.com/ydwu4
2024-02-20 18:37:32 +00:00
lezcano
b97fa6ac30 Make roll a decomposition and remove its lowering (#119857)
We use the fact that we now propagate indexing properly to avoid having
to maintain two different implementations of the op. In doing so, we also
remove a spurious guard on this op.

We move the ref into a decomp, as it now uses advanced indexing.
The only change in the implementation is that we now use
advanced indexing rather than `torch.cat`.

We also remove it from core. Let's see how this goes.
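
A sketch of the advanced-indexing formulation for a single dimension (illustrative; the actual decomp generalizes over multiple shifts and dims):

```python
import torch

def roll_ref(x, shift, dim):
    n = x.size(dim)
    # Rolling by `shift` is a gather at (i - shift) mod n along dim.
    idx = (torch.arange(n, device=x.device) - shift) % n
    return x.index_select(dim, idx)

print(roll_ref(torch.arange(5), 2, 0))    # tensor([3, 4, 0, 1, 2])
print(torch.roll(torch.arange(5), 2, 0))  # same result
```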

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119857
Approved by: https://github.com/peterbell10, https://github.com/larryliu0820
ghstack dependencies: #119863, #119864
2024-02-16 19:14:39 +00:00
PyTorch MergeBot
86dedebeaf Revert "Add pixel_shuffle to core aten decomps (#119899)"
This reverts commit 9201d7335a.

Reverted https://github.com/pytorch/pytorch/pull/119899 on behalf of https://github.com/huydhn due to Sorry for reverting your change but keep the diff D53766709 around while investigating the failed tests is not a good practice and could lead to out of sync issue, so it is better to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/119899#issuecomment-1948970686))
2024-02-16 17:44:59 +00:00
Andrew M. James
4625ecb858 Add decomp for linalg.cross (#119809)
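One well-known way to express cross as a decomposition is via rolled slices (a sketch, not necessarily this PR's formulation):

```python
import torch

# cross(a, b) along a size-3 dim satisfies
#   cross = roll(a, -1) * roll(b, -2) - roll(a, -2) * roll(b, -1)
def cross_ref(a, b, dim=-1):
    return (a.roll(-1, dims=dim) * b.roll(-2, dims=dim)
            - a.roll(-2, dims=dim) * b.roll(-1, dims=dim))

a, b = torch.randn(3), torch.randn(3)
print(torch.allclose(cross_ref(a, b), torch.linalg.cross(a, b)))  # True
```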
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119809
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-02-16 09:58:38 +00:00
Angela Yi
9201d7335a Add pixel_shuffle to core aten decomps (#119899)
Summary: https://github.com/pytorch/pytorch/pull/118239 added a decomposition for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We have also fixed the internal use case so that it no longer special cases on pixel_shuffle, allowing us to revert the changes in https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53766709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119899
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-02-14 21:01:11 +00:00
Aaron Meurer
5ce305270b Add a decomposition for isin() (#115390)
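For context, the simplest broadcast-based formulation such a decomposition can take (a sketch; the actual decomposition may choose different strategies depending on input sizes):

```python
import torch

def isin_ref(elements, test_elements, *, invert=False):
    # Compare every element against every test element, then reduce.
    cmp = elements.unsqueeze(-1) == test_elements.reshape(-1)
    result = cmp.any(dim=-1)
    return ~result if invert else result

print(isin_ref(torch.tensor([1, 2, 3]), torch.tensor([2, 4])))
# tensor([False,  True, False])
```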
Co-authored-by: Peter Bell <peterbell10@live.co.uk>
Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115390
Approved by: https://github.com/peterbell10
2024-02-14 03:03:42 +00:00
Edward Z. Yang
52de407b6c Avoid performing replacements when it would unrefine ranges (#117356)
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.

This PR does the following:

* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
2024-02-13 15:56:59 +00:00
PyTorch MergeBot
472500e32a Revert "Avoid performing replacements when it would unrefine ranges (#117356)"
This reverts commit 0e6b314fc2.

Reverted https://github.com/pytorch/pytorch/pull/117356 on behalf of https://github.com/huydhn due to Sorry for reverting the change but it looks like the forward fix still needs more work https://github.com/pytorch/pytorch/pull/119712, so it would be cleaner to reland them ([comment](https://github.com/pytorch/pytorch/pull/117356#issuecomment-1940032407))
2024-02-13 01:16:58 +00:00
vfdev-5
ed20e9118b Fixed hash issue in fx_graph_cse (#119567)
Description:
- Fixed a hash collision issue: `hash((primals_2, 1.0)) == hash((primals_2, 1))`
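
The collision comes from Python's cross-type numeric hashing, which this snippet demonstrates:

```python
# In Python, equal numbers hash equally across types, so tuple keys built
# from (node, 1) and (node, 1.0) collide in a dict:
print(hash(1) == hash(1.0))    # True
print(("x", 1) == ("x", 1.0))  # True -> same dict key
```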

Repro code:
```python
import torch
from torch._functorch.compile_utils import fx_graph_cse

def func(inpt, osize):
    size = inpt.shape[-1]
    s1 = size - 1
    s2 = size - 1.0
    scale = s2 / (osize - 1.0)
    inpt = torch.clamp(inpt, 0, s1)
    return scale * inpt

gms = []
def toy_backend(gm, _):
    gms.append(gm)
    return gm.forward

torch._dynamo.reset()
fn = torch.compile(backend=toy_backend, dynamic=True)(func)
t = torch.rand(3, 100)
out = fn(t, 50)
gm = gms[0]

print(gm.graph)
new_fx_g = fx_graph_cse(gm.graph)
print(str(new_fx_g))
```
Original graph
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=2] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_1 : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1.0), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub_1, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```
New, wrong graph, where `sub_1` is incorrectly replaced with `sub`:
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=2] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```
With this PR the new graph is the following:
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=2] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_1 : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1.0), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub_1, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119567
Approved by: https://github.com/eellison
2024-02-12 18:52:11 +00:00
Edward Z. Yang
0e6b314fc2 Avoid performing replacements when it would unrefine ranges (#117356)
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.

This PR does the following:

* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
2024-02-09 14:43:58 +00:00
CaoE
dfdbd73360 add Half support for flash attention (#119247)
Re-open for https://github.com/pytorch/pytorch/pull/118368.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119247
Approved by: https://github.com/drisspg, https://github.com/malfet
2024-02-07 05:57:41 +00:00
Edward Z. Yang
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a "guard on unbacked SymInt" problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them wherever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
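
A hedged illustration of what a size-oblivious guard looks like at a use site (simplified; real framework call sites vary):

```
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def maybe_squeeze_last(t: torch.Tensor) -> torch.Tensor:
    # Under a size-oblivious guard, size-like unbacked SymInts are assumed
    # to be >= 2 for this comparison, so the branch can often be resolved
    # statically instead of raising a data-dependent guard error.
    if guard_size_oblivious(t.shape[-1] == 1):
        return t.squeeze(-1)
    return t

# With plain (backed) sizes the guard simply evaluates the condition:
print(maybe_squeeze_last(torch.ones(2, 1)).shape)  # torch.Size([2])
```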

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
Mengwei Liu
1e4b408b02 [decomp] Add tests for different dtypes to SDPA decomposition (#119239)
Summary: As titled. Skipping torch.bfloat16 because for some reason the
difference is 0.01.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119239
Approved by: https://github.com/drisspg
2024-02-06 11:17:07 +00:00
angelayi
1adedc3c86 [decomp] Remove pixel_shuffle from core aten decomps (#118921)
pixel_shuffle is a core aten op
(https://pytorch.org/docs/main/torch.compiler_ir.html#core-aten-ir) so we should not decompose it.

https://github.com/pytorch/pytorch/pull/118239 added a decomp for it which is causing an internal test failure
(https://www.internalfb.com/intern/test/281475090561210/) which cases on the pixel_shuffle operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118921
Approved by: https://github.com/SherlockNoMad, https://github.com/lezcano
2024-02-03 08:21:32 +00:00
Yifu Wang
a1280f0cc6 Add an OpInfo test for split_with_sizes_copy (#118512)
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, we add the `OpInfo` test in a separate PR to establish a healthy baseline.

Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
2024-02-01 07:09:27 +00:00
Elias Ellison
e87ac82c98 Fix missing default dim param in weight norm interface decomp (#118762)
Fix for https://github.com/pytorch/pytorch/issues/118742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118762
Approved by: https://github.com/ezyang, https://github.com/shunting314
2024-01-31 22:10:10 +00:00
Isuru Fernando
81d12846dc Add decomp for pixel_shuffle/unshuffle (#118239)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118239
Approved by: https://github.com/peterbell10
2024-01-31 18:34:21 +00:00
soulitzer
81b55f58ce Matmul decide should_fold using has_out instead of grad_mode (#118617)
Fixes https://github.com/pytorch/pytorch/issues/118548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118617
Approved by: https://github.com/lezcano
2024-01-31 18:34:16 +00:00
Isuru Fernando
2f7839e6db register decomposition for rsub in torch._refs (#118288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118288
Approved by: https://github.com/lezcano
ghstack dependencies: #118398
2024-01-30 22:18:15 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Qingpeng Li
827949cef2 accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)
When I was reimplementing BCEWithLogits, I found that the `log_sigmoid` operator could accelerate the function.

Simple benchmark on AMD 3600 CPU Ubuntu 22.04:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|original|1986|1658|
|this PR|1295|995|

35-40% faster. This likely benefits from the `log_sigmoid` vectorization code.

A CUDA benchmark was not obtained, but I believe CUDA can also benefit from reduced kernel launches, as https://github.com/pytorch/pytorch/pull/11054#issuecomment-442233714 and https://github.com/pytorch/pytorch/pull/78267#issue-1248398454 mention.
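
The identity being exploited, as a minimal sketch (not the actual kernel code): since log(1 - sigmoid(x)) = logsigmoid(-x), the loss can be written entirely in terms of `log_sigmoid`:

```
import torch
import torch.nn.functional as F

def bce_with_logits_via_logsigmoid(input, target, pos_weight=None):
    # -[w_p * t * log sigmoid(x) + (1 - t) * log sigmoid(-x)], using
    # log(1 - sigmoid(x)) == logsigmoid(-x) for numerical stability.
    w = pos_weight if pos_weight is not None else 1.0
    return -(w * target * F.logsigmoid(input)
             + (1 - target) * F.logsigmoid(-input)).mean()

x, t = torch.randn(8), torch.rand(8)
ref = F.binary_cross_entropy_with_logits(x, t)
assert torch.allclose(bce_with_logits_via_logsigmoid(x, t), ref, atol=1e-6)
```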

The simple benchmark cpp file:
[demo.txt](https://github.com/pytorch/pytorch/files/13635355/demo.txt)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115539
Approved by: https://github.com/malfet
2024-01-30 13:24:13 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
albanD
24133e44b1 Fix return type hint for list types (#118238)
All single element list types are `Tensor[]` so they will always be Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run so no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
2024-01-25 23:35:20 +00:00
PyTorch MergeBot
8dc421a6b4 Revert "accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)"
This reverts commit 03b12e56c7.

Reverted https://github.com/pytorch/pytorch/pull/115539 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/115539#issuecomment-1904157729))
2024-01-22 14:48:35 +00:00
Qingpeng Li
03b12e56c7 accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)
When I was reimplementing BCEWithLogits, I found that the `log_sigmoid` operator could accelerate the function.

Simple benchmark on AMD 3600 CPU Ubuntu 22.04:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|original|1986|1658|
|this PR|1295|995|

35-40% faster. This likely benefits from the `log_sigmoid` vectorization code.

A CUDA benchmark was not obtained, but I believe CUDA can also benefit from reduced kernel launches, as https://github.com/pytorch/pytorch/pull/11054#issuecomment-442233714 and https://github.com/pytorch/pytorch/pull/78267#issue-1248398454 mention.

The simple benchmark cpp file:
[demo.txt](https://github.com/pytorch/pytorch/files/13635355/demo.txt)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115539
Approved by: https://github.com/lezcano
2024-01-19 14:56:43 +00:00
Digant Desai
e2830e6328 [PyTorch] SDPA decomp: actually use attn_mask (#117579)
Summary: Need to pass the attn_mask argument along to the decomposition

Test Plan:
```
cd ~/fbsource/fbcode/executorch/backends/xnnpack/test
buck test fbcode//mode/dev-nosan :test_xnnpack_ops -- test_fp32_sdpa
buck run fbcode//mode/dev-nosan :test_xnnpack_models -- executorch.backends.xnnpack.test.models.llama2_et_example.TestLlama2ETExample.test_fp32
```

Reviewed By: larryliu0820

Differential Revision: D52812369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117579
Approved by: https://github.com/larryliu0820
2024-01-17 10:26:43 +00:00
vfdev-5
f6767244cf Added meta function for _upsample_bicubic2d_aa (#117347)
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
    return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E   torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E   aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E   from user code:
E      File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E       image = interpolate(
E
E   Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E   You can suppress this exception and fall back to eager by setting:
E       import torch._dynamo
E       torch._dynamo.config.suppress_errors = True
```
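
With the meta function registered, fake tensors can propagate shapes without running a real kernel; a small sanity check (assuming a build that includes this registration):

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    x = torch.empty(1, 3, 64, 64)
    y = torch.nn.functional.interpolate(
        x, size=(32, 32), mode="bicubic", align_corners=False, antialias=True
    )
    print(y.shape)  # torch.Size([1, 3, 32, 32]) -- only metadata is computed
```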
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
2024-01-16 23:33:55 +00:00
Mengwei Liu
1a8545164a [export] Add unit test for SDPA export result (#117390)
Summary:

A follow-up for #117097. In that PR I didn't add
`_scaled_dot_product_attention_for_cpu` to the core_aten_decomposition
table. This PR does that and also adds a unit test.

Test Plan: python test/export/test_export.py -k
test_scaled_dot_product_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
2024-01-14 00:21:28 +00:00
Aaron Orenstein
638f85fd67 Add default parameters to rrelu_with_noise() (#117141)
Summary:
rrelu_with_noise() was listed as having default parameters in the schema but the
actual code definition didn't have them.

The failing example was calling rrelu(), which DOES have default parameters, and
it passes those defaulted values to C++. Under the covers the C++ code was
calling the Python version of rrelu_with_noise().

Although the C++ code was passing all the values to the Python version of
rrelu_with_noise(), the PyTorch C++ -> Python dispatch code looks at the schema
and strips any parameters that match the schema's listed defaults, so if the
schema shows defaults that aren't in the code, it is a problem.
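
A hedged sketch of the required shape of the fix (the defaults shown mirror the schema's listed values; the real function body is elided):

```
# The Python definition must carry the same defaults the schema advertises,
# because the C++ -> Python dispatch layer omits arguments that equal those
# schema defaults when calling in.
def rrelu_with_noise(
    self,
    noise,
    lower=0.125,
    upper=0.3333333333333333,
    training=False,
    generator=None,
):
    ...
```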

Test Plan:
I added a unit test for this specific case. It would probably be better to write
a more general one to validate all the ops against their schemas - but I haven't
learned enough about the test harness to do that yet.

Fixes #115811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117141
Approved by: https://github.com/yanboliang, https://github.com/oulgen
2024-01-12 05:32:13 +00:00
Elias Ellison
e3d4f4d14b [ProxyTensor] dedupe symbolic shapes in tracing (#116158)
Dedupes symbolic shapes in proxy tensor tracing. Reusing the existing sym shape avoids inserting spurious sym_size calls, which can interfere with pattern matching and graph passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116158
Approved by: https://github.com/ezyang
2024-01-11 07:15:11 +00:00
Edward Z. Yang
edec54b9de Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #113397
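
A small usage illustration (hedged; `torch._lazy_clone` is a private API):

```
import torch

x = torch.ones(3)
y = torch._lazy_clone(x)  # copy-on-write: storage is shared until a write
y += 1                    # the first write materializes y's own storage
print(x, y)               # tensor([1., 1., 1.]) tensor([2., 2., 2.])
```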
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
2024-01-11 01:32:44 +00:00
Mengwei Liu
8783fe9cf3 [export] Modify SDPA decomposition to decompose _scaled_dot_product_flash_attention_for_cpu (#117097)
Summary: As titled. #115913 added
`_scaled_dot_product_flash_attention_for_cpu` and the export result of
`scaled_dot_product_attention` includes this op. Adding this
decomposition so that it is decomposed the same way as
`_scaled_dot_product_attention_math`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117097
Approved by: https://github.com/lezcano
2024-01-10 23:46:14 +00:00
Elias Ellison
d6540038c0 Fix 0-dim Index in Index Copy decomp (#117065)
Fix for https://github.com/pytorch/pytorch/issues/115931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117065
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-01-10 22:13:43 +00:00
Zhengxu Chen
b3f7fdbf0a Add decomp for pad_sequence (#116285)
Summary: currently pad_sequence causes symbolic shape specialization in export, which is unintended. Adding a decomp avoids the C++ kernel that caused the specialization.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r pad_sequence

Reviewed By: SherlockNoMad

Differential Revision: D52345667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116285
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-27 23:56:51 +00:00
Aaron Meurer
f08c4da86d Add a decomposition for take() (#114813)
Presumably this can close https://github.com/pytorch/pytorch/pull/109784

Also related to https://github.com/pytorch/pytorch/issues/93757 (though `take` is not listed there).

There's no bounds checking here (out of bounds indices cause a segfault or undefined behavior). Should that be added somehow?
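
The equivalence being used, as a minimal sketch (hedged; the in-tree ref may differ in details):

```
import torch

def take_decomp(x: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    # take() reads from the flattened input; advanced indexing with an
    # n-d index tensor already yields an index-shaped result. As noted
    # above, out-of-bounds indices are not checked here.
    return x.reshape(-1)[index]

x = torch.arange(12).reshape(3, 4)
idx = torch.tensor([[0, 5], [11, 2]])
assert torch.equal(take_decomp(x, idx), torch.take(x, idx))
```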

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114813
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-22 18:14:57 +00:00
vfdev-5
f727bed2e6 [inductor] Updated upsample_bilinear2d decomposition (#104182)
Description:
- Updated upsample_bilinear2d decomposition
  - added support for uint8 dtype support
  - code improvements
- Added uint8 dtype tests

Perf considerations:
- There is a minor perf regression (speed-up ~0.7) on uint8, align_corners=True cases when the output is smaller than or equal to (256, 256)
- For cases where the output is larger than (256, 256) and the input dtype is uint8, the nightly output is wrong, so IMO the large perf regression (speed-up around ~0.2) should not be taken into account.

## Perfs benchmarks

```
[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                    |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        565.212 (+-3.548)        |        1384.210 (+-10.798)         |           1230.996 (+-32.930)           |     0.889 (+-0.000)      |          566.253 (+-1.526)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |        565.404 (+-1.614)        |         1491.649 (+-7.763)         |            2974.959 (+-6.006)           |     1.994 (+-0.000)      |          566.476 (+-1.742)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        270.761 (+-0.861)        |         1557.777 (+-4.699)         |            1080.919 (+-4.243)           |     0.694 (+-0.000)      |          269.829 (+-0.986)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        270.960 (+-0.995)        |        1723.913 (+-12.433)         |            3191.938 (+-6.194)           |     1.852 (+-0.000)      |          269.962 (+-1.657)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |        1555.884 (+-5.169)       |         1178.753 (+-4.957)         |            1910.445 (+-5.988)           |     1.621 (+-0.000)      |          1560.804 (+-6.793)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |        1651.193 (+-6.952)       |         1323.466 (+-6.059)         |            3374.842 (+-8.168)           |     2.550 (+-0.000)      |          1653.497 (+-8.018)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |        978.482 (+-10.183)       |         1383.768 (+-4.341)         |            2147.841 (+-6.581)           |     1.552 (+-0.000)      |          979.983 (+-1.499)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        1074.472 (+-5.031)       |         1414.912 (+-5.754)         |           3590.968 (+-10.042)           |     2.538 (+-0.000)      |          1074.589 (+-3.948)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        2168.703 (+-8.964)       |        5400.528 (+-26.628)         |           4777.299 (+-11.891)           |     0.885 (+-0.000)      |          2168.133 (+-7.667)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |       2169.132 (+-12.618)       |        6583.866 (+-28.959)         |           11986.894 (+-45.838)          |     1.821 (+-0.000)      |         2174.488 (+-10.317)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        992.808 (+-6.086)        |         5985.028 (+-9.532)         |            4334.158 (+-9.423)           |     0.724 (+-0.000)      |          989.604 (+-5.499)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        987.618 (+-6.350)        |        6963.044 (+-28.885)         |           15441.096 (+-55.324)          |     2.218 (+-0.000)      |          985.573 (+-5.159)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |       6695.557 (+-35.067)       |        4657.603 (+-14.220)         |           8058.708 (+-41.684)           |     1.730 (+-0.000)      |         6714.996 (+-38.626)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |       7040.481 (+-39.486)       |        5445.704 (+-16.659)         |           13906.618 (+-53.298)          |     2.554 (+-0.000)      |         7034.453 (+-44.626)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |       3926.186 (+-10.660)       |        5741.433 (+-12.748)         |           9356.036 (+-40.848)           |     1.630 (+-0.000)      |         3930.598 (+-17.086)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        4308.536 (+-9.607)       |        6122.755 (+-47.278)         |           15637.567 (+-54.392)          |     2.554 (+-0.000)      |         4307.463 (+-11.268)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       2512.740 (+-10.860)       |         1573.590 (+-5.061)         |            451.355 (+-1.210)            |     0.287 (+-0.000)      |         2511.727 (+-10.930)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       2489.926 (+-11.915)       |         1537.233 (+-4.212)         |            2501.470 (+-7.446)           |     1.627 (+-0.000)      |         2500.000 (+-12.155)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |        632.032 (+-2.108)        |         1496.994 (+-4.194)         |            404.759 (+-1.064)            |     0.270 (+-0.000)      |          630.122 (+-4.086)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |        629.174 (+-4.386)        |         1708.935 (+-8.817)         |            2643.296 (+-9.723)           |     1.547 (+-0.000)      |          628.388 (+-1.326)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |        4409.941 (+-8.016)       |         1160.133 (+-4.698)         |            1897.089 (+-9.392)           |     1.635 (+-0.000)      |         4450.959 (+-10.438)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       4493.427 (+-11.703)       |         1329.226 (+-4.740)         |           2835.872 (+-12.241)           |     2.133 (+-0.000)      |          4506.973 (+-9.914)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        901.712 (+-4.071)        |         1320.739 (+-5.197)         |            2207.605 (+-8.219)           |     1.671 (+-0.000)      |          904.757 (+-4.558)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        990.080 (+-3.922)        |         1702.563 (+-7.909)         |           3074.196 (+-10.478)           |     1.806 (+-0.000)      |          990.482 (+-4.444)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       9785.550 (+-58.445)       |        6135.680 (+-33.569)         |           1628.572 (+-19.770)           |     0.265 (+-0.000)      |         9893.606 (+-62.377)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       9710.191 (+-57.597)       |        6066.824 (+-36.364)         |           10469.110 (+-42.775)          |     1.726 (+-0.000)      |         9919.022 (+-72.190)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |       2790.356 (+-12.188)       |        6134.101 (+-28.694)         |            1576.832 (+-6.030)           |     0.257 (+-0.000)      |         2761.122 (+-11.503)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |       2778.711 (+-13.603)       |        6608.528 (+-37.776)         |           10841.549 (+-49.429)          |     1.641 (+-0.000)      |         2753.037 (+-10.995)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |      45533.868 (+-102.618)      |         4962.994 (+-8.215)         |           9003.968 (+-38.179)           |     1.814 (+-0.000)      |        43531.261 (+-102.951)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       45932.699 (+-81.207)      |        5595.682 (+-11.482)         |           12302.907 (+-50.254)          |     2.199 (+-0.000)      |         43916.455 (+-80.468)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        3827.804 (+-8.057)       |        6311.580 (+-25.021)         |           11760.614 (+-51.531)          |     1.863 (+-0.000)      |         3849.959 (+-10.848)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        4169.007 (+-8.452)       |        6820.716 (+-35.310)         |           15264.633 (+-49.982)          |     2.238 (+-0.000)      |         4183.875 (+-19.104)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |        1306.914 (+-7.470)       |        10598.101 (+-38.410)        |           2678.031 (+-11.051)           |     0.253 (+-0.000)      |          1307.470 (+-8.519)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |        1307.268 (+-8.197)       |        10161.123 (+-45.643)        |           17148.842 (+-55.402)          |     1.688 (+-0.000)      |          1308.077 (+-8.553)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        548.574 (+-2.157)        |        10072.806 (+-41.368)        |            2408.971 (+-6.997)           |     0.239 (+-0.000)      |          547.726 (+-1.721)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |        546.664 (+-1.484)        |        11123.694 (+-43.636)        |           18058.070 (+-48.552)          |     1.623 (+-0.000)      |          547.151 (+-1.627)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       7935.051 (+-71.022)       |        7654.533 (+-29.512)         |           12414.194 (+-87.450)          |     1.622 (+-0.000)      |         7900.056 (+-53.997)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |       8546.732 (+-53.118)       |        8583.572 (+-35.656)         |          19111.824 (+-166.978)          |     2.227 (+-0.000)      |         8515.433 (+-63.300)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       6202.642 (+-34.355)       |        8915.622 (+-62.293)         |           14327.295 (+-52.188)          |     1.607 (+-0.000)      |         6213.329 (+-39.740)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       6811.128 (+-33.747)       |        9647.316 (+-50.837)         |           20830.594 (+-62.979)          |     2.159 (+-0.000)      |         6822.512 (+-37.092)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |       5079.586 (+-19.067)       |        42238.442 (+-87.643)        |           11282.141 (+-42.477)          |     0.267 (+-0.000)      |         5104.234 (+-17.706)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |       5079.575 (+-16.306)       |        41512.995 (+-83.710)        |          68789.816 (+-440.001)          |     1.657 (+-0.000)      |         5097.446 (+-21.724)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        2039.974 (+-8.614)       |       42322.773 (+-111.866)        |           10399.237 (+-43.140)          |     0.246 (+-0.000)      |         2043.808 (+-10.707)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |       2036.214 (+-10.083)       |        44353.281 (+-71.548)        |          73340.412 (+-324.780)          |     1.654 (+-0.000)      |          2039.000 (+-9.554)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       33821.523 (+-96.639)      |        30552.094 (+-65.023)        |          49494.486 (+-872.916)          |     1.620 (+-0.000)      |         33844.404 (+-92.466)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |      36196.104 (+-128.169)      |        34038.432 (+-79.697)        |          75761.226 (+-905.194)          |     2.226 (+-0.000)      |         36260.473 (+-94.642)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       24827.821 (+-77.335)      |        37006.218 (+-86.318)        |          61297.625 (+-898.192)          |     1.656 (+-0.000)      |         24823.275 (+-80.945)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       27266.138 (+-70.262)      |        40109.475 (+-94.248)        |          92086.075 (+-404.922)          |     2.296 (+-0.000)      |         27287.992 (+-89.507)

Times are in microseconds (us).

[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                      |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |         98.259 (+-0.014)        |          97.156 (+-0.008)          |             97.443 (+-0.031)            |     1.003 (+-0.000)      |           98.248 (+-0.021)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.048 (+-0.016)        |          97.480 (+-0.018)          |             96.819 (+-0.126)            |     0.993 (+-0.000)      |           97.045 (+-0.015)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |         97.944 (+-0.028)        |          91.686 (+-0.411)          |             93.894 (+-1.011)            |     1.024 (+-0.000)      |           97.933 (+-0.008)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |         98.008 (+-0.011)        |          91.205 (+-0.346)          |             96.854 (+-0.058)            |     1.062 (+-0.000)      |           97.203 (+-0.010)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |        384.318 (+-0.011)        |         382.793 (+-0.007)          |            382.472 (+-0.011)            |     0.999 (+-0.000)      |          384.701 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |        384.266 (+-0.009)        |         385.333 (+-0.024)          |            382.554 (+-0.022)            |     0.993 (+-0.000)      |          384.386 (+-0.016)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |        383.924 (+-0.011)        |         570.071 (+-0.030)          |            545.615 (+-0.051)            |     0.957 (+-0.000)      |          384.044 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |        384.184 (+-0.016)        |         560.857 (+-0.026)          |            552.447 (+-0.040)            |     0.985 (+-0.000)      |          384.063 (+-0.016)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        122.188 (+-0.053)        |         116.744 (+-1.006)          |            163.762 (+-0.015)            |     1.403 (+-0.000)      |          121.874 (+-0.015)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        122.156 (+-0.012)        |         182.692 (+-0.013)          |            161.653 (+-0.018)            |     0.885 (+-0.000)      |          121.926 (+-0.014)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        105.852 (+-0.324)        |         119.545 (+-0.294)          |            190.527 (+-0.023)            |     1.594 (+-0.000)      |          105.999 (+-0.446)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        106.507 (+-0.282)        |         120.060 (+-0.257)          |            162.330 (+-0.012)            |     1.352 (+-0.000)      |          106.567 (+-0.385)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        447.907 (+-0.015)        |         463.863 (+-1.779)          |            650.492 (+-0.331)            |     1.402 (+-0.000)      |          446.596 (+-0.017)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        447.750 (+-0.017)        |         723.832 (+-0.170)          |            641.539 (+-0.075)            |     0.886 (+-0.000)      |          446.467 (+-0.019)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        439.549 (+-0.031)        |         507.772 (+-2.879)          |            758.795 (+-0.482)            |     1.494 (+-0.000)      |          440.372 (+-0.025)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        439.538 (+-0.029)        |         509.260 (+-2.704)          |            654.195 (+-2.621)            |     1.285 (+-0.000)      |          440.362 (+-0.026)

Times are in microseconds (us).
```

[Source](f4751a3196/perf_interp_mode.py), [Output](899f34c024/output/20231213-214209-upsample-bilinear-pr_vs_nightly-speedup.md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104182
Approved by: https://github.com/lezcano
2023-12-14 14:50:06 +00:00
angelayi
639060cb0b Use get_mkldnn_enabled for decompositions (#115448)
`torch._C.has_mkldnn` does not respect cases where users try to disable mkldnn using `torch._C._set_mkldnn_enabled()`. This is relevant to edge use cases, where users do not want decompositions to go to the ATen opset and do not want the mkldnn operator to appear in the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115448
Approved by: https://github.com/jgong5, https://github.com/ydwu4
2023-12-12 22:42:51 +00:00
Isuru Fernando
505574c46a Add decomposition for torch.block_diag (#115096)
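The decomposition idea, as a hedged sketch assuming 2-D inputs (the in-tree implementation may differ):

```
import torch
import torch.nn.functional as F

def block_diag_decomp(tensors):
    # Zero-pad each block on the left/right so it lands on the diagonal,
    # then stack the padded rows.
    total_cols = sum(t.shape[1] for t in tensors)
    rows, col = [], 0
    for t in tensors:
        rows.append(F.pad(t, (col, total_cols - col - t.shape[1])))
        col += t.shape[1]
    return torch.cat(rows)

a, b = torch.ones(2, 2), torch.full((1, 3), 2.0)
assert torch.equal(block_diag_decomp([a, b]), torch.block_diag(a, b))
```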
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115096
Approved by: https://github.com/peterbell10
2023-12-11 20:04:22 +00:00
Isuru Fernando
d40a7c6026 Add decompositions for replication_pad (#115113)
Fixes #115395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115113
Approved by: https://github.com/peterbell10
2023-12-09 02:44:07 +00:00
Isuru Fernando
fb19947962 Add decompositions for reflection_pad{1, 2, 3}d (#115100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115100
Approved by: https://github.com/peterbell10
2023-12-08 23:05:57 +00:00
Jason Ansel
7979ba7b43 [inductor] Add dropout type check to match eager (#115040)
Fixes #98970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115040
Approved by: https://github.com/oulgen
2023-12-03 23:05:02 +00:00
Kurt Mohler
6f32eb7eef Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-12-01 18:56:09 +00:00
PyTorch MergeBot
013675ff59 Revert "Add decomp for replication_pad2d and use for CUDA deterministic (#111590)"
This reverts commit f1286161a6.

Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing XLA job.  The job is also failing on the PR, but the log classifier failed to find the failed test which lead to it being marked wrongly as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794))
2023-11-30 02:28:14 +00:00
Kurt Mohler
f1286161a6 Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-11-29 21:50:46 +00:00
Antonio Kim
7fc292930c Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` (usage sketch after this list)
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)
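
A usage sketch of the new `generator` argument (assuming a build with this change):

```
import torch

g = torch.Generator().manual_seed(0)
w = torch.empty(3, 3)
torch.nn.init.uniform_(w, generator=g)  # reproducible init without touching the global RNG
```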

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-21 23:07:21 +00:00
vfdev-5
1f8d00c5a3 [inductor] Added decomposition for upsample_nearest_exact Nd (#113749)
Description:
- Added decomposition for upsample_nearest_exact: 1d, 2d, 3d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113749
Approved by: https://github.com/lezcano
2023-11-21 13:03:47 +00:00
PyTorch MergeBot
fe428a284b Revert "Add torch._lazy_clone to create COW tensors (#113397)"
This reverts commit 9916d8a9ea.

Reverted https://github.com/pytorch/pytorch/pull/113397 on behalf of https://github.com/DanilBaibak due to Unfortunately, I need to revert your PR because the lower [PR in the stack](https://github.com/pytorch/pytorch/pull/113396) is failing a bunch of internal build jobs. ([comment](https://github.com/pytorch/pytorch/pull/113397#issuecomment-1818761224))
2023-11-20 10:21:09 +00:00
GD06
b30580e121 [PT] Include tensor shape info in the error messages of torch split (#113984)
Summary: Include tensor shape info in the error messages of torch split.

Test Plan: CI

Differential Revision: D51436684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113984
Approved by: https://github.com/ezyang
2023-11-19 01:34:57 +00:00
Kurt Mohler
9916d8a9ea Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
ghstack dependencies: #113396
2023-11-17 01:58:51 +00:00
PyTorch MergeBot
252e68a83b Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 54493fe8c4.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))
2023-11-15 00:51:23 +00:00
Antonio Kim
54493fe8c4 Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-13 23:18:14 +00:00
Mengwei Liu
5506b9db43 [decomp] Fix _scaled_dot_product_flash_attention decomposition bug (#113102)
For `_scaled_dot_product_flash_attention` we don't have

`Tensor? attn_mask=None`

but `scaled_dot_product_attention` does. In the original decomp there's a
mixup where I added this argument to
`_scaled_dot_product_flash_attention`.

Fix it so that `_scaled_dot_product_flash_attention` is being decomposed correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113102
Approved by: https://github.com/ezyang
2023-11-08 21:47:37 +00:00
PyTorch MergeBot
9a28a7b498 Revert "Add support for torch.Generator type in TorchScript (#110413)"
This reverts commit 27e31ab6e8.

Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1799003164))
2023-11-07 15:53:32 +00:00
Antonio Kim
27e31ab6e8 Add support for torch.Generator type in TorchScript (#110413)
- Add support for `torch.Generator` type in TorchScript
- Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_`
- Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab)

CC: @eellison @davidberard98 @GlebKazantaev @behzad-a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98
2023-11-06 21:27:02 +00:00
Han Qi
5a6f8014c4 Add a decomposition for _weight_norm_interface. (#112193)
Fixes #112086
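
The decomposition's core computation, as a hedged sketch (`_weight_norm_interface` returns both the normalized weight and the norm):

```
import torch

def weight_norm_interface_decomp(v, g, dim=0):
    # L2 norm over every dim except `dim`, then rescale v by g / norm.
    keep = [d for d in range(v.dim()) if d != dim]
    norm = v.norm(2, keep, keepdim=True)
    return v * (g / norm), norm

v, g = torch.randn(4, 5), torch.randn(4, 1)
w, _ = weight_norm_interface_decomp(v, g, dim=0)
assert torch.allclose(w, torch._weight_norm(v, g, 0))
```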

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112193
Approved by: https://github.com/ezyang
2023-11-01 19:51:11 +00:00
Peter Bell
04024926f4 Use pytree.tree_map_ everywhere (#112417)
Wherever we discard the output of `tree_map`, it's better to call `tree_map_`,
which doesn't unflatten the mapped results and so is a lot cheaper.
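
A small illustration (using the private `torch.utils._pytree` module):

```
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": 3}
seen = []
# tree_map_ applies the function for its side effects and returns the
# original tree, skipping tree_map's re-unflattening of the results.
pytree.tree_map_(seen.append, tree)
assert seen == [1, 2, 3]
```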
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112417
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392, #112393, #112394
2023-10-31 15:57:06 +00:00
Peter Bell
66c32d099a Use pytree.arg_tree_leaves everywhere (#112394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112394
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392, #112393
2023-10-31 15:57:06 +00:00
Peter Bell
bbd5b935e4 Use pytree.tree_leaves everywhere (#112324)
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327, #112323
2023-10-30 03:39:04 +00:00
lezcano
c8a5bb451e Do not import sympy within torch._prims_common (#112034)
This is the first of a few PRs that avoid importing SymPy at import time.
The pitch here is that we (almost!) do not have SymPy on our API, so
this should be feasible.

This should speed up torch imports by a good 15% as per
https://dev-discuss.pytorch.org/t/delving-into-what-happens-when-you-import-torch/1589

In this PR we just move a few global imports into local imports.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112034
Approved by: https://github.com/ezyang
2023-10-26 12:53:25 +00:00
PyTorch MergeBot
98c329b19e Revert "[core ATen IR] Add decompositions for max, min, var_mean (#110906)"
This reverts commit 9606cda64e.

Reverted https://github.com/pytorch/pytorch/pull/110906 on behalf of https://github.com/SS-JIA due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/110906#issuecomment-1757490740))
2023-10-11 11:41:21 +00:00
SS-JIA
9606cda64e [core ATen IR] Add decompositions for max, min, var_mean (#110906)
## Context

Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:

```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```

For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead, as was done for other `refs` implementations previously. cc: @peterbell10 @lezcano

Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
2023-10-11 00:06:24 +00:00
Kazuaki Ishizaki
fde28fdc8c Fix typo under torch/_decomp directory (#110821)
This PR fixes typo of comments in files under `torch/_decomp` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110821
Approved by: https://github.com/Skylion007
2023-10-08 20:33:49 +00:00
Stephen Jia
c2e7a0d689 [core IR] Add decomps for aten.sum and aten.squeeze variants (#110645)
Summary:
## Context

Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims` respectively. Add decompositions for other non-generic variants of these operators to express them using the most generic variant.

Note that to register these decomps, the reference implementation under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10

Test Plan: Github CI + Meta Internal CI

Differential Revision: D49965952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
2023-10-07 04:21:51 +00:00
cdzhan
7cc0020a80 [decomp] Fix different return type in threshold_backward vs. eager (#110689)
due to type promotion with a floating-point scalar in decompositions.py

Fixes part of #100838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110689
Approved by: https://github.com/ezyang
2023-10-06 20:59:58 +00:00
chilli
ceb773b68d Fix #110680 (requires_grad typo in decomp) (#110687)
Fixes https://github.com/pytorch/pytorch/issues/110680
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110687
Approved by: https://github.com/voznesenskym, https://github.com/lezcano
ghstack dependencies: #110501, #110504, #110591, #110668
2023-10-06 10:36:01 +00:00
Jerry Zhang
f2a1b93549 Back out "[quant] Support integer implementations for adaptive_avg_pool2d (#104226)" (#110316)
Summary:
Original commit changeset: acdb5b34e3aa

Original Phabricator Diff: D47321689

Test Plan: opinfo tests in CI

Differential Revision: D49789403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110316
Approved by: https://github.com/kimishpatel
2023-10-03 16:59:23 +00:00
Stephen Jia
ff96f6d04f [core IR][reland] Add split.Tensor and unbind decompositions to core ATen decomp table (#110323)
Summary:
This is a reland of [github PR #110102]( https://github.com/pytorch/pytorch/pull/110102).

The original PR had to be unlanded due to internal CI failures. This diff applies some small fixes to the failing tests to adjust to the new decompositions.

Note that `lift_fresh` will not be decomposed for now, since it was found that [constant propagation looks specifically for `lift_fresh`](13af952f94/torch/fx/experimental/proxy_tensor.py (L381-L386)). Therefore decomposing `lift_fresh` will interfere with constant propagation during export.

Test Plan: Github CI and internal CI

Differential Revision: D49761321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110323
Approved by: https://github.com/jansel
2023-10-03 14:35:04 +00:00
Peter Bell
be3b16daad [decomp] Fix baddbmm decomposition (#109714)
The decomposition is currently registered without the pw_cast_for_opmath
decorator, due to the ordering of decorators being meaningful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714
Approved by: https://github.com/lezcano
2023-09-28 21:23:44 +00:00
PyTorch MergeBot
e0b035c220 Revert "[core IR] Add lift_fresh, split.Tensor, and unbind decompositions to core ATen decomp table (#110102)"
This reverts commit 22e706f768.

Reverted https://github.com/pytorch/pytorch/pull/110102 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/110102#issuecomment-1739856671))
2023-09-28 19:03:25 +00:00
SS-JIA
22e706f768 [core IR] Add lift_fresh, split.Tensor, and unbind decompositions to core ATen decomp table (#110102)
## Context

Add existing decomps for `lift_fresh`, `split.Tensor`, and `unbind` to the core ATen decomposition table. Do not use them in inductor, since Inductor currently lowers these directly.

One note though is that `lift_fresh`'s decomposition has a note saying it's not correct under autograd. However, my understanding is that these decompositions are registered to the `"post_autograd"` decomposition table, meaning autograd wouldn't be a factor. Would like some confirmation that this premise is correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110102
Approved by: https://github.com/jansel
2023-09-28 01:21:45 +00:00
SS-JIA
dec140f1ea [core IR] Add a core decomposition for aten.all (#110093)
## Context

Change the ref implementation of `aten.all` to only use other `torch` operators such that we can use it for the core ATen decomposition table. This will replace the decomposition for `aten.all` that was used specifically by Inductor.
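
The shape of such a decomposition, as a hedged sketch:

```
import torch

def all_decomp(x: torch.Tensor) -> torch.Tensor:
    # all(x) is true iff no element is falsy.
    return torch.logical_not(torch.any(torch.logical_not(x)))

x = torch.tensor([True, True, False])
assert bool(all_decomp(x)) == bool(torch.all(x))
```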

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110093
Approved by: https://github.com/manuelcandales, https://github.com/peterbell10, https://github.com/lezcano
2023-09-27 01:31:41 +00:00
SS-JIA
9928c10e71 [core IR] Add glu as a core decomposition (#110043)
## Context

Add the decomposition for `aten.glu` to the core ATen decomposition table. Don't use it in the Inductor decomposition table since Inductor has a lowering for it.
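
The underlying identity, as a minimal sketch:

```
import torch
import torch.nn.functional as F

def glu_decomp(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # glu splits x in half along `dim` and gates the first half with the
    # sigmoid of the second.
    a, b = x.chunk(2, dim=dim)
    return a * torch.sigmoid(b)

x = torch.randn(4, 6)
assert torch.allclose(glu_decomp(x, dim=1), F.glu(x, dim=1))
```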

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110043
Approved by: https://github.com/peterbell10, https://github.com/lezcano
ghstack dependencies: #110046
2023-09-27 00:23:05 +00:00
SS-JIA
5df8aca994 [core IR] Add a core decomposition for floor_divide (#110046)
## Context

Introduce a core decomposition for `aten.floor_divide` into other `aten` ops, and add it to the core ATen decomposition table.

This replaces the decomposition of `floor_divide` that was used by Inductor. I noticed there was a note on that decomposition

```
# TorchInductor-only decomposition. It should not be taken to core.
# See https://github.com/pytorch/torchdynamo/pull/1120
```

but couldn't discern the reason why this is the case. cc: @lezcano
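
The decomposition amounts to division with a floor rounding mode; a hedged sketch:

```
import torch

def floor_divide_decomp(a, b):
    # floor_divide is division rounded toward negative infinity.
    return torch.div(a, b, rounding_mode="floor")

a, b = torch.tensor([7, -7]), torch.tensor([2, 2])
assert torch.equal(floor_divide_decomp(a, b), torch.floor_divide(a, b))
```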

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110046
Approved by: https://github.com/peterbell10
2023-09-26 08:39:21 +00:00
Mwiza Kunda
5c4b5baf21 Fix python decomps for OpOverloadPackets and add tests (#107707)
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but have aten op overloads that have out parameters. Additionally, Python decompositions may register `OpOverloadPacket`'s so decompositions need to be tested to ensure all `OpOverloads` still function for the `Meta` key (e.g. if a python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the python function needs to support receiving out arguments)

- Add out parameter wrappers to python decomps for aten ops that have out overloads; a rough sketch of the wrapping pattern is shown below
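
A hand-rolled stand-in for the pattern (not the real `out_wrapper`; the helper name and the resize-then-copy out= semantics here are assumptions):

```python
import functools

import torch


def with_out_param(fn):
    @functools.wraps(fn)
    def wrapper(*args, out=None, **kwargs):
        result = fn(*args, **kwargs)
        if out is None:
            return result
        out.resize_as_(result)  # assumed out= semantics: resize, then copy
        return out.copy_(result)
    return wrapper


@with_out_param
def add3(a, b, c):
    return a + b + c
```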

CC. @ezyang @albanD @lezcano

Fixes #107713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
2023-09-25 20:53:30 +00:00
SS-JIA
7de669f2f9 [core IR] Remove trunc decomp and add trunc to core (#109902)
Following up from [this comment](https://github.com/pytorch/pytorch/pull/109319#discussion_r1330803226). Remove the decomposition for `trunc`, and add it as a core operator.

Going forward, provide similar treatment for operators that map cleanly to hardware instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109902
Approved by: https://github.com/peterbell10
2023-09-25 18:18:06 +00:00
Jijie Wei
334ead04a9 Back out "[decomp] Fix baddbmm decomposition (#109714)" (#109855)
Summary:
Original commit changeset: 95c462a380c9

Original Phabricator Diff: D49484954

this diff cause test failure for deterministic ne test see:https://www.internalfb.com/sandcastle/job/18014399565419856/

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test -- --exact 'aps_models/ads/icvr/tests:icvr_fm_e2e_deterministic_ne_test - aps_models.ads.icvr.tests.icvr_fm_e2e_deterministic_ne_test.ICVR_FM_E2EDeterministicNeTest: test_e2e_deterministic_icvr_fm_pt2_fsdp_multi_gpus'

https://www.internalfb.com/intern/testinfra/testrun/16888498605839953

Differential Revision: D49527271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109855
Approved by: https://github.com/yanboliang
2023-09-22 22:01:38 +00:00
Mwiza Kunda
8dedc9dd9b Add meta tests for layer/group/batch norm backward (#109591)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109591
Approved by: https://github.com/ezyang
2023-09-21 18:58:51 +00:00
Mwiza Kunda
6b7b9c796e Fix registering jit decompositions for jvp for out wrapped decomps (#109367)
Python decompositions wrapped by `out_wrapper` need to be unwrapped before compiling with TorchScript since:
- `out_wrapper` extends the decompositions signature with an out parameter, however this `out` parameter is not present in the source code of the original decomposition so the resulting `ScriptFunction` will not have an `out` parameter
- `out_wrapper` is in the `torch._prims_common.wrappers` module, so its `globals()` are different from the globals of the decomposition to be wrapped. This may cause symbol resolution to fail in the TorchScript compiler, since it is compiling the unwrapped decomp's source code rather than the wrapper

The python decomposition for `aten.trace` is wrapped as an example, other decompositions are to be fixed in https://github.com/pytorch/pytorch/pull/107707
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109367
Approved by: https://github.com/lezcano
2023-09-21 16:36:51 +00:00
Peter Bell
6f0cf5a837 [decomp] Decompose unsafe_split{,_with_sizes} into safe variants (#109668)
The "safety" aspect refers to the output not being registered as aliasing the
input, but after AOTAutograd I don't think this distinction matters. However,
we shouldn't use the same decomposition as the safe variant in case the backend
doesn't want to decompose split.
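
Sketched, the decomposition is just a forward to the safe op (simplified; the with_sizes variant forwards the same way):

```python
import torch

aten = torch.ops.aten


def unsafe_split(t: torch.Tensor, split_size: int, dim: int = 0):
    # Forward to the safe op rather than reusing split's own decomposition,
    # so a backend that keeps aten.split intact still sees it.
    return aten.split.Tensor(t, split_size, dim)
```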

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109668
Approved by: https://github.com/lezcano
ghstack dependencies: #109667
2023-09-20 18:45:56 +00:00
Peter Bell
9e629dd73c [decomp] Add all std and std_mean overloads to core decompositions (#109667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109667
Approved by: https://github.com/lezcano
2023-09-20 18:45:56 +00:00
Peter Bell
36a8105f54 [decomp] Fix baddbmm decomposition (#109714)
The decomposition is currently registered without the pw_cast_for_opmath
decorator, due to the ordering of decorators being meaningful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714
Approved by: https://github.com/lezcano
2023-09-20 18:40:21 +00:00
Salil Desai
40b2c796dc [Decomposition] baddbmm (#108534)
Summary:
Moving decomposition of baddbmm from _inductor/decomposition.py and include it in core_aten_decompositions

ff38c0e2f9/torch/_inductor/decomposition.py (L203)

Test Plan: Phabricator + OSS Tests

Differential Revision: D48871741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108534
Approved by: https://github.com/SherlockNoMad
2023-09-20 12:49:32 +00:00
Salil Desai
d0cc623192 [Decomposition] _unsafe_view (#108713)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[9d5eabd7b213d1a356d4e7bb400355d574ea924b]/fbcode/caffe2/torch/_decomp/decompositions.py?lines=3091

Differential Revision: D48619079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108713
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-19 13:37:35 +00:00
Salil Desai
2e721aab98 [Decomposition] Trunc (#109319)
Summary:
Add Decomp for Trunc and add it to core_aten_decompositions

Differential Revision: D49042033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109319
Approved by: https://github.com/SherlockNoMad
2023-09-19 13:30:13 +00:00
Salil Desai
ae66d0b3bf [Decomposition] clamp_max (#108718)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[abda43a5a268e83fef6d62b49531a390ce915ad2]/fbcode/caffe2/torch/_refs/__init__.py?lines=1855

Differential Revision: D48880026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108718
Approved by: https://github.com/SherlockNoMad
2023-09-19 13:25:35 +00:00
Salil Desai
fc47ba2794 [Decomposition] clamp_min (#108717)
Summary:
Decomp already exists so just add it to core_aten_decompositions

https://www.internalfb.com/code/fbsource/[abda43a5a268e83fef6d62b49531a390ce915ad2]/fbcode/caffe2/torch/_refs/__init__.py?lines=1846

Differential Revision: D48880080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108717
Approved by: https://github.com/SherlockNoMad
2023-09-18 12:43:58 +00:00
Salil Desai
a6d4cca7c0 [Decomposition] unsafe_split.Tensor (#108544)
Summary:
Include decomp in core_aten_decompositions

Decomp already exists

https://www.internalfb.com/code/fbsource/[03ff511cad587fc27ed8fd6a54b87845246e8e0c]/fbcode/caffe2/torch/_decomp/decompositions.py?lines=1209

Test Plan: OSS + Phabricator Tests

Differential Revision: D48940445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108544
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-18 12:43:07 +00:00
Salil Desai
af93b29c5e [Decomposition] std.correction (#108733)
Summary:
Include decomp in core_aten_decompositions

Decomp:
https://www.internalfb.com/code/fbsource/[e69bf00ff87a55c9a30bd7905881661ff05fa211]/fbcode/caffe2/torch/_refs/__init__.py?lines=2398

Differential Revision: D48940402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108733
Approved by: https://github.com/larryliu0820, https://github.com/SherlockNoMad
2023-09-18 11:38:23 +00:00
Jez Ng
db48bc80d9 Check index size during decomp of index_add (#108826)
This partially fixes the `test_index_add_correctness` test (#108181)
when run under inductor: it causes an exception to be raised [here][1]
as expected.

The test as a whole still cannot be made to pass under inductor because
the [last assert][2] still fails, likely due to #108798.

[1]: dec2b267d4/test/test_torch.py (L6049)
[2]: dec2b267d4/test/test_torch.py (L6051)
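
Roughly, the added guard looks like this (helper name hypothetical):

```python
import torch


def check_index_add_args(x: torch.Tensor, dim: int, index: torch.Tensor, source: torch.Tensor):
    # index must supply one entry per slice of `source` along `dim`
    torch._check(
        index.numel() == source.size(dim),
        lambda: f"index length {index.numel()} != source.size({dim}) = {source.size(dim)}",
    )
```
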
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108826
Approved by: https://github.com/eellison
2023-09-13 13:06:26 +00:00
Ken Jin
c458fa0d35 Decompose/add reference for view_as_complex (#108005)
Aten source: d4a99631dd/aten/src/ATen/native/ComplexHelper.h (L78)

Documentation reference:
https://pytorch.org/docs/stable/generated/torch.view_as_complex.html

Note: this adds a new primitive `view_of_dtype`, which is trivially implemented, as its meta function is already implemented elsewhere.

Finally, this is not registered as a decomposition (yet), because TorchInductor does not yet support complex types. It should be added once we do.
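
For context, the aliasing behavior the ref must preserve (standard `view_as_complex` semantics, runnable in eager):

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
z = torch.view_as_complex(x)  # tensor([1.+2.j, 3.+4.j])

# It is a view: z shares storage with x, so the dtype reinterpretation must
# itself be a view -- hence the new view_of_dtype primitive.
z.real.fill_(0.0)
print(x[:, 0])  # tensor([0., 0.]) -- the write is visible through x
```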

Closes https://github.com/pytorch/pytorch/issues/108020 as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108005
Approved by: https://github.com/peterbell10, https://github.com/ezyang
2023-09-07 23:49:20 +00:00
Edward Z. Yang
9f37aec964 Add torch._check_is_size (#108685)
Check comments for what it does.  The key distinction is that if
you feed it an unbacked SymInt, we will also apply >= 2 assumption
at compile time.
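
A minimal usage sketch (the function is hypothetical; under symbolic tracing the `.item()` call yields the unbacked SymInt mentioned above, and the check also feeds the compile-time >= 2 assumption):

```python
import torch


def take_prefix(x: torch.Tensor, n_tensor: torch.Tensor) -> torch.Tensor:
    n = n_tensor.item()      # an unbacked SymInt under symbolic tracing
    torch._check_is_size(n)  # records that n is a valid (non-negative) size
    return x[:n]
```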

This will get exercised when I reland
https://github.com/pytorch/pytorch/pull/107788

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108685
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-09-07 12:48:39 +00:00
Sam Larsen
27fe45eaf6 [inductor][easy] Enable Mypy Checking for torch/_inductor/decomposition.py (#108682)
Summary: Looks like one simple type mismatch between `get_decompositions()` and `remove_decompositions()`

Test Plan: `lintrunner torch/_inductor/decomposition.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108682
Approved by: https://github.com/eellison
2023-09-07 00:48:55 +00:00
Huy Do
5a4fe05a15 Revert "Force synced KJT to trace unbacked SymInt (#107788)" (#108684)
This reverts commit 3b92ef814d.  So let's manually revert it instead.

(Not sure why the bot doesn't work on https://github.com/pytorch/pytorch/pull/107788)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108684
Approved by: https://github.com/ezyang
2023-09-06 19:15:45 +00:00
Kimish Patel
ebed490c2f [sdpa decomp] change sdpa decomp to be consistent with flash attention (#108608)
Summary: See the comment in the code for the reasons behind the change

Test Plan:
buck2 test executorch/examples/export/test:test_export --
test_vit_export_to_executorch

Differential Revision: D48992180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108608
Approved by: https://github.com/larryliu0820
2023-09-06 15:34:03 +00:00
Edward Z. Yang
3b92ef814d Force synced KJT to trace unbacked SymInt (#107788)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107788
Approved by: https://github.com/voznesenskym
2023-09-06 03:18:26 +00:00
Kimish Patel
cc50e654d4 [aten decomp] Update sdpa decom (#108371)
Summary:
The earlier decomp routed the _flash* variant to the _math variant, and this
resulted in failures during torch.export, for a reason that I couldn't trace.

However, it seems that we should really have a decomp for
scaled_dot_product_attention, instead of
scaled_dot_product_flash_attention. Right?

This diff adds that. Plus, it adds a test to check whether a model exported
via the two-stage export has decomposed the op. This test needs improvement
to figure out what the core ATen opset is and to check for anything that is
not inside it.
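
For orientation, the math path being decomposed into, sketched without masking, dropout, or an explicit scale argument (an assumed simplification of `_scaled_dot_product_attention_math`):

```python
import math

import torch


def sdpa_math(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    scale = 1.0 / math.sqrt(q.size(-1))
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```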

Test Plan:
test_model_exports_to_core_aten

Differential Revision: [D48917461](https://our.internmc.facebook.com/intern/diff/D48917461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108371
Approved by: https://github.com/larryliu0820
2023-09-03 15:17:08 +00:00
lezcano
239ee76177 Add refs/decomps for dot/vdot (#108194)
Follow-up on https://github.com/pytorch/pytorch/issues/108127#issuecomment-1698142427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108194
Approved by: https://github.com/peterbell10
ghstack dependencies: #108188
2023-08-31 15:30:23 +00:00
rzou
0e4752bafc Allow registering decomps for HigherOrderOp; add decomp for out_dtype (#108080)
We allow registering decomps for HigherOrderOp via the existing decomp
mechanisms:
- I refactored those APIs to accept torch._ops.OperatorBase, which is the base
  class for torch.ops.HigherOrderOperator and torch.ops.OpOverload
- HigherOrderOps must directly call maybe_handle_decomp in their
  ProxyTorchDispatchMode handling in order to resolve decompositions. We
  can change this in the future so that they do not need to do this.

Next, we add an inductor decomp for out_dtype. This decomp shouldn't be
generally available because we want to preserve out_dtype to the backend
for other use cases (i.e. executorch).

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108080
Approved by: https://github.com/HDCharles
2023-08-31 03:15:38 +00:00
chilli
39130c7433 Add reinplacing pass for scatters + incremental fake tensor updating (#106192)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106192
Approved by: https://github.com/jansel, https://github.com/eellison
2023-08-30 20:41:37 +00:00
Mengwei Liu
0fb1c05c5a [pytorch] Add decomp rule for scaled_dot_product_attention (#108180)
`scaled_dot_product_attention` used to be decomposed in pre-autograd, given that it calls `_scaled_dot_product_attention_math` and `_scaled_dot_product_attention_math` only has a `CompositeImplicitAutograd` kernel. As a result it's decomposed into ops with finer granularity.

However recent PRs (#103826 #105131) added new logic in `scaled_dot_product_attention` and now it calls `_scaled_dot_product_flash_attention` which contains a CPU kernel. This results in `_scaled_dot_product_flash_attention` showing up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still being decomposed the same way as before, i.e., going through `_scaled_dot_product_attention_math`. Notice that this decomp rule should be excluded by inductor.

Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad
2023-08-30 15:52:08 +00:00
vfdev-5
0cfc5899f9 [inductor] Improved grid_sampler_2d decomposition for cuda (#104710)
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to https://github.com/pytorch/pytorch/issues/104296

Perfs:
- speed-up on cuda (~x5) and cpu (~x2) for bicubic mode

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         38.010 (+-0.118)        |          51.466 (+-1.257)          |             47.867 (+-0.124)            |     0.930 (+-0.000)      |           33.654 (+-0.411)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         35.532 (+-0.236)        |          52.189 (+-0.093)          |             58.979 (+-0.206)            |     1.130 (+-0.000)      |           32.543 (+-0.198)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         38.187 (+-0.112)        |          47.892 (+-0.117)          |             45.833 (+-0.081)            |     0.957 (+-0.000)      |           33.752 (+-0.116)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         36.708 (+-0.244)        |          51.680 (+-0.104)          |             58.360 (+-0.108)            |     1.129 (+-0.000)      |           32.576 (+-0.751)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         24.201 (+-0.088)        |          27.451 (+-0.059)          |             27.937 (+-0.081)            |     1.018 (+-0.000)      |           24.367 (+-0.074)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         19.266 (+-0.105)        |          26.070 (+-0.085)          |             26.092 (+-0.054)            |     1.001 (+-0.000)      |           20.144 (+-0.064)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         24.293 (+-0.125)        |          26.085 (+-0.064)          |             26.575 (+-0.061)            |     1.019 (+-0.000)      |           24.515 (+-0.095)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         19.440 (+-0.075)        |          25.252 (+-0.059)          |             25.259 (+-0.051)            |     1.000 (+-0.000)      |           19.770 (+-0.070)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        114.900 (+-0.508)        |         113.416 (+-1.271)          |            248.679 (+-1.431)            |     2.193 (+-0.000)      |          114.609 (+-0.515)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        115.973 (+-0.555)        |         124.711 (+-1.596)          |            282.187 (+-2.418)            |     2.263 (+-0.000)      |          115.368 (+-0.652)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        111.730 (+-0.562)        |         110.914 (+-0.865)          |            253.899 (+-2.226)            |     2.289 (+-0.000)      |          111.285 (+-1.226)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.859 (+-0.487)        |         131.696 (+-1.298)          |            294.124 (+-1.963)            |     2.233 (+-0.000)      |          110.910 (+-0.969)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |        228.811 (+-0.037)        |          92.990 (+-0.446)          |             92.648 (+-0.286)            |     0.996 (+-0.000)      |          228.274 (+-0.067)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |        222.107 (+-0.076)        |          93.247 (+-0.387)          |             92.528 (+-0.423)            |     0.992 (+-0.000)      |          221.922 (+-0.297)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        235.654 (+-0.055)        |          75.781 (+-0.566)          |            115.865 (+-0.419)            |     1.529 (+-0.000)      |          236.032 (+-0.111)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        226.752 (+-0.088)        |          76.312 (+-0.328)          |            116.468 (+-0.477)            |     1.526 (+-0.000)      |          226.950 (+-0.027)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        225.540 (+-0.013)        |          75.638 (+-0.341)          |             72.621 (+-0.292)            |     0.960 (+-0.000)      |          225.937 (+-0.017)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        217.425 (+-0.024)        |          75.484 (+-0.545)          |             73.518 (+-0.296)            |     0.974 (+-0.000)      |          217.793 (+-0.008)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        231.474 (+-0.020)        |          75.972 (+-0.339)          |             73.030 (+-0.387)            |     0.961 (+-0.000)      |          231.991 (+-0.184)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        223.408 (+-0.016)        |          75.622 (+-0.279)          |             73.542 (+-0.336)            |     0.973 (+-0.000)      |          223.893 (+-0.021)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        319.382 (+-0.023)        |         149.060 (+-0.190)          |            772.116 (+-0.266)            |     5.180 (+-0.000)      |          320.549 (+-0.387)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        319.987 (+-0.134)        |         154.443 (+-0.014)          |            797.651 (+-0.232)            |     5.165 (+-0.000)      |          320.665 (+-0.397)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        326.138 (+-0.439)        |         149.092 (+-0.036)          |            772.508 (+-0.259)            |     5.181 (+-0.000)      |          325.751 (+-0.398)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        326.024 (+-0.118)        |         154.452 (+-0.209)          |            797.756 (+-0.229)            |     5.165 (+-0.000)      |          326.870 (+-0.372)

Times are in microseconds (us).

```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230828-134459-affine-grid-sampler-PR-vs-Nightly-speedup.md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104710
Approved by: https://github.com/lezcano
2023-08-29 05:54:24 +00:00
Sam Larsen
20f3808aa2 Implement decomposition for aten.tensor_split.tensor_indices_or_sections (#107251)
Summary: Before this change, the tensor_indices_or_sections variant of aten.tensor_split causes a `RuntimeError: The tensor has a non-zero number of elements` because that operation needs to introspect data. Decomposing into one of the other two tensor_split variants fixes the problem.

Test Plan:
Enabled tensor_split tests in test/inductor/test_torchinductor_opinfo.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107251
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-08-28 17:01:23 +00:00
ssjia
86f9fec3ac Avoid decomposing _unsafe_index in Inductor (#107882)
`_unsafe_index` was previously added to the core ATen decomp table in https://github.com/pytorch/pytorch/pull/106814, but this has performance ramifications for Inductor. Therefore, this diff removes it from the decomposition table used by Inductor.

Differential Revision: [D48649210](https://our.internmc.facebook.com/intern/diff/D48649210/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107882
Approved by: https://github.com/SherlockNoMad
2023-08-25 04:51:53 +00:00
Vishwa Raj Singh
35de780aa6 Fix Inplace tensor update on transpose (#104689)
Fixes https://github.com/pytorch/pytorch/issues/103650

- To align with the HPU device backend architecture, ensure all non-view ops return contiguous fake tensor outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104689
Approved by: https://github.com/ezyang
2023-08-24 16:58:50 +00:00
Andrew Or
64d5851b1f make python decomp for native_batch_norm CompositeImplicitAutograd, remove native_batch_norm from core aten opset (#107791)
Summary:
(From Brian Hirsh)

Description copied from what I put in a comment in this PR: https://github.com/pytorch/pytorch/pull/106329

So, the slightly-contentious idea behind this PR is that lower in the stack, I updated torch._decomps.get_decomps() to check not only the decomp table to see if a given op has a decomposition available, but to also check the dispatcher for any decomps registered to the CompositeImplicitAutograd key (link: https://github.com/pytorch/pytorch/pull/105865/files#diff-7008e894af47c01ee6b8eb94996363bd6c5a43a061a2c13a472a2f8a9242ad43R190)

There's one problem though: we don't actually make any hard guarantees about whether a given key in the dispatcher does or does not point to a decomposition. We do rely pretty heavily, however, on the fact that everything registered to the CompositeImplicitAutograd key is in fact a decomposition into other ops.

QAT would like this API to faithfully return "the set of all decomps that would have run if we had traced through the dispatcher". However, native_batch_norm is an example of an op that has a pre-autograd decomp registered to it (through op.py_impl()), but the decomp is registered directly to the Autograd key instead of being registered to the CompositeImplicitAutograd key.

If we want to provide a guarantee to QAT that they can programmatically access all decomps that would have run during tracing, then we need to make sure that every decomp we register to the Autograd key is also registered to the CompositeImplicitAutograd key.

This might sound kind of painful (since it requires auditing), but I think in practice this basically only applies to native_batch_norm.

Test Plan: python test/test_decomp.py

Differential Revision: D48607575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107791
Approved by: https://github.com/jerryzh168, https://github.com/SherlockNoMad
2023-08-24 15:19:07 +00:00
Sherlock Huang
ee4b99cc3a Decomp for aten.dropout (#106274)
When exporting dropout with cpu tensor, we get following graph module
```
    class GraphModule(torch.nn.Module):
        def forward(self, arg0_1: f32[512, 10]):
            empty_memory_format: f32[512, 10] = torch.ops.aten.empty.memory_format([512, 10], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False, memory_format = torch.contiguous_format)
            bernoulli_p: f32[512, 10] = torch.ops.aten.bernoulli.p(empty_memory_format, 0.9);  empty_memory_format = None
            div_scalar: f32[512, 10] = torch.ops.aten.div.Scalar(bernoulli_p, 0.9);  bernoulli_p = None
            mul_tensor: f32[512, 10] = torch.ops.aten.mul.Tensor(arg0_1, div_scalar);  arg0_1 = div_scalar = None
            return (mul_tensor,)
```

In addition, if we export with eval() mode, we will have an empty graph.

However, when exporting with cuda tensor, we got
```
    class GraphModule(torch.nn.Module):
        def forward(self, arg0_1: f32[512, 10]):
            native_dropout_default = torch.ops.aten.native_dropout.default(arg0_1, 0.1, True);  arg0_1 = None
            getitem: f32[512, 10] = native_dropout_default[0];  native_dropout_default = None
            return (getitem,)
```
and exporting under eval() mode will still produce a dropout node in the graph.

This PR makes exporting with a CPU tensor also produce aten.native_dropout.
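
Hand-written, the CPU graph above amounts to the following sketch (with p and train as in `aten.dropout`):

```python
import torch


def dropout_decomp(x: torch.Tensor, p: float, train: bool) -> torch.Tensor:
    if not train or p == 0.0:
        return x
    keep = 1.0 - p
    # keep-mask drawn with probability (1 - p), rescaled so E[out] == x
    mask = torch.empty_like(x).bernoulli_(keep)
    return x * mask / keep
```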

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106274
Approved by: https://github.com/ezyang
2023-08-23 21:12:37 +00:00
Edward Z. Yang
5673c0874c Use expect_true to make split with unbacked sizes work. (#106788)
This pattern shows up in torchrec KeyedJaggedTensor.  Most
of the change in this PR is mechanical: whenever we failed
an unbacked symint test due to just error checking, replace the
conditional with something that calls expect_true (e.g.,
torch._check or TORCH_SYM_CHECK).

Some of the changes are a bit more nuanced, I've commented on the PR
accordingly.
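
A toy sketch of the mechanical change (names hypothetical):

```python
import torch


def head(x: torch.Tensor, n_tensor: torch.Tensor) -> torch.Tensor:
    n = n_tensor.item()  # unbacked SymInt under tracing
    # A plain `if n > x.size(0): raise ...` cannot be evaluated on an unbacked
    # value; an expect_true-style check records the assumption instead:
    torch._check(n <= x.size(0), lambda: "n is out of range")
    return x[:n]
```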

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106788
Approved by: https://github.com/lezcano
ghstack dependencies: #106720
2023-08-15 20:31:30 +00:00
lezcano
2c5f96deac [Inductor] Make softshrink composite implicit (#107052)
The backward is pretty much equivalent to the one we had written

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107052
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038, #107039, #107051
2023-08-14 21:01:50 +00:00
lezcano
3b1254e800 Make hardshrink's decomp composite implicit (#107039)
The generated code is the same
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107039
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038
2023-08-14 21:01:50 +00:00
Sam Larsen
e165938853 Implement decomposition for aten.rrelu_with_noise (#106812)
Test Plan:
* Primarily, added new test in test/test_decomp.py
* Updated existing tests, e.g., to NOT expect failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106812
Approved by: https://github.com/eellison
2023-08-11 19:18:29 +00:00
Stephen Jia
8c8477e55a Add _unsafe_index decomp (#106814)
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.

Also add it to the list of core decompositions.
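
The redirect, sketched as a free function (the real decomposition goes through the registration machinery):

```python
import torch

aten = torch.ops.aten


def _unsafe_index(x, indices):
    # _unsafe_index has aten.index semantics minus the bounds checks, so a
    # plain forward (which re-adds the checks) is a valid decomposition.
    return aten.index(x, indices)
```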

Test Plan: contbuild and OSS CI (similar to D40075277)

Differential Revision: D48163393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
2023-08-10 23:23:37 +00:00
vfdev-5
35a1913370 [inductor] Added affine_grid_generator decomposition (#104709)
Description:
- Added affine_grid_generator decomposition

Related to https://github.com/pytorch/pytorch/issues/104296

Fixes https://github.com/pytorch/pytorch/issues/105565

Perfs:
- speed-up on cuda with bilinear and nearest modes

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git3ed904e) PR-afgg" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cpu ------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           7.467 (+-0.036)            |             11.905 (+-0.276)            |             13.391 (+-0.051)            |     1.125 (+-0.000)      |           7.343 (+-0.036)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           7.722 (+-0.168)            |             14.371 (+-0.035)            |             15.899 (+-0.038)            |     1.106 (+-0.000)      |           7.870 (+-0.043)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |           7.710 (+-0.051)            |             11.354 (+-0.053)            |             13.376 (+-0.045)            |     1.178 (+-0.000)      |           7.698 (+-0.061)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |           7.870 (+-0.050)            |             13.744 (+-0.237)            |             15.206 (+-0.102)            |     1.106 (+-0.000)      |           7.912 (+-0.039)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           4.738 (+-0.015)            |             4.508 (+-0.005)             |             6.566 (+-0.027)             |     1.456 (+-0.000)      |           4.630 (+-0.022)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           4.391 (+-0.010)            |             4.860 (+-0.390)             |             6.438 (+-0.047)             |     1.325 (+-0.000)      |           4.458 (+-0.010)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |           4.279 (+-0.008)            |             4.127 (+-0.010)             |             6.598 (+-0.709)             |     1.599 (+-0.000)      |           5.064 (+-0.025)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |           4.537 (+-0.010)            |             4.593 (+-0.006)             |             6.365 (+-0.104)             |     1.386 (+-0.000)      |           4.480 (+-0.011)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           26.411 (+-0.066)           |             62.275 (+-0.436)            |             64.486 (+-0.353)            |     1.035 (+-0.000)      |           26.210 (+-0.110)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           26.457 (+-0.096)           |             72.887 (+-0.247)            |             74.207 (+-0.337)            |     1.018 (+-0.000)      |           25.995 (+-0.120)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |           26.457 (+-0.086)           |             64.110 (+-0.233)            |             66.340 (+-0.406)            |     1.035 (+-0.000)      |           26.145 (+-0.085)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |           26.536 (+-0.094)           |             73.742 (+-0.483)            |             71.946 (+-0.460)            |     0.976 (+-0.000)      |           26.457 (+-0.166)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cuda -----------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           91.971 (+-0.253)           |             90.570 (+-0.193)            |            137.206 (+-0.214)            |     1.515 (+-0.000)      |           84.280 (+-0.241)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           91.893 (+-0.361)           |             89.866 (+-0.170)            |            136.678 (+-0.471)            |     1.521 (+-0.000)      |           84.573 (+-0.214)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |          116.967 (+-0.481)           |            110.468 (+-0.326)            |            223.770 (+-0.334)            |     2.026 (+-0.000)      |          108.098 (+-0.392)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |          117.563 (+-0.546)           |            111.438 (+-0.212)            |            223.101 (+-0.350)            |     2.002 (+-0.000)      |          108.225 (+-0.395)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           80.706 (+-0.289)           |             70.525 (+-0.204)            |            143.697 (+-0.311)            |     2.038 (+-0.000)      |           74.485 (+-0.258)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           80.955 (+-0.208)           |             69.986 (+-0.250)            |            143.658 (+-0.244)            |     2.053 (+-0.000)      |           74.163 (+-0.238)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |          117.576 (+-0.435)           |             71.179 (+-0.412)            |            178.515 (+-0.539)            |     2.508 (+-0.000)      |          108.394 (+-0.473)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |          117.441 (+-0.205)           |             70.313 (+-0.170)            |            178.664 (+-0.555)            |     2.541 (+-0.000)      |          108.098 (+-0.416)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           92.962 (+-0.509)           |            1740.964 (+-0.597)           |            1785.401 (+-0.369)           |     1.026 (+-0.000)      |           92.638 (+-0.539)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           92.928 (+-0.493)           |            1401.146 (+-0.732)           |            1453.229 (+-0.628)           |     1.037 (+-0.000)      |           92.458 (+-0.428)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |          118.152 (+-0.442)           |            1740.644 (+-0.480)           |            1793.475 (+-0.458)           |     1.030 (+-0.000)      |          107.962 (+-0.548)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |          118.182 (+-0.425)           |            1400.621 (+-0.624)           |            1461.796 (+-0.630)           |     1.044 (+-0.000)      |          107.894 (+-0.994)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230801-220216-affine-grid-sampler-PR-afgg-vs-Nightly-speedup.md), [script](https://github.com/vfdev-5/pth-inductor-dev/blob/master/perf_affine_grid_sampler.py)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104709
Approved by: https://github.com/lezcano
2023-08-10 09:52:48 +00:00
Andy Rock
aa1b2f16c5 fix upsample_nearest decompositions for uint8 tensors (#106675)
Fixes #106674.

This PR aligns the implementation of `_compute_upsample_nearest_indices` with `UpSampleKernel.cpp`: 68cb854d73/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L1388-L1393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106675
Approved by: https://github.com/albanD
2023-08-07 01:52:41 +00:00
Kshiteej K
a899333ffc fix: nll_loss batch rule with negative ignore_idx (#106118)
We use python decompositions instead of writing our own for batching rules.

Fixes https://github.com/pytorch/pytorch/issues/105736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106118
Approved by: https://github.com/lezcano, https://github.com/zou3519
2023-08-04 07:43:02 +00:00
chunyuan
cb6c3cbc91 inductor: enable weight prepack for LSTM (#103071)
- Enabled LSTM weight prepack in inductor.
- Added an mkldnn decomposition for LSTM which won't change for different `seq_lens`. With the previous decomposition, in the dynamic-shapes use case where `seq_lens` changes, the graph would be different.
- Extended several inductor utility functions to support `List(Tensor`) as input. Previously those functions only supported `Tensor` input.

**Update 2023-07-26:**
- https://github.com/pytorch/pytorch/pull/103851 moved CPU weight packing to after AOTAutograd. The support in this PR was fixed to follow the same approach (mainly in 3b207f7f1c (diff-6dffed1ade0ba3e887f9a4eafa3bfcec267ab2365b8adcb91bd391f49b3fd2e3)).
LSTM is decomposed in `aten.mkldnn_rnn_layer` by layer and by direction. The weight prepack is done at the `mkldnn_rnn_layer` level.
- Add a fix in rnn `__get_state__` function in case we need to recompile an `LSTM` module.
When compiling the module, the weights tensors which are the `named_parameters` of the module are converted to `functional_tensor` here:
76fb72e24a/torch/nn/utils/stateless.py (L125-L128)
The forward function of LSTM will be called:
76fb72e24a/torch/_functorch/aot_autograd.py (L3379-L3381)
In the forward function, the `_flat_weights` are updated to be the same as the weights, thus becoming `functional_tensor`:
76fb72e24a/torch/nn/modules/rnn.py (L775-L778)
The weights tensors are converted back to the original tensors (which are not `functional_tensor` anymore) before exiting the `_reparametrize_module` context here:
76fb72e24a/torch/nn/utils/stateless.py (L130-L142)
But since `_flat_weights` is not in the `named_parameters` of the module, it's still `functional_tensor` ([link of the parameters that will be converted to functional and reverted back](76fb72e24a/torch/_functorch/aot_autograd.py (L3695-L3698))).
At this moment, if we need to recompile the model, `deepcopy` will be called:
76fb72e24a/torch/_dynamo/utils.py (L915-L917)
And it will report `UnImplemented` since we have a `functional_tensor` (`_flat_weights`) and will trigger a graph break, which is not what we expect:
76fb72e24a/torch/_subclasses/meta_utils.py (L514)
Added a fix in `__get_state__` to update `_flat_weights` whenever the weights have changed, resolving this issue. The fix is covered in the `test_lstm_packed` UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103071
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-28 13:54:32 +00:00
lezcano
36ae359655 Update matmul decomp to match eager (#105850)
The decomposition was not updated after https://github.com/pytorch/pytorch/pull/95261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105850
Approved by: https://github.com/Chillee
2023-07-26 09:24:51 +00:00
Nikita Karetnikov
45e4706aff [pt2] add decomps for multilabel_margin_loss_forward ops (#105302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105302
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unusued loop values in python dictionary iteration. Automated fix from Ruff master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
angelayi
fed8d3608d Update core aten decomp table (#105673)
Updated the decomposition table based on the existing [Core ATen IR](https://pytorch.org/docs/stable/ir.html) list, and moved rest of decompositions to inductor's decomposition table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105673
Approved by: https://github.com/SherlockNoMad
2023-07-21 02:45:37 +00:00
Yanbo Liang
8daed86e4e [Inductor] aten.dist decomposition (#105586)
Fixes #105557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105586
Approved by: https://github.com/desertfire, https://github.com/Chillee
2023-07-20 06:42:44 +00:00
Justin Chu
8a688277a2 [BE] Enable ruff's UP rules and autoformat dynamo / functorch and refs (#105432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105432
Approved by: https://github.com/ezyang
2023-07-19 13:48:44 +00:00
QSHLGZ
07108ff1e8 Fix typos under _decomp directory (#105210)
Fix typos under _decomp directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105210
Approved by: https://github.com/ezyang, https://github.com/Neilblaze
2023-07-17 11:41:30 +00:00
Peter Bell
9adfaf8807 [inductor] Add lowering for aten.unfold (#105165)
The decomposition for unfold uses `as_strided` which forces the input to be
realized. Instead, this implements it as a `GenericView` with reindexing
which removes the need to realize, though it does call `mark_reuse` in case
the input computation is expensive and the windows overlap.
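
For reference, `Tensor.unfold`'s overlapping-window semantics (runnable in eager):

```python
import torch

x = torch.arange(6)
print(x.unfold(0, 3, 1))
# tensor([[0, 1, 2],
#         [1, 2, 3],
#         [2, 3, 4],
#         [3, 4, 5]])
# Overlapping windows re-read the same elements; mark_reuse flags that the
# producer may be recomputed several times rather than forcing an
# as_strided buffer to be realized.
```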

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105165
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-16 13:09:23 +00:00
William Wen
5cd861fcf7 Add empty/empty_like to core aten decomps (#105158)
Fixes https://github.com/pytorch/pytorch/issues/104871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105158
Approved by: https://github.com/SherlockNoMad
2023-07-15 18:48:55 +00:00
Nikita Karetnikov
7e72126487 [pt2] add decomps for multi_margin_loss ops (#104578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104578
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-07-14 21:16:09 +00:00
Adnan Akhundov
4911b80b8e [inductor] addmm + ReLU / GELU fusion pass (#104132)
Summary:

Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only when `max_autotune_gemm=False` and the activation is fusible.

Test Plan:

$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v

(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
  warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok

----------------------------------------------------------------------
Ran 1 test in 13.415s

OK

Reviewers: @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
2023-07-10 16:44:14 +00:00
Jerry Zhang
1a661639f7 [quant] Support integer implementations for adaptive_avg_pool2d (#104226)
Summary:
This is needed for representing quantized model in pt2 export quantization flow

Test Plan:
tested by opinfo, python test/test_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104226
Approved by: https://github.com/jgong5, https://github.com/andrewor14
2023-07-07 19:36:31 +00:00
XiaobingSuper
d3589c9456 reduce computation of batch_norm when weight or bias is none (#104616)
For the batch_norm decomposition, if weight or bias is None, we can skip some computations for better performance.
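
A sketch of the skip, assuming an NCHW-style channel dimension (helper name hypothetical):

```python
import torch


def bn_affine(normed: torch.Tensor, weight=None, bias=None) -> torch.Tensor:
    shape = [1, -1] + [1] * (normed.ndim - 2)  # broadcast over the channel dim
    if weight is not None:  # skip the multiply entirely when weight is None
        normed = normed * weight.reshape(shape)
    if bias is not None:    # likewise for the add
        normed = normed + bias.reshape(shape)
    return normed
```
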
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104616
Approved by: https://github.com/lezcano, https://github.com/desertfire, https://github.com/jgong5
2023-07-06 00:47:41 +00:00
Peter Bell
5c580a9846 [decomp] Add test tracking core ATen operators (#104262)
This adds an expect-test that finds the set of core ATen operators by
subtracting the operators with decomposition in core_aten_decompositions from the
set of all operators that have decompositions and could be decomposed.

This is useful because if you add a new decomposition but forget to add it to
the list of core decompositions, it will appear in the PR diff.

Also, by going through this list I have identified some operators where the
functional variant is decomposed, but not the inplace variant which must be an
oversight.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104262
Approved by: https://github.com/lezcano
2023-07-04 16:41:44 +00:00
David Berard
0b62aca726 Don't decompose aten.bucketize (#104396)
torch.bucketize takes a tensor of values, and a "boundaries" tensor, which is a sorted list of values that represent buckets. It returns the bucket that each value lies in. E.g. if values = [1, 5, 3, 6] and boundaries=[0, 2, 4, 6, 8], the output will be [1, 3, 2, 4].
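
A runnable check of the two boundary conventions; the numbers quoted above match `right=True`, while the default `right=False` treats the right edge as inclusive:

```python
import torch

values = torch.tensor([1, 5, 3, 6])
boundaries = torch.tensor([0, 2, 4, 6, 8])

print(torch.bucketize(values, boundaries, right=True))  # tensor([1, 3, 2, 4])
# Default right=False uses boundaries[i-1] < v <= boundaries[i], so the
# value 6 lands in bucket 3 instead:
print(torch.bucketize(values, boundaries))              # tensor([1, 3, 2, 3])
```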

The current decomposition of this op doesn't work well with dynamic shapes. It performs a binary search, which bakes in the number of binary-search iterations and requires recompiling (I don't completely understand why/where this happens). I'm not sure whether there's a good way to write a decomposition for this op that will work with dynamic shapes.

Use case: this op is very similar to some operations needed by jagged tensors. As a first step, I want to add a lowering for aten.bucketize and make use of opinfos. #104007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104396
Approved by: https://github.com/Chillee
2023-06-30 05:05:08 +00:00
Peter Bell
8b418f197c [decomp] Add decomposition for torch.renorm (#103858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103858
Approved by: https://github.com/ezyang, https://github.com/nkaretnikov
2023-06-21 20:57:43 +00:00
Peter Bell
591981c5e2 [inductor] Lower diagonal, diagonal_copy and diagonal_scatter (#103755)
Currently these are decomposed into `as_strided`, which forces a buffer to be
realized. Instead, this lowers them into a native inductor view node and so
doesn't require any buffers to be realized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103755
Approved by: https://github.com/jansel
2023-06-21 20:16:24 +00:00
Peter Bell
a61096fb94 [decomp] Decompose logaddexp2 (#103765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103765
Approved by: https://github.com/Chillee
2023-06-21 20:16:24 +00:00
Peter Bell
61cd605813 [decomp] Don't call .item() in aten.fill.Tensor decomp (#103880)
Currently calling the fill.Tensor overload under `torch.compile` results in a
`DataDependentOutputException` due to the `.item()` call. This instead does a
device-device copy which can then be inlined into subsequent inductor kernels as
you would expect, e.g.

```python
def fn(a):
    result = torch.deg2rad(a).sin()
    return torch.empty((128, 128), device=a.device).fill_(result)
```

generates the single kernel
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16384
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset  + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (0))
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
    tmp2 = 0.017453292519943295
    tmp3 = tmp1 * tmp2
    tmp4 = tl.sin(tmp3)
    tl.store(out_ptr0 + (x0), tmp4, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103880
Approved by: https://github.com/Chillee
2023-06-21 18:45:04 +00:00
Kurt Mohler
ee83c646bb Replace _prims_common.check with torch._check* (#103240)
This relands most of the changes from #102219 which were backed out by #103128. However, instead of removing `_prims_common.check`, it adds a warning and a comment mentioning that it will be removed in the future and `torch._check*` should be used instead. As mentioned in https://github.com/pytorch/pytorch/pull/103128#pullrequestreview-1466414415, `_prims_common.check` cannot yet be removed because of some internal usage
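
A migration sketch showing the `torch._check` calling convention (the example function is hypothetical); the message is a callable, so the error string is only built on failure:

```python
import torch


def trace_of(a: torch.Tensor) -> torch.Tensor:
    torch._check(a.ndim == 2, lambda: f"expected a 2-D tensor, got {a.ndim}-D")
    return torch.diagonal(a).sum()
```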

Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103240
Approved by: https://github.com/albanD
2023-06-21 00:46:17 +00:00
PyTorch MergeBot
7b6dc72ffa Revert "[decomp] Decompose logaddexp2 (#103765)"
This reverts commit bab21d20eb.

Reverted https://github.com/pytorch/pytorch/pull/103765 on behalf of https://github.com/ezyang due to looks like land race ([comment](https://github.com/pytorch/pytorch/pull/103765#issuecomment-1599030496))
2023-06-20 15:35:02 +00:00
Peter Bell
bab21d20eb [decomp] Decompose logaddexp2 (#103765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103765
Approved by: https://github.com/Chillee
2023-06-20 09:24:21 +00:00
Ivan Zaitsev
821493715c Back out "Remove check from _prims_common, replace with torch._check* (#102219)", Back out "Forwatd fix for D46427687" (#103128)
Test Plan: revertitparrot

Reviewed By: malfet

Differential Revision: D46506433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103128
Approved by: https://github.com/malfet
2023-06-07 01:41:41 +00:00
Kurt Mohler
a84bb2709a Remove check from _prims_common, replace with torch._check* (#102219)
Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102219
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-06-03 02:23:21 +00:00
PyTorch MergeBot
a7efa0ce35 Revert "Remove check from _prims_common, replace with torch._check* (#102219)"
This reverts commit fb79d43649.

Reverted https://github.com/pytorch/pytorch/pull/102219 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/5158949959/jobs/9293466925 ([comment](https://github.com/pytorch/pytorch/pull/102219#issuecomment-1574245414))
2023-06-02 20:00:48 +00:00
Kurt Mohler
fb79d43649 Remove check from _prims_common, replace with torch._check* (#102219)
Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102219
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-06-02 19:13:45 +00:00
Aleksandar Samardžić
51e0f9e858 Add missing decompositons/lowerings for logical/bitwise operators (#102566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102566
Approved by: https://github.com/lezcano, https://github.com/alexsio27444, https://github.com/jgong5
2023-06-02 14:27:17 +00:00
Nikita Karetnikov
c3ea8cc58b [pt2] convert out params in register_meta (#101344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101344
Approved by: https://github.com/lezcano
2023-05-27 18:38:52 +00:00
Animesh Jain
c2093de5d9 [partitioner] fix for rng ops (#102123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102123
Approved by: https://github.com/Chillee
2023-05-25 00:35:07 +00:00
Peter Bell
ce42010722 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-24 22:17:32 +00:00
vfdev-5
e3d97b6213 [inductor] Added smooth_l1_loss refs (#102077)
Added `smooth_l1_loss` to refs + tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102077
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-24 15:07:08 +00:00
Matthew Hoffman
29da75cc55 Enable mypy allow redefinition (#102046)
Related #101528

I tried to enable this in another PR but it uncovered a bunch of type errors: https://github.com/pytorch/pytorch/actions/runs/4999748262/jobs/8956555243?pr=101528#step:10:1305

The goal of this PR is to fix these errors.

---

This PR enables [allow_redefinition = True](https://mypy.readthedocs.io/en/stable/config_file.html#confval-allow_redefinition) in `mypy.ini`, which allows for a common pattern:

> Allows variables to be redefined with an arbitrary type, as long as the redefinition is in the same block and nesting level as the original definition.

`allow_redefinition` allows mypy to be more flexible by allowing reassignment to an existing variable with a different type... for instance (from the linked PR):

4a1e9230ba/torch/nn/parallel/data_parallel.py (L213)

A `Sequence[Union[int, torch.device]]` is narrowed to `Sequence[int]` through reassignment to the same variable.
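A minimal illustration of the pattern (illustrative names, not from the PR):

```python
def split_words(lines: list[str]) -> list[list[str]]:
    # With allow_redefinition = True, mypy lets `lines` be rebound to a
    # new type at the same block and nesting level; without the flag,
    # this reassignment is a type error.
    lines = [line.split() for line in lines]  # list[str] -> list[list[str]]
    return lines
```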

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102046
Approved by: https://github.com/ezyang
2023-05-24 07:05:30 +00:00
PyTorch MergeBot
5147fe4969 Revert "[inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)"
This reverts commit b9721bd705.

Reverted https://github.com/pytorch/pytorch/pull/101812 on behalf of https://github.com/osalpekar due to Causing test_nn_cuda tests to crash during runtime. More details at [D46093942](https://www.internalfb.com/diff/D46093942) ([comment](https://github.com/pytorch/pytorch/pull/101812#issuecomment-1560238085))
2023-05-23 23:06:21 +00:00
Peter Bell
b9721bd705 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-22 20:39:18 +00:00
Jason Ansel
0c6f409cda [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-20 03:43:33 +00:00
lezcano
1930428d89 Minor improvement on the decomposition of upsample_bilinear (#101682)
This is how it's done in core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101682
Approved by: https://github.com/ngimel
2023-05-18 16:51:51 +00:00
Peter Bell
66e398951a [inductor/decomp] Add aten._unsafe_index to disable range checks (#101602)
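The call form is assumed to mirror `aten.index` (a sketch, not taken from the PR):

```python
import torch

x = torch.arange(10.0)
idx = torch.tensor([1, 3, 5])
# Same semantics as aten.index for in-range indices; out-of-range
# behavior is undefined, which lets Inductor emit masked loads
# without bounds checks.
y = torch.ops.aten._unsafe_index(x, [idx])
```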
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101602
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-17 23:36:24 +00:00
PyTorch MergeBot
5f07c589b0 Revert "[inductor] Refactor RNG operators (#100064)"
This reverts commit 3bbf0683a1.

Reverted https://github.com/pytorch/pytorch/pull/100064 on behalf of https://github.com/izaitsevfb due to breaks inductor tests, see D45936056 ([comment](https://github.com/pytorch/pytorch/pull/100064#issuecomment-1552093728))
2023-05-17 21:16:41 +00:00
Jason Ansel
3bbf0683a1 [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-17 01:29:31 +00:00
Thibaut Durand
01da732691 Fix type annotation of torch.split (#100655)
The type annotation says `list`, but the actual return type is `tuple`:
```python
>>> import torch
>>> type(torch.arange(10).split(4))
<class 'tuple'>
```
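The fix is to annotate the return value as a tuple, roughly (a sketch of the corrected stub; the actual signature in the PR may differ):

```python
from typing import List, Tuple, Union

import torch

# Sketch of the corrected stub: split returns a tuple of tensors,
# not a list.
def split(tensor: torch.Tensor,
          split_size_or_sections: Union[int, List[int]],
          dim: int = 0) -> Tuple[torch.Tensor, ...]: ...
```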
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100655
Approved by: https://github.com/kit1980
2023-05-16 21:35:41 +00:00
Jiong Gong
788ff0623b [decomp] fix decomp of batch_norm when weight/bias is not flattened (#101059)
Fix https://github.com/pytorch/pytorch/issues/100970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101059
Approved by: https://github.com/ezyang
2023-05-16 00:00:34 +00:00
Animesh Jain
e1021ec535 [decomp] Bad accuracy for elu_backward (#100284)
Accuracy is tested by the full model at https://github.com/pytorch/pytorch/issues/100061
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100284
Approved by: https://github.com/ngimel
2023-04-29 04:21:20 +00:00
Bin Bao
b66d7007d8 Add aten.smooth_l1_loss_backward to core_aten_decompositions (#100267)
Summary: https://github.com/pytorch/pytorch/pull/100242 didn't cover all
test failures
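For reference, the backward formula the decomposition implements is roughly (a sketch for `reduction='none'`; the registered decomp may differ):

```python
import torch

# d/dx smooth_l1(x, y) is (x - y) / beta inside the beta band and
# sign(x - y) outside it.
def smooth_l1_loss_backward(grad, input, target, beta: float = 1.0):
    diff = input - target
    return grad * torch.where(diff.abs() < beta, diff / beta, diff.sign())
```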
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100267
Approved by: https://github.com/jansel
2023-04-28 19:32:17 +00:00
yhl48
07c02b9e92 Add vmap support for smooth_l1_loss_backward (#99429)
Follow-up of #98357
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99429
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-04-28 10:58:07 +00:00
Animesh Jain
a8ad0dc333 [philox_rand] Add decomps (#100206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100206
Approved by: https://github.com/ngimel
2023-04-28 02:20:13 +00:00
Angela Yi
d06b93b0c7 Decompose arange.default to arange.start_step (#99739)
The aten op arange.default is not in the core ATen IR and should decompose into the arange.start_step op.
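In decomposition form this is roughly (a sketch; the real registration also threads through dtype/device kwargs):

```python
import torch

# arange(end) is equivalent to arange.start_step(0, end, 1).
def arange_default(end, **kwargs):
    return torch.ops.aten.arange.start_step(0, end, 1, **kwargs)

assert torch.equal(arange_default(5), torch.arange(5))
```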
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99739
Approved by: https://github.com/SherlockNoMad
2023-04-27 19:06:36 +00:00
Animesh Jain
539363a873 [inductor] Lowering of rngprims philox_rand (#99289)
An example graph with dynamic shapes on:

`arg0_1` is the seed, `arg1_1` is the base offset.
~~~
  ===== Forward graph 0 =====
 <eval_with_key>.5 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: f32[s0]):
        # File: /scratch/anijain/work/pytorch/test/inductor/test_torchinductor.py:4605, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0] = torch.ops.aten.mul.Tensor(getitem, arg3_1);  getitem = arg3_1 = None

        # File: /scratch/anijain/work/pytorch/test/inductor/test_torchinductor.py:4606, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        add_5: i64[] = torch.ops.aten.add.Tensor(add_4, 3);  add_4 = None
        div: i64[] = torch.ops.aten.div.Tensor_mode(add_5, 4, rounding_mode = 'floor');  add_5 = None
        mul_2: i64[] = torch.ops.aten.mul.Tensor(div, 4);  div = None
        return (mul_1, mul_2)

~~~

Note that the output `mul_2` is essentially the total `numel` consumed by the random ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99289
Approved by: https://github.com/jansel
2023-04-26 01:22:41 +00:00
Animesh Jain
6bc4651193 [philox_rand] Dynamic shape support (#99290)
Extends the functionalization-of-RNG work to dynamic shapes. An example of the generated graph looks like this:

~~~

[2023-04-24 21:41:37,446] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
 ===== Forward graph 1 =====
 <eval_with_key>.7 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: Sym(s1), arg4_1: f32[s0, s1]):
        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:46, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0, s1] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem, arg4_1);  getitem = arg4_1 = None

        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:47, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg3_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0, s1] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        return (mul_1, add_4)

~~~

Each rand op is accompanied by its offset calculation op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99290
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-25 22:40:28 +00:00
XiaobingSuper
41069f2faa inductor: align inductor behavior with eager mode for split_with_sizes (#99702)
Fixes https://github.com/pytorch/pytorch/issues/99686. In eager mode, if the given split sizes do not meet the requirements, an error is reported, but Inductor would run anyway. This PR aligns Inductor's behavior with eager mode; after this PR the behavior is:

```
Traceback (most recent call last):
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1267, in run_node
    return node.target(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/functional.py", line 189, in split
    return tensor.split(split_size_or_sections, dim)
  File "/home/xiaobing/pytorch-offical/torch/_tensor.py", line 804, in split
    return torch._VF.split_with_sizes(self, split_size, dim)
  File "/home/xiaobing/pytorch-offical/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1095, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1259, in dispatch
    return decomposition_table[func](*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_decomp/decompositions.py", line 1102, in split_with_sizes
    raise ValueError(
ValueError: Split sizes don't add up to the tensor's size in the given dimension

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1215, in get_fake_value
    return wrap_fake_exception(
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 835, in wrap_fake_exception
    return fn()
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1216, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1279, in run_node
    raise RuntimeError(
RuntimeError: Failed running call_function <function split at 0x7f45b8402ee0>(*(FakeTensor(..., size=(1, 5)), [2, 1, 1]), **{'dim': 1}):
Split sizes don't add up to the tensor's size in the given dimension
(scroll up for backtrace)

The above exception was the direct cause of the following exception:
```
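The shapes in the traceback reduce to a small reproducer (eager raises because 2 + 1 + 1 != 5):

```python
import torch

x = torch.randn(1, 5)
# Split sizes sum to 4 but dim 1 has length 5, so eager mode raises;
# after this PR, Inductor raises the same error instead of running.
x.split([2, 1, 1], dim=1)
```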

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99702
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/jansel
2023-04-25 01:13:52 +00:00
Will Constable
63690afc6c Make CI error on inductor fallback when decomp is available (#99473)
Fixes #99446

Remove the warning, as that annoyed end-users who don't know what to do about it.

Instead, try to hold the line by preventing any decomp from being added without making
the corresponding change to inductor's fallbacks.

Note: we probably still need to better document how to update inductor's decomps; for now it's pretty much "go ask the inductor team for advice".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99473
Approved by: https://github.com/ezyang, https://github.com/ngimel, https://github.com/jansel
2023-04-21 05:47:28 +00:00
PyTorch MergeBot
5cb788a9a5 Revert "[cuda rng] Making offset calculation independent of device properties (#98988)"
This reverts commit 26f318574f.

Reverted https://github.com/pytorch/pytorch/pull/98988 on behalf of https://github.com/anijain2305 due to Diagnosing if sebotnet has flakiness
2023-04-19 17:23:40 +00:00
Animesh Jain
26f318574f [cuda rng] Making offset calculation independent of device properties (#98988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98988
Approved by: https://github.com/ngimel
2023-04-19 01:35:44 +00:00
Animesh Jain
fdbc8625a1 Functionalization of torch.rand/rand_like ops (#97377)
This PR introduces the functionalization of RNG ops. Key points are

* Introduces a new `philox_rand` prim operator that accepts seed, offset.
* Adds decompositions for random operators that use these philox_rand prims
* Adds a PhiloxStateTracker to track the offset for each occurrence of rand ops
* Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset>
* Monkeypatches set_rng_state and get_rng_state during AOT Autograd tracing to record the RNG state behavior
* Raises an assertion for CPU, because CPU does not use Philox RNG.

Not dealt with in this PR:
* dropout op - offset calculation is different
* other distributions like normal, poisson, etc.
* Inductor support
* Cudagraph support
* Dynamic shape support

An example
~~~

class Custom(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        a = torch.rand_like(x) * x
        a = torch.rand_like(x) * a
        return a

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * torch.rand_like(grad_out) * torch.cos(x)

====== Forward graph 0 ======
def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0)
    philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32);  add = None
    mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1);  philox_rand = None
    add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4);  fwd_base_offset_1 = None
    philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32);  fwd_seed_1 = add_1 = None
    mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul);  philox_rand_1 = mul = None
    return [mul_1, primals_1]

====== Backward graph 0 ======
def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0);  bwd_base_offset_1 = None
    philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32);  bwd_seed_1 = add_2 = None
    mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2);  tangents_1 = philox_rand_2 = None
    cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1);  primals_1 = None
    mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos);  mul_2 = cos = None
    return [mul_3]

~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97377
Approved by: https://github.com/ezyang
2023-04-16 09:55:56 +00:00
Peter Bell
7b91bd2a7b [primTorch] Add count_nonzero (#98995)
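The reference boils down to summing a boolean mask, roughly (a sketch; the registered ref may handle dims differently):

```python
import torch

def count_nonzero(a: torch.Tensor, dim=None) -> torch.Tensor:
    mask = a != 0
    return mask.sum() if dim is None else mask.sum(dim=dim)

assert count_nonzero(torch.tensor([0, 1, 2, 0])) == 2
```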
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98995
Approved by: https://github.com/lezcano
2023-04-13 22:08:19 +00:00