Implements forward automatic differentiation support for miopen_batch_norm and unskips the associated unit tests. Also fixes a class of functorch-related unit tests that fail a contiguous-tensor assertion in BatchNorm_miopen.cpp. The fix limits miopen_batch_norm to tensors with at least 3 dimensions; the same restriction already exists in the cudnn path, which is why the tests in question failed only on ROCm.
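A minimal sketch of the new restriction, assuming illustrative names (the real guard lives in the C++ backend-selection logic, not in Python):
```python
# Hypothetical Python rendering of the eligibility check; the helper name
# and exact conditions here are illustrative assumptions only.
def miopen_batch_norm_eligible(input):
    # Mirror the cudnn path: require at least 3 dimensions (N, C, *spatial),
    # so smaller inputs fall back to the native kernel instead of hitting
    # the contiguity assertion in BatchNorm_miopen.cpp.
    return input.is_cuda and input.dim() >= 3
```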
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125069
Approved by: https://github.com/jeffdaily, https://github.com/andrewor14
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.
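As a rough sketch of where this is headed (op names and the returned tuple are taken from the summary above; exact signatures are assumptions, not the final implementation):
```python
import torch

# Hedged sketch of the future decomp routing; not the actual implementation.
def batch_norm_sketch(x, weight, bias, running_mean, running_var,
                      training, momentum, eps):
    if training and running_mean is not None:
        # Updates running stats in place; the right backend kernel
        # (cpu/cuda/cudnn/miopen) is picked automatically.
        out, save_mean, save_rstd, reserve = torch.ops.aten.batch_norm_with_update(
            x, weight, bias, running_mean, running_var, momentum, eps)
    else:
        out, save_mean, save_rstd, reserve = torch.ops.aten.batch_norm_no_update(
            x, weight, bias, running_mean, running_var, momentum, eps)
    return out
```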
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
# Motivation
In the backward pass of per-parameter-sharding FSDP, each rank performs a reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0-th dimension and concatenates all slices along the 1st dimension. Gradient tensors are padded before concatenation when `tensor.size(0) % world_size != 0`.
### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):
Input tensors:
```
AAAA BBB CC
AAAA BBB
     BBB
```
Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```
### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):
Input tensors:
```
AAAA BBB CC DD
AAAA BBB    DD
     BBB    DD
            DD
```
Reduce-scatter-copy-in first pads each tensor's 0-th dimension to a multiple of `world_size`:
```
AAAA BBB CC DD
AAAA BBB 00 DD
     BBB    DD
     000    DD
```
Then chunk each padded tensor and concatenate the slices along dim 1 to form the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```
The performance of reduce-scatter-copy-in is critical to per-parameter-sharding FSDP. However, implementing reduce-scatter-copy-in by composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.
# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:
```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```
This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with baselines composed of existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.
## Requirements on input
1. If input tensors have different ndims, `dim` should be non-negative and less than the ndim of every input tensor. If all input tensors have the same ndim, both negative and non-negative `dim` are supported.
2. For `wrapped_dim`, all tensors should have the same sizes in dimensions `0, ..., wrapped_dim-1`; there is no requirement on the sizes of dimension `wrapped_dim` and beyond.
3. `num_chunks` must be positive.
4. The input tensor list must be non-empty, and each input tensor must have at least 1 element.
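A reference sketch of the `dim=0` semantics, composed from existing ATen ops to mirror the examples above (the actual composite implementation may differ):
```python
import torch
import torch.nn.functional as F

# Reference sketch of _chunk_cat semantics for dim=0; the real ATen
# implementation may differ.
def chunk_cat_reference(tensors, num_chunks):
    rows = [[] for _ in range(num_chunks)]
    for t in tensors:
        pad = (-t.size(0)) % num_chunks  # pad dim 0 to a multiple of num_chunks
        if pad:
            t = F.pad(t, [0, 0] * (t.dim() - 1) + [0, pad])
        for i, chunk in enumerate(t.chunk(num_chunks, dim=0)):
            rows[i].append(chunk.reshape(-1))  # flatten each slice
    # Row i concatenates the i-th slices of all tensors.
    return torch.stack([torch.cat(r) for r in rows])

# Example 1 above: world_size=3 with A (2x4), B (3x3), C (1x2) -> 3x9 output.
A, B, C = torch.ones(2, 4), torch.ones(3, 3), torch.ones(1, 2)
assert chunk_cat_reference([A, B, C], num_chunks=3).shape == (3, 9)
```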
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
We use the fact that we now propagate indexing properly to avoid having
to maintain two different implementations of the op. Doing this, we also remove
a spurious guard on this op.
We move the ref into a decomp, as we now use advanced indexing.
The only difference in the implementation is that we now use
advanced indexing rather than `torch.cat`.
We also remove it from core. Let's see how this goes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119857
Approved by: https://github.com/peterbell10, https://github.com/larryliu0820
ghstack dependencies: #119863, #119864
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.
This PR does the following:
* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.
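A hedged sketch of the ordering point from the second bullet, using the public `torch._check` APIs (the actual meta kernel differs in detail):
```python
import torch

# Hedged sketch (not the actual meta kernel): record size-ness *before*
# asserting the equality, so the refined ranges exist when the ShapeEnv
# decides whether a symbol replacement is safe.
def check_split_with_sizes(tensor, split_sizes, dim=0):
    for s in split_sizes:
        torch._check_is_size(s)  # mark as size-like first (refines range)
    torch._check(
        sum(split_sizes) == tensor.shape[dim],
        lambda: "Split sizes don't add up to the tensor's size in the given dimension",
    )
```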
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
Fixes https://github.com/pytorch/pytorch/issues/117361
The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.
This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.
The infra pieces of this PR are:
* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.
The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.
As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)
When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard-on-unbacked-SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them wherever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
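A hedged illustration of the new Python API:
```python
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

# Evaluate a guard assuming size-like unbacked SymInts are >= 2, instead
# of guarding on their actual (data-dependent) value.
def batch_is_nonempty(t):
    # With t.shape[0] == u0 (size-like, range [0, inf]), a plain
    # `t.shape[0] == 0` could raise a data-dependent guard error; the
    # size-oblivious form statically evaluates to False.
    return not guard_size_oblivious(t.shape[0] == 0)
```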
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, adding the `OpInfo` test in a separate PR to establish a healthy baseline.
Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.
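For reference, a small usage sketch of the op exercised here (assuming the Python binding `torch.split_with_sizes_copy` and its `out=` overload):
```python
import torch

# split_with_sizes_copy returns copies rather than views; the out=
# overload writes into preallocated tensors of matching shapes.
x = torch.arange(10.0)
a, b = torch.split_with_sizes_copy(x, [4, 6])
outs = [torch.empty(4), torch.empty(6)]
torch.split_with_sizes_copy(x, [4, 6], out=outs)
```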
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
All single-element list types are `Tensor[]`, so they will always be `Tuple`.
I don't know of any way to easily access the pyi type and compare it to a real run, so there is no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
Summary: Need to pass this along
Test Plan:
```
cd ~/fbsource/fbcode/executorch/backends/xnnpack/test
buck test fbcode//mode/dev-nosan :test_xnnpack_ops -- test_fp32_sdpa
buck run fbcode//mode/dev-nosan :test_xnnpack_models -- executorch.backends.xnnpack.test.models.llama2_et_example.TestLlama2ETExample.test_fp32
```
Reviewed By: larryliu0820
Differential Revision: D52812369
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117579
Approved by: https://github.com/larryliu0820
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E from user code:
E File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E image = interpolate(
E
E Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E You can suppress this exception and fall back to eager by setting:
E import torch._dynamo
E torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
Summary:
A follow-up for #117097. In that PR I didn't add
`_scaled_dot_product_flash_attention_for_cpu` to the core_aten_decomposition
table. This PR does that and also adds a unit test.
Test Plan: python test/export/test_export.py -k
test_scaled_dot_product_attention
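For illustration, a hedged sketch of the membership check the unit test boils down to:
```python
import torch
from torch._decomp import core_aten_decompositions

# The CPU SDPA op should now have an entry in the core ATen decomp table.
table = core_aten_decompositions()
assert torch.ops.aten._scaled_dot_product_flash_attention_for_cpu.default in table
```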
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
Summary:
rrelu_with_noise() was listed as having default parameters in the schema, but the
actual code definition didn't have them.
The failing example was calling rrelu(), which DOES have default parameters and
passes those defaulted values to C++. Under the covers, the C++ code was calling
the Python version of rrelu_with_noise().
Although the C++ code was passing all the values to the Python version of
rrelu_with_noise(), the PyTorch C++ -> Python dispatch code looks at the schema
and strips any parameters that match the schema's listed defaults, so a schema
that shows defaults absent from the code is a problem.
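A hedged sketch of the fix (the defaults shown follow `nn.RReLU`, i.e. lower=1/8 and upper=1/3; the body is elided):
```python
# Hedged sketch: the Python-side decomposition now carries the same
# defaults the schema declares, so arguments stripped by the dispatcher
# because they match schema defaults still bind correctly.
def rrelu_with_noise(self, noise, lower=0.125, upper=1.0 / 3,
                     training=False, generator=None):
    ...  # elementwise rrelu computation, unchanged by this fix
```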
Test Plan:
I added a unit test for this specific case. It would probably be better to write
a more general one to validate all the ops against their schemas - but I haven't
learned enough about the test harness to do that yet.
Fixes #115811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117141
Approved by: https://github.com/yanboliang, https://github.com/oulgen
Summary: As titled. #115913 added
`_scaled_dot_product_flash_attention_for_cpu` and the export result of
`scaled_dot_product_attention` includes this op. This PR adds a
decomposition so that it is decomposed the same way, via
`_scaled_dot_product_attention_math`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117097
Approved by: https://github.com/lezcano
Summary: currently pad_sequence causes unintended symbolic shape specialization in export. Adding a decomp avoids the C++ kernel that caused the specialization.
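A hedged sketch of what such a decomp can look like (the decomp actually registered may differ in details):
```python
import torch
import torch.nn.functional as F

# Pad each sequence to the max length along dim 0, then stack; expressed
# in pad/stack so no data-dependent C++ kernel is hit.
def pad_sequence_sketch(sequences, batch_first=False, padding_value=0.0):
    max_len = max(s.size(0) for s in sequences)
    padded = [
        F.pad(s, [0, 0] * (s.dim() - 1) + [0, max_len - s.size(0)],
              value=padding_value)
        for s in sequences
    ]
    return torch.stack(padded, dim=0 if batch_first else 1)
```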
Test Plan: buck test mode/opt caffe2/test:test_export -- -r pad_sequence
Reviewed By: SherlockNoMad
Differential Revision: D52345667
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116285
Approved by: https://github.com/peterbell10, https://github.com/lezcano
`torch._C.has_mkldnn` does not respect cases where users try to disable mkldnn using `torch._C._set_mkldnn_enabled()`. This is relevant to edge use cases, where they do not want decompositions to go to the ATen opset, and do not want the mkldnn operator to appear in the graph.
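A small usage sketch of the runtime toggle this change respects:
```python
import torch

# With this fix, decompositions consult the runtime toggle below rather
# than the build-time constant torch._C.has_mkldnn.
torch._C._set_mkldnn_enabled(False)  # user opts out of mkldnn
assert not torch._C._get_mkldnn_enabled()
```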
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115448
Approved by: https://github.com/jgong5, https://github.com/ydwu4
`_scaled_dot_product_flash_attention` does not have a
`Tensor? attn_mask=None` argument,
but `scaled_dot_product_attention` does. In the original decomp there was a
mixup where I added this argument to
`_scaled_dot_product_flash_attention`.
Fix it so that `_scaled_dot_product_flash_attention` is decomposed correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113102
Approved by: https://github.com/ezyang
## Context
Add decompositions for `aten.max`, `aten.min`, and `aten.var_mean`. These operators follow a pattern of returning a tuple of outputs from two component operators:
```
aten.max(x) -> return aten.amax(x), aten.argmax(x)
aten.min(x) -> return aten.amin(x), aten.argmin(x)
aten.var_mean(x) -> return aten.var(x), aten.mean(x)
```
For `var_mean`, the `refs` implementation was doing something similar, so I changed it to call `torch.` ops instead, as was done previously for other `refs` implementations. cc: @peterbell10 @lezcano
Note that Inductor lowers all these directly, so they are excluded from the Inductor decomp table.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110906
Approved by: https://github.com/manuelcandales
Summary:
## Context
Both `aten.sum` and `aten.squeeze` have a "most generic" variant in the form of `aten.sum.dim_IntList` and `aten.squeeze.dims`, respectively. Add decompositions for the other, non-generic variants of these operators to express them using the most generic variant.
Note that to register these decomps, the reference implementations under `_refs` had to be removed as registered decompositions. cc: @lezcano @peterbell10
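Hedged sketches of the idea (the registered decomps may differ in details):
```python
import torch

# Express the non-generic variants through the most generic overload.
def sum_default_sketch(x, *, dtype=None):
    return torch.ops.aten.sum.dim_IntList(x, list(range(x.dim())), False,
                                          dtype=dtype)

def squeeze_default_sketch(x):
    dims = [d for d in range(x.dim()) if x.shape[d] == 1]
    return torch.ops.aten.squeeze.dims(x, dims)
```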
Test Plan: Github CI + Meta Internal CI
Differential Revision: D49965952
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110645
Approved by: https://github.com/peterbell10, https://github.com/digantdesai, https://github.com/manuelcandales
## Context
Add existing decomps for `lift_fresh`, `split.Tensor`, and `unbind` to the core ATen decomposition table. Do not use them in inductor, since Inductor currently lowers these directly.
One note though is that `lift_fresh`'s decomposition has a note saying it's not correct under autograd. However, my understanding is that these decompositions are registered to the `"post_autograd"` decomposition table, meaning autograd wouldn't be a factor. Would like some confirmation that this premise is correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110102
Approved by: https://github.com/jansel
## Context
Introduce a core decomposition for `aten.floor_divide` into other `aten` ops, and add it to the core ATen decomposition table.
This replaces the decomposition of `floor_divide` that was used by Inductor. I noticed there was a note on that decomposition
```
# TorchInductor-only decomposition. It should not be taken to core.
# See https://github.com/pytorch/torchdynamo/pull/1120
```
but couldn't discern the reason why this is the case. cc: @lezcano
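A hedged sketch of the core decomposition (the registered decomp may differ in details):
```python
import torch

# Floor division is true division with floor rounding, which also
# preserves integer dtypes.
def floor_divide_sketch(a, b):
    return torch.div(a, b, rounding_mode="floor")
```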
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110046
Approved by: https://github.com/peterbell10
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but have aten op overloads that have out parameters. Additionally, Python decompositions may register `OpOverloadPacket`'s so decompositions need to be tested to ensure all `OpOverloads` still function for the `Meta` key (e.g. if a python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the python function needs to support receiving out arguments)
- Add out parameter wrappers to python decomps for aten ops that have out overloads
CC. @ezyang @albanD @lezcano
Fixes #107713
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
Python decompositions wrapped by `out_wrapper` need to be unwrapped before compiling with TorchScript since:
- `out_wrapper` extends the decompositions signature with an out parameter, however this `out` parameter is not present in the source code of the original decomposition so the resulting `ScriptFunction` will not have an `out` parameter
- `out_wrapper` is in the `torch._prims_common.wrappers` module so its `globals()` are different to the globals of the decomposition to be wrapped. This may cause symbol resolution to fail with the TorchScript compiler since it is compiling the unwrapped decomps source code rather than the wrapper
The python decomposition for `aten.trace` is wrapped as an example, other decompositions are to be fixed in https://github.com/pytorch/pytorch/pull/107707
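For context, a hedged sketch of the wrapped form (the actual `aten.trace` ref lives under `torch._refs` and may differ):
```python
import torch
from torch._prims_common.wrappers import out_wrapper

# out_wrapper adds an out= parameter that is absent from the wrapped
# function's source code, which is what trips up the TorchScript compiler.
@out_wrapper()
def trace(self):
    return torch.sum(torch.diagonal(self))
```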
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109367
Approved by: https://github.com/lezcano
The "safety" aspect refers to the output not being registered as aliasing the
input, but after AOTAutograd I don't think this distinction matters. However,
we shouldn't use the same decomposition as the safe variant in case the backend
doesn't want to decompose split.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109668
Approved by: https://github.com/lezcano
ghstack dependencies: #109667
Summary: See the comment in the code for the reasons behind the change
Test Plan:
buck2 test executorch/examples/export/test:test_export --
test_vit_export_to_executorch
Differential Revision: D48992180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108608
Approved by: https://github.com/larryliu0820
Summary:
Earlier, the decomp was routing the _flash* variant to the _math variant, and
this was resulting in failures during torch.export, for some reason that I
couldn't trace.
However, it seems that we should really have a decomp for
scaled_dot_product_attention, instead of
scaled_dot_product_flash_attention. Right?
This diff adds that. Plus it adds a test to check whether a model exported
via two-stage export has decomposed the op. This test needs improvement
to figure out what the core aten opset is and check for anything that is
not inside it.
Test Plan:
test_model_exports_to_core_aten
Differential Revision: [D48917461](https://our.internmc.facebook.com/intern/diff/D48917461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108371
Approved by: https://github.com/larryliu0820
We allow registering decomps for HigherOrderOp via the existing decomp
mechanisms:
- I refactored those APIs to accept torch._ops.OperatorBase, which is the base
class for torch.ops.HigherOrderOperator and torch.ops.OpOverload
- HigherOrderOps must directly call maybe_handle_decomp in their
ProxyTorchDispatchMode handling in order to resolve decompositions. We
can change this in the future so that they do not need to do this.
Next, we add an inductor decomp for out_dtype. This decomp shouldn't be
generally available because we want to preserve out_dtype for the backend
in other use cases (e.g., executorch).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108080
Approved by: https://github.com/HDCharles
`scaled_dot_product_attention` used to be decomposed in pre-autograd, given that it calls `_scaled_dot_product_attention_math` and `_scaled_dot_product_attention_math` only has a `CompositeImplicitAutograd` kernel. As a result it's decomposed into ops with finer granularity.
However recent PRs (#103826, #105131) added new logic in `scaled_dot_product_attention` and now it calls `_scaled_dot_product_flash_attention` which contains a CPU kernel. This results in `_scaled_dot_product_flash_attention` showing up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still being decomposed the same way as before, i.e., going through `_scaled_dot_product_attention_math`. Notice that this decomp rule should be excluded by inductor.
Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108180
Approved by: https://github.com/SherlockNoMad
Summary: Before this change, the tensor_indices_or_sections variant of aten.tensor_split causes a `RuntimeError: The tensor has a non-zero number of elements` due to that operation needing to introspect data. Decomposing into one of the other two tensor_split variants fixes the problem.
Test Plan:
Enabled tensor_split tests in test/inductor/test_torchinductor_opinfo.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107251
Approved by: https://github.com/ezyang, https://github.com/eellison
Summary:
(From Brian Hirsh)
Description copied from what I put in a comment in this PR: https://github.com/pytorch/pytorch/pull/106329
So, the slightly-contentious idea behind this PR is that lower in the stack, I updated torch._decomps.get_decomps() to check not only the decomp table to see if a given op has a decomposition available, but to also check the dispatcher for any decomps registered to the CompositeImplicitAutograd key (link: https://github.com/pytorch/pytorch/pull/105865/files#diff-7008e894af47c01ee6b8eb94996363bd6c5a43a061a2c13a472a2f8a9242ad43R190)
There's one problem though: we don't actually make any hard guarantees that a given key in the dispatcher points does or does not point to a decomposition. We do rely pretty heavily, however, on the fact that everything registered to the CompositeImplicitAutograd key is in fact a decomposition into other ops.
QAT would like this API to faithfully return "the set of all decomps that would have run if we had traced through the dispatcher". However, native_batch_norm is an example of an op that has a pre-autograd decomp registered to it (through op.py_impl()), but the decomp is registered directly to the Autograd key instead of being registered to the CompositeImplicitAutograd key.
If we want to provide a guarantee to QAT that they can programatically access all decomps that would have run during tracing, then we need to make sure that every decomp we register to the Autograd key is also registered to the CompositeImplicitAutograd key.
This might sound kind of painful (since it requires auditing), but I think in practice this basically only applies to native_batch_norm.
Test Plan: python test/test_decomp.py
Differential Revision: D48607575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107791
Approved by: https://github.com/jerryzh168, https://github.com/SherlockNoMad
When exporting dropout with a CPU tensor, we get the following graph module:
```
class GraphModule(torch.nn.Module):
def forward(self, arg0_1: f32[512, 10]):
empty_memory_format: f32[512, 10] = torch.ops.aten.empty.memory_format([512, 10], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False, memory_format = torch.contiguous_format)
bernoulli_p: f32[512, 10] = torch.ops.aten.bernoulli.p(empty_memory_format, 0.9); empty_memory_format = None
div_scalar: f32[512, 10] = torch.ops.aten.div.Scalar(bernoulli_p, 0.9); bernoulli_p = None
mul_tensor: f32[512, 10] = torch.ops.aten.mul.Tensor(arg0_1, div_scalar); arg0_1 = div_scalar = None
return (mul_tensor,)
```
In addition, if we export in eval() mode, we get an empty graph.
However, when exporting with a CUDA tensor, we get:
```
class GraphModule(torch.nn.Module):
def forward(self, arg0_1: f32[512, 10]):
native_dropout_default = torch.ops.aten.native_dropout.default(arg0_1, 0.1, True); arg0_1 = None
getitem: f32[512, 10] = native_dropout_default[0]; native_dropout_default = None
return (getitem,)
```
and exporting in eval() mode still leaves a dropout node in the graph.
This PR makes exporting with a CPU tensor also produce aten.native_dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106274
Approved by: https://github.com/ezyang
This pattern shows up in torchrec KeyedJaggedTensor. Most
of the change in this PR is mechanical: whenever we failed
an unbacked symint test due to just error checking, replace the
conditional with something that calls expect_true (e.g.,
torch._check or TORCH_SYM_CHECK).
Some of the changes are a bit more nuanced, I've commented on the PR
accordingly.
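A hedged illustration of the mechanical change on the Python side:
```python
import torch

# Turn a guard-inducing error check into torch._check, which becomes a
# deferred runtime assert for unbacked SymInts rather than a guard.
def validate_total(lengths_sum, total):
    # Before: `if lengths_sum != total: raise RuntimeError(...)` would
    # guard on a data-dependent value and fail at trace time.
    torch._check(lengths_sum == total,
                 lambda: f"expected lengths to sum to {total}")
```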
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106788
Approved by: https://github.com/lezcano
ghstack dependencies: #106720
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.
Also add it to the list of core decompositions.
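A hedged sketch of the redirect (the registered decomp may differ in details):
```python
import torch

# _unsafe_index is aten.index minus bounds checking, so semantically the
# decomposition is a direct call.
def unsafe_index_sketch(x, indices):
    return torch.ops.aten.index.Tensor(x, indices)
```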
Test Plan: contbuild and OSS CI (similar to D40075277)
Differential Revision: D48163393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
- Enabled LSTM weight prepack in inductor.
- Added an mkldnn decomposition for LSTM that won't change for different `seq_lens`. With the previous decomposition, in the dynamic-shapes use case where `seq_lens` changes, the graph would be different.
- Extended several inductor utility functions to support `List(Tensor`) as input. Previously those functions only supported `Tensor` input.
**Update 2023-07-26:**
- https://github.com/pytorch/pytorch/pull/103851 has moved CPU weight packing to be after AOTAutograd. Fixed the support in this PR to follow the same way (mainly in 3b207f7f1c (diff-6dffed1ade0ba3e887f9a4eafa3bfcec267ab2365b8adcb91bd391f49b3fd2e3)).
LSTM is decomposed into `aten.mkldnn_rnn_layer` by layer and by direction. The weight prepack is done at the `mkldnn_rnn_layer` level.
- Add a fix in rnn `__get_state__` function in case we need to recompile an `LSTM` module.
When compiling the module, the weights tensors which are the `named_parameters` of the module are converted to `functional_tensor` here:
76fb72e24a/torch/nn/utils/stateless.py (L125-L128)
The forward function of LSTM will be called:
76fb72e24a/torch/_functorch/aot_autograd.py (L3379-L3381)
In the forward function, the `_flat_weights` are updated to be the same as the weights, thus becoming `functional_tensor`:
76fb72e24a/torch/nn/modules/rnn.py (L775-L778)
The weights tensors are converted back to the original tensors (which are not `functional_tensor` anymore) before exiting the `_reparametrize_module` context here:
76fb72e24a/torch/nn/utils/stateless.py (L130-L142)
But since `_flat_weights` is not in the `named_parameters` of the module, it's still `functional_tensor` ([link of the parameters that will be converted to functional and reverted back](76fb72e24a/torch/_functorch/aot_autograd.py (L3695-L3698))).
At this moment, if we need to recompile the model, `deepcopy` will be called:
76fb72e24a/torch/_dynamo/utils.py (L915-L917)
And it will report `UnImplemented` since we have `functional_tensor` (`_flat_weights`) and will trigger graph break which is not what we expect:
76fb72e24a/torch/_subclasses/meta_utils.py (L514)
Added a fix in `__get_state__` to update `_flat_weights` if the weights have changed, fixing this issue. The fix is covered in the `test_lstm_packed` UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103071
Approved by: https://github.com/jgong5, https://github.com/jansel
The decomposition for unfold uses `as_strided` which forces the input to be
realized. Instead, this implements it as a `GenericView` with reindexing
which removes the need to realize, though it does call `mark_reuse` in case
the input computation is expensive and the windows overlap.
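A hedged illustration of the index math the new lowering uses, in eager code: entry `(i, j)` of `unfold(0, size, step)` reads input element `i * step + j`, so no `as_strided` (and no realized buffer) is needed:
```python
import torch

x = torch.arange(10)
size, step = 4, 2
# Reconstruct unfold's output purely via the reindexing rule.
windows = torch.stack([x[i * step : i * step + size]
                       for i in range((x.numel() - size) // step + 1)])
assert torch.equal(windows, x.unfold(0, size, step))
```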
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105165
Approved by: https://github.com/lezcano, https://github.com/jansel
Summary:
Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only when `max_autotune_gemm=False` and the activation is fusible.
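A hedged sketch of the before/after rewrite (the pattern matcher itself lives in `post_grad.py`; this only shows the intended replacement):
```python
import torch

def before(bias, a, b):
    return torch.relu(torch.addmm(bias, a, b))

def after(bias, a, b):
    # Fused GEMM + activation epilogue; use_gelu=True covers the GELU case.
    return torch.ops.aten._addmm_activation(bias, a, b, use_gelu=False)
```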
Test Plan:
$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v
(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok
----------------------------------------------------------------------
Ran 1 test in 13.415s
OK
Reviewers: @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
This adds an expect-test that finds the set of core ATen operators by
subtracting the operators with decomposition in core_aten_decompositions from the
set of all operators that have decompositions and could be decomposed.
This is useful because if you add a new decomposition but forget to add it to
the list of core decompositions, it will appear in the PR diff.
Also, by going through this list I have identified some operators where the functional variant is decomposed, but not the inplace variant, which must be an oversight.
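A hedged sketch of the computation behind the expect-test (import names are assumptions about `torch._decomp` internals):
```python
from torch._decomp import core_aten_decompositions, decomposition_table

# Ops that have a decomposition available but are absent from the core
# table are snapshotted by the expect-test, so a forgotten registration
# shows up in the PR diff.
core = set(core_aten_decompositions())
decomposable = set(decomposition_table)
missing_from_core = sorted(str(op) for op in decomposable - core)
```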
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104262
Approved by: https://github.com/lezcano
torch.bucketize takes a tensor of values, and a "boundaries" tensor, which is a sorted list of values that represent buckets. It returns the bucket that each value lies in. E.g. if values = [1, 5, 3, 6] and boundaries=[0, 2, 4, 6, 8], the output will be [1, 3, 2, 4].
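A runnable version of that example (note that reproducing the stated output for the boundary-equal value 6 requires `right=True`):
```python
import torch

values = torch.tensor([1, 5, 3, 6])
boundaries = torch.tensor([0, 2, 4, 6, 8])
# tensor([1, 3, 2, 4]) -- each entry is the bucket index of the value
print(torch.bucketize(values, boundaries, right=True))
```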
The current decomposition of this op doesn't work well with dynamic shapes. It performs a binary search, which bakes in the number of iterations of the binary search and requires recompiling (I don't completely understand why/where this happens). I'm not sure whether there's a good way to write a decomposition for this op that will work with dynamic shapes.
Use case: this op is very similar to some operations needed by jagged tensors. As a first step, I want to add a lowering for aten.bucketize and make use of opinfos. #104007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104396
Approved by: https://github.com/Chillee
Currently these are decomposed into `as_strided`, which forces a buffer to be
realized. Instead, this lowers them into a native inductor view node and so
doesn't require any buffers to be realized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103755
Approved by: https://github.com/jansel
Currently calling the fill.Tensor overload under `torch.compile` results in a
`DataDependentOutputException` due to the `.item()` call. This instead does a
device-device copy which can then be inlined into subsequent inductor kernels as
you would expect, e.g.
```python
def fn(a):
result = torch.deg2rad(a).sin()
return torch.empty((128, 128), device=a.device).fill_(result)
```
generates the single kernel
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 16384
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (0))
tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
tmp2 = 0.017453292519943295
tmp3 = tmp1 * tmp2
tmp4 = tl.sin(tmp3)
tl.store(out_ptr0 + (x0), tmp4, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103880
Approved by: https://github.com/Chillee
Fix https://github.com/pytorch/pytorch/issues/99686. In eager mode, if the given split sizes do not meet the requirements, an error is reported, but inductor runs anyway. We should align inductor's behavior with eager mode. After this PR, the behavior is:
```
Traceback (most recent call last):
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1267, in run_node
return node.target(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/functional.py", line 189, in split
return tensor.split(split_size_or_sections, dim)
File "/home/xiaobing/pytorch-offical/torch/_tensor.py", line 804, in split
return torch._VF.split_with_sizes(self, split_size, dim)
File "/home/xiaobing/pytorch-offical/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1095, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1259, in dispatch
return decomposition_table[func](*args, **kwargs)
File "/home/xiaobing/pytorch-offical/torch/_decomp/decompositions.py", line 1102, in split_with_sizes
raise ValueError(
ValueError: Split sizes don't add up to the tensor's size in the given dimension
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1215, in get_fake_value
return wrap_fake_exception(
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 835, in wrap_fake_exception
return fn()
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1216, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1279, in run_node
raise RuntimeError(
RuntimeError: Failed running call_function <function split at 0x7f45b8402ee0>(*(FakeTensor(..., size=(1, 5)), [2, 1, 1]), **{'dim': 1}):
Split sizes don't add up to the tensor's size in the given dimension
(scroll up for backtrace)
The above exception was the direct cause of the following exception:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99702
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/jansel
Fixes #99446
Remove the warning, as that annoyed end-users who don't know what to do about it.
Instead, try to hold the line by preventing any decomp from being added without making
the corresponding change to inductor's fallbacks.
Note: we probably still need to better document how to update inductor's decomps,
for now it's pretty much "go ask the inductor team for advice"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99473
Approved by: https://github.com/ezyang, https://github.com/ngimel, https://github.com/jansel