Follow-up to #86167; the number of pools was mistakenly ignored, and the default workspace size appears to be too small to match the cuBLAS kernels selected before the explicit allocation change.
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89027
Approved by: https://github.com/ngimel
Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100.
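For illustration, the new toggle mirrors the `fp16` flag from #67946 (a minimal sketch):
```python
import torch

# Default after this PR: reduced precision reductions are disallowed in
# BFloat16 GEMMs. Opt back in explicitly if the performance cost matters:
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = True
```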
CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89172
Approved by: https://github.com/ngimel
Summary:
1. use pytree to allow any input format for `make_graphed_callables`
2. add an `allow_unused_input` argument for `make_graphed_callables` (see the sketch below)
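A hedged usage sketch (the module and shapes are placeholders; `allow_unused_input` follows the signature added here):
```python
import torch

model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(8, 16, device="cuda", requires_grad=True)

# allow_unused_input=True tolerates sample inputs that do not
# contribute to the output when the backward graph is captured.
graphed_model = torch.cuda.make_graphed_callables(
    model, (x,), allow_unused_input=True
)
y = graphed_model(x)
```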
Test Plan: buck2 test mode/dev-nosan //caffe2/test:cuda -- --print-passing-details
Differential Revision: D42077976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90941
Approved by: https://github.com/ngimel
Preparation for the next PR in this stack: #89559.
I replaced
- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).
There were a few instances where the result of `torch.equal` was used directly. In those cases I replaced it with `(... == ...).all().item()`, sometimes dropping the `.item()` depending on the context.
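For illustration, the replacement pattern (a minimal sketch using the internal `TestCase`):
```python
import torch
from torch.testing._internal.common_utils import TestCase

class ExampleTest(TestCase):
    def test_exact_equality(self):
        actual = torch.arange(3)
        expected = torch.arange(3)
        # Replaces self.assertTrue(torch.equal(actual, expected)):
        self.assertEqual(actual, expected, rtol=0, atol=0, exact_device=True)
        # Replaces assert torch.equal(actual, expected):
        torch.testing.assert_close(actual, expected, rtol=0, atol=0)
```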
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
Fixes #87894
This PR adds a warning if the captured graph is empty (consists of zero nodes).
An example snippet where it would be useful:
```python
import torch

x = torch.randn(10)  # CPU tensors here, so no CUDA work gets captured
z = torch.zeros(10)
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    z = x * x
# Warn user: the captured graph contains zero nodes
```
and in #87894
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88754
Approved by: https://github.com/ezyang
Summary:
Improved the `roundup_power2_divisions` knob so it allows better control of rounding in the PyTorch CUDA caching allocator.
This new version allows setting the number of divisions per power-of-two interval, starting from 1MB and ending at 64GB and above. An example use case is when rounding is desirable for small allocations, but there are also very large persistent allocations that would not benefit from rounding and would take up extra space.
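A hedged sketch of the interval form (the bracket syntax follows the allocator's `PYTORCH_CUDA_ALLOC_CONF` documentation; the values are placeholders):
```python
import os

# Must be set before the CUDA caching allocator is first used.
# Keys are the upper bound of each power-of-two interval in MB;
# ">" covers everything above the last listed interval.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"
)

import torch  # noqa: E402

x = torch.empty(300 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```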
Test Plan: Tested locally
Differential Revision: D40103909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290
Approved by: https://github.com/zdevito
As per #87979, `custom_bwd` seems to unconditionally use `torch.float16` in `torch.autograd.Function.backward`, regardless of the `dtype` used in the forward.
Changes:
- store the `dtype` in `args[0]`
- update tests to confirm the dtype of intermediate result tensors that are outputs of autocast-compatible `torch` functions
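A usage sketch, adapted from the standard `custom_fwd`/`custom_bwd` pattern (the Function body is illustrative):
```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class MyMM(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a.mm(b)

    @staticmethod
    @custom_bwd  # with this fix, runs under the forward's autocast dtype
    def backward(ctx, grad):
        a, b = ctx.saved_tensors
        return grad.mm(b.t()), a.t().mm(grad)

a = torch.randn(4, 4, device="cuda", requires_grad=True)
b = torch.randn(4, 4, device="cuda", requires_grad=True)
with torch.autocast("cuda", dtype=torch.bfloat16):
    MyMM.apply(a, b).sum().backward()  # no longer forced to float16
```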
cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88029
Approved by: https://github.com/ngimel
We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator took between snapshots. This PR adds a simple fixed-size buffer that records the major actions the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing periodic snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.
We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.
As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).
This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
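A hedged sketch of recording and reading the trace through the private snapshot API (kwargs and key names of this API have shifted across versions, so treat them as illustrative):
```python
import torch

# Enable allocator history recording; trace entries accumulate in the
# fixed-size buffer described above.
torch.cuda.memory._record_memory_history(True)

x = torch.randn(1024, 1024, device="cuda")
del x

snap = torch.cuda.memory._snapshot()
# Alongside per-segment state, the snapshot now carries the recent
# allocator actions (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM,
# SNAPSHOT); "device_traces" is the key used by recent versions.
traces = snap.get("device_traces", [])
```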
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
Sometimes the driving process wants to save memory snapshots but isn't Python.
Add a simple API to turn it on without Python stack traces. It still
saves to the same format for the visualization and summary scripts, using
the C++ Pickler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86190
Approved by: https://github.com/ezyang
Summary:
- expose a Python call to set the allocator settings; it uses the same format as the value of `PYTORCH_CUDA_ALLOC_CONF`
- keep the implementation contained within the cpp file to avoid increasing build times; only expose a function to apply the settings
- make some of the allocator config methods public; it now looks more like a singleton (a usage sketch follows)
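A minimal sketch of the exposed call (name per the private `torch.cuda.memory` API added here; the knob shown is a placeholder):
```python
import torch

# Same string format as the PYTORCH_CUDA_ALLOC_CONF environment variable,
# applied at runtime instead of at startup.
torch.cuda.memory._set_allocator_settings("max_split_size_mb:128")
```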
Test Plan: added the unit test
Differential Revision: D39487522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84970
Approved by: https://github.com/zdevito
Fixes #84614
Prior to this PR, CUDAGraph did not store the RNG seed, which is why `torch.cuda.manual_seed(new_seed)` would only reset the offset but not update the seed at all, keeping whatever value was used during graph capture.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84967
Approved by: https://github.com/ngimel
Added ROCm support for the test_lazy_init unit test by including a condition on TEST_WITH_ROCM to use HIP_VISIBLE_DEVICES in place of CUDA_VISIBLE_DEVICES.
This is needed because HIP_VISIBLE_DEVICES is set when running the single-GPU tests in CI: a47bc96fb7/.jenkins/pytorch/test.sh (L38), but this test sets CUDA_VISIBLE_DEVICES, which takes lower precedence than HIP_VISIBLE_DEVICES on ROCm.
**Testing Logs (to show behavior difference)**
```
12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='0': 0
12:40:41 Aug 30 11:40:41 1
12:40:41 Aug 30 11:40:41 CUDA_VISIBLE_DEVICES='32': 32
12:40:41 Aug 30 11:40:41 1
12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='0': 0
12:40:41 Aug 30 11:40:41 1
12:40:41 Aug 30 11:40:41 HIP_VISIBLE_DEVICES='32': 32
12:40:41 Aug 30 11:40:41 0
```
**Passing UT**
```
Aug 30 17:03:15 test_lazy_init (__main__.TestCuda)
Aug 30 17:03:17 Validate that no CUDA calls are made during import torch call ... ok (2.471s)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84333
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
There are conflicts between `torch.clear_autocast_cache()` and `cudaMallocAsync` from #82682.
Moreover, autocast caching is not reasonable during training, which is the main target of `make_graphed_callables`.
cc @eqy @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84289
Approved by: https://github.com/ngimel
This PR updates [#73040](https://github.com/pytorch/pytorch/pull/73040).
With these changes, PyTorch compiles successfully with ROCm when `NDEBUG` is enabled.
Solution:
For HIP device code we keep `__device__ __assert_fail()`,
and for host-side compilation we use the `__assert_fail()` from the glibc library.
Tested the code by compiling with the steps below:
```
python3 tools/amd_build/build_amd.py
python3 setup.py develop --cmake-only
cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build
cmake --build build
```
The UT test_fixed_cuda_assert_async is still skipped due to performance overhead.
cc @jithunnair-amd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81790
Approved by: https://github.com/shintaro-iwasaki, https://github.com/jeffdaily, https://github.com/malfet
Record stack trace information for each allocated segment in the allocator.
It takes around 1.5us to record 50 stack frames of context.
Since invoking a PyTorch operator takes around 8us, this adds minimal overhead, but we still leave it disabled by default so that we can test it more on real workloads first.
Stack information is kept both for allocated blocks and for the last allocation that used now-inactive blocks. We could potentially also keep around the _first_ allocation that caused the block to be allocated from cuda.
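A hedged sketch of enabling recording and inspecting per-block frames (key names follow the private snapshot format and may differ by version):
```python
import torch

torch.cuda.memory._record_memory_history(True)
x = torch.randn(1024, 1024, device="cuda")

snap = torch.cuda.memory._snapshot()
segments = snap["segments"] if isinstance(snap, dict) else snap
for seg in segments:
    for block in seg.get("blocks", []):
        # The captured Python stack for the allocation, when available.
        frames = block.get("frames", [])
```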
Potential Followups:
* stack frame entries are small (16 bytes), but the list of Frames is not compressed even though most frames will share some entries. So far this doesn't produce huge dumps (7MB for one real workload that uses all memory on the GPU), but it could be made much smaller through compression.
* Code to format the information is slow (a few seconds) because it uses Python and FlameGraph.pl
* Things allocated during the backward pass have no stack frames because they are run on another C++ thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82146
Approved by: https://github.com/albanD
### Description
Since the major changes for `_TypedStorage` and `_UntypedStorage` are now complete, they can be renamed to be public.
`TypedStorage._untyped()` is renamed to `TypedStorage.untyped()`.
Documentation for storages is improved as well.
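For illustration, the rename in use (a minimal sketch):
```python
import torch

x = torch.arange(4)
s = x.storage()   # TypedStorage, now public
u = s.untyped()   # renamed from the private _untyped()
```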
### Issue
Fixes #82436
### Testing
N/A
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82438
Approved by: https://github.com/ezyang
cuDNN via the V8 API supports `bfloat16` on Ampere (compute capability `>= (8, 0)`) but not on older devices, which might be unexpected given the current test settings. This PR fixes some dispatching to check the device capability before dispatching `bfloat16` convs and adjusts the expected-failure conditions for the autocast test.
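An illustrative capability guard matching the dispatch check (shapes are placeholders):
```python
import torch

if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0):
    # bfloat16 convs only dispatch to cuDNN on Ampere or newer
    x = torch.randn(1, 3, 32, 32, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(8, 3, 3, 3, device="cuda", dtype=torch.bfloat16)
    y = torch.nn.functional.conv2d(x, w)
```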
CC @xwang233 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81139
Approved by: https://github.com/ngimel