pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	d28e9e145b	Revert "[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147 )" This reverts commit `49c41b87a2`. Reverted https://github.com/pytorch/pytorch/pull/79147 on behalf of https://github.com/janeyx99 due to Broke 11.3 builds on trunk `49c41b87a2`	2022-06-10 20:55:10 +00:00
jjsjann123	49c41b87a2	[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147 ) Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Bug fixes and minor refactor Squashed commits to WAR github API Commits that are actually in this PR from the devel branch: ``` 4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725) 02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753) 8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746) ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738) 02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745) 465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744) 26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742) 856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736) 1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732) de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733) fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728) b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729) 5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727) ``` RUN_TORCHBENCH: nvfuser Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147 Approved by: https://github.com/davidberard98	2022-06-10 19:37:42 +00:00
Akshay Parashar	28f87b9cf9	[Static Runtime] Fix aten::clone out variant (#78297 ) (#78322 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/78297 A clone that follows expand/expand_as crashes due to the memoryOverlap check in the native copy_ method. Refer to T118519310 for more details. Crashing test case: a = tensor(3,1) // strides = (1,1) b = tensor(3,2) // strides = (2,1) temp = a.expand_as(b) // creates temp with shape (3,2) and strides (1,0) temp.clone() // crashes on copy_ due to memoryOverlap Fix: Disable the out variant for the expanded tensor. - Calls native clone instead of out variant for clone dealing with expanded tensors - Added test case for both clone variants (out and native clones) - Increased the tensor size for memory planner test case to trigger dynamic allocation Test Plan: buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest Differential Revision: D36672180 Pull Request resolved: https://github.com/pytorch/pytorch/pull/78322 Approved by: https://github.com/mikeiovine	2022-06-02 21:06:59 +00:00
Max Podkorytov	ebfc70f37a	[static-runtime] out variant for aten::mean (#78161 ) Summary: As subject Test Plan: Added unit tests Differential Revision: D36614633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/78161 Approved by: https://github.com/mikeiovine	2022-06-02 20:56:42 +00:00
Max Podkorytov	2679755bdc	[static-runtime] out variant for aten::max (#78271 ) Summary: Previously the op was auto-generated but it only covered the pointwise overload of aten::max. This adds support for reduction, overall and along a dim Test Plan: Added a unit test Differential Revision: D36656378 Pull Request resolved: https://github.com/pytorch/pytorch/pull/78271 Approved by: https://github.com/mikeiovine	2022-05-26 23:29:27 +00:00
Hui Guo	d12bf9fd75	[static_runtime] Add auto-generated view ops (#77106 ) Summary: This includes the generated view ops from D36258767. Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest Differential Revision: D36258968 Pull Request resolved: https://github.com/pytorch/pytorch/pull/77106 Approved by: https://github.com/alanwaketan, https://github.com/tenpercent	2022-05-26 03:13:59 +00:00
mikeiovine	56c23f5633	[SR] Out variant for embedding_bag_byte_unpack Pull Request resolved: https://github.com/pytorch/pytorch/pull/77661 Add an out variant and wrapper in static runtime. I just added the declaration with the others in `qembeddingbag.h` for now (rather than properly adding the out variant to the torch library). This can be fixed in a followup. Differential Revision: [D36449840](https://our.internmc.facebook.com/intern/diff/D36449840/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36449840/)! Approved by: https://github.com/tenpercent	2022-05-25 23:24:11 +00:00
mikeiovine	2ae3c59e4b	[SR] Remove linear/relu fusion Pull Request resolved: https://github.com/pytorch/pytorch/pull/77620 Apparently, this is not implemented in fbgemm, so it's strictly worse than using NNC. Differential Revision: [D36431811](https://our.internmc.facebook.com/intern/diff/D36431811/) Approved by: https://github.com/hlu1	2022-05-23 21:46:27 +00:00
Hao Lu	c60d2ef4eb	[StaticRuntime] Replace Permute with copy version only when it's followed by reshape or flatten (#77832 ) Reviewed By: mikeiovine Differential Revision: D36466622 Pull Request resolved: https://github.com/pytorch/pytorch/pull/77832 Approved by: https://github.com/mikeiovine	2022-05-20 03:14:01 +00:00
jjsjann123	a2802ad0b9	Upstream master bump 0513 (#77471 ) Updating nvfuser code base. This should fix the indexing issue observed in https://github.com/pytorch/vision/issues/6015. Running tests locally as well. Will update the description here at a later point @bypass-github-export-checks Pull Request resolved: https://github.com/pytorch/pytorch/pull/77471 Approved by: https://github.com/seemethere, https://github.com/eellison	2022-05-18 11:48:50 -07:00
mikeiovine	02713221e3	[SR] Fuse clamp/nan_to_num Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094 Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor. Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)! Approved by: https://github.com/navahgar	2022-05-10 23:33:59 +00:00
Mike Iovine	849984a2cd	[SR] Sigmoid out variant calls fast_sigmoid (#75661 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75661 `fast_sigmoid` is a variant of sigmoid in NNC that is implemented in terms of `fast_tanh` (which is a fast rational function approximation). ghstack-source-id: 155604086 Reviewed By: navahgar, hlu1 Differential Revision: D35481390 fbshipit-source-id: 1d64b5c375539f3b2461a1f3d9b86cd696eae7a1 (cherry picked from commit 8106c2512b8d7b373cb6545a43c3e8fc04805c4b)	2022-05-06 00:14:30 +00:00
Mike Iovine	1fed6b7559	[SR] Eliminate extra permutes around softmax calls (#76391 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76391 I've seen this pattern in many important internal models: ``` x = torch.permute(a, [0, 2, 1]) y = torch.softmax(x, 2) z = torch.permute(y, [0, 2, 1]) ``` This is equivalent to ``` z = torch.softmax(a, 1) ``` The `permute` ops can degrade performance, especially if copy variants are on. Add another pattern to our `EliminateExtraPermuteOpsPass` to handle this. ghstack-source-id: 155466506 Test Plan: New unit tests Reviewed By: navahgar, huiguoo Differential Revision: D35938289 fbshipit-source-id: 398b5528077b0b3f1c6fc5544e483803e96d68e9 (cherry picked from commit d742abd094d1fef23ca6a34703d97a6da2d14bd1)	2022-05-04 23:08:49 +00:00
Mike Iovine	cac2733af1	[SR] Codegen for aten::clamp (#76340 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76340 NNC kernel for `clamp` scalar case ghstack-source-id: 155466507 Reviewed By: navahgar, huiguoo Differential Revision: D35904019 fbshipit-source-id: e4115757f7e2cbdf364b88be3f599dfc3028750f (cherry picked from commit bdc4b918bc5a14490f46c79793f764b28c18388f)	2022-05-04 23:08:49 +00:00
Wang, Eikan	429a80dded	[NNC] Lowering function generates the output buffer with the specified stride (#76529 ) Summary: Pass stride information to lowering function to generate the output buffer with proper memory layout. Pull Request resolved: https://github.com/pytorch/pytorch/pull/76529 Reviewed By: ZolotukhinM Differential Revision: D36116712 Pulled By: IvanKobzarev fbshipit-source-id: d3901f756b3710ecce172d6db3ecb0b7c12fb929 (cherry picked from commit b6cd53c91c01db36ea0e99167dc0ce0ae1d3aa23)	2022-05-04 20:04:22 +00:00
Hui Guo	bcddd4ab3e	[Static Runtime] Add auto generated unstructured ops (#76398 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76398 This diff adds the large files that include the newly generated ops from D34913736. Refer to the base diff for more details. Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest Reviewed By: mikeiovine, tenpercent Differential Revision: D35945633 fbshipit-source-id: 53497bd5c490a57ea1521837762f740deb42bfd8 (cherry picked from commit e0fbdcb0bf09f5c192430f95f450c0a946c80074)	2022-05-04 19:34:19 +00:00
Mike Iovine	fc64dbdc01	[SR] Fuse quantized linear/relu (#75775 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775 fbgemm kernels already implement the fused kernel, no reason not to use it ghstack-source-id: 155450342 Test Plan: New unit tests Reviewed By: navahgar Differential Revision: D35633297 fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48 (cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)	2022-05-04 19:01:14 +00:00
Michael Suo	fb0f285638	[lint] upgrade mypy to latest version Fixes https://github.com/pytorch/pytorch/issues/75927. Had to fix some bugs and add some ignores. To check if clean: ``` lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753 Approved by: https://github.com/malfet	2022-05-03 20:51:34 +00:00
PyTorch MergeBot	3d7428d9ac	Revert "[lint] upgrade mypy to latest version" This reverts commit `9bf18aab94`. Reverted https://github.com/pytorch/pytorch/pull/76753 on behalf of https://github.com/suo	2022-05-03 20:01:18 +00:00
Michael Suo	9bf18aab94	[lint] upgrade mypy to latest version Fixes https://github.com/pytorch/pytorch/issues/75927. Had to fix some bugs and add some ignores. To check if clean: ``` lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753 Approved by: https://github.com/malfet	2022-05-03 19:43:28 +00:00
Mike Iovine	b02b3f25db	[SR] Quick hack to eliminate no-op slice (#75774 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774 `list[0:]` is a no-op. This should really be eliminated on the modeling side; implement it as a graph pass for now until we can get this into prod models. Test Plan: New unit tests Reviewed By: navahgar Differential Revision: D35632947 fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d (cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)	2022-05-03 19:29:46 +00:00
Mike Iovine	3fa77fa51a	[SR] Fix quantized linear tests not managing outputs (#75776 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75776 The output was returned directly instead of a clone, so the output of the relevant op would not be managed. ghstack-source-id: 154935103 Test Plan: CI Reviewed By: navahgar Differential Revision: D35633469 fbshipit-source-id: 7b08b7368e0349a12abf8802a4c625ffecdc5abb (cherry picked from commit 24bed9ba4da39cff7f3b40f5e49dfded2552b373)	2022-04-27 16:38:54 +00:00
Aaron Enye Shi	09a5b075fe	[libkineto] Re-enable user-annotations in PyTorch (#75601 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75601 User annotations were previously pushed down to the GPU timelines but were disabled during a refactoring some time back. This patch re-enables them in PyTorch Profiler. Test Plan: CI Tests Reviewed By: chaekit Differential Revision: D34591916 Pulled By: aaronenyeshi fbshipit-source-id: 3f4d5327b391725f4ce4e3eb16740bac2cd1c618 (cherry picked from commit 4bc07174dfef8fb2ffbefba224773a4618ed203a)	2022-04-26 23:54:22 +00:00
zengk95	1d55518198	Revert "[nnc] Strides to Tensor (#72962 )" This reverts commit `939060925f`. Fixes https://github.com/pytorch/vision/issues/5873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/76332 Approved by: https://github.com/seemethere	2022-04-25 19:50:00 +00:00
Ivan Kobzarev	939060925f	[nnc] Strides to Tensor (#72962 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72962 Test Plan: Imported from OSS Reviewed By: ZolotukhinM, cpuhrsch Differential Revision: D34589306 Pulled By: IvanKobzarev fbshipit-source-id: ecee5249760ecc0c8b2edb1842b90218899bc944 (cherry picked from commit 9e310c4c67389da30da89126d838ffe3864aba6f)	2022-04-23 19:35:15 +00:00
Ansha Yu	ee636e2fd1	[sr] remove max_indices argument of embedding_bag when unnecessary (#75993 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993 Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0 {F723683014} More details in https://fb.quip.com/MKumAjz1YD4 (`1f47a80e88`)a#temp:C:FPD3 (`ecd5567980`)e5a0871ae5d481286b511ef7 The last 3 outputs of embedding_bag are unused in the graph: P495814049. * max_indices output isn't necessary for the main output, so remove it when it's not used in the graph. * offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph. * bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now. Test Plan: `./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only` Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604` Before: I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78 0.11156 ms. 99.0457%. aten::embedding_bag (52 nodes, out variant) After: I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2 0.0789221 ms. 98.7096%. static_runtime::embedding_bag (52 nodes, out variant) * Ads prod canary: https://www.internalfb.com/intern/ads/canary/443002539593035806/ * 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594` https://www.internalfb.com/intern/servicelab/602875732/ * 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594` https://www.internalfb.com/intern/servicelab/1002874745/ Reviewed By: mikeiovine Differential Revision: D35726594 fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a (cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)	2022-04-22 15:36:35 +00:00
vfdev-5	6593d293f7	Added functorch to functional_autograd_benchmark Description: - Following https://github.com/pytorch/functorch/issues/497 adding an option to run benchmarks with functorch and compare to original functional autograd results. Running the benchmark we get below table: <details> <summary> Table </summary> ``` \| model \| task \| mean \| var \| \| -- \| -- \| -- \| -- \| \| resnet18 \| vjp \| 0.03826599195599556 \| 4.3332115637895186e-06 \| \| resnet18 \| functorch vjp \| 0.037201929837465286 \| 6.139693198292662e-09 \| \| resnet18 \| vhp \| 0.2202976644039154 \| 2.8687209052691287e-08 \| \| resnet18 \| functorch vhp \| 0.22117868065834045 \| 4.108771278765744e-08 \| \| resnet18 \| jvp \| 0.18679651618003845 \| 1.832455254202614e-08 \| \| resnet18 \| functorch jvp \| 0.05305683612823486 \| 1.6690266946284282e-08 \| \| fcn_resnet \| vjp \| 0.6071907877922058 \| 7.436695454998699e-07 \| \| fcn_resnet \| functorch vjp \| 0.6115708947181702 \| 1.121692207561864e-06 \| \| fcn_resnet \| vhp \| 3.419469118118286 \| 0.020633839070796967 \| \| fcn_resnet \| jvp \| 2.5421929359436035 \| 3.1765587209520163e-06 \| \| fcn_resnet \| functorch jvp \| 0.7628333568572998 \| 1.4555752159139956e-07 \| \| detr \| vjp \| 0.19494840502738953 \| 1.9122715457342565e-05 \| \| detr \| vhp \| 1.1664292812347412 \| 0.000948643428273499 \| \| detr \| jvp \| 0.9990308880805969 \| 1.0214127541985363e-05 \| \| ppl_simple_reg \| vjp \| 0.0007535457843914628 \| 6.024204690646684e-09 \| \| ppl_simple_reg \| functorch vjp \| 0.0016954183811321855 \| 1.160151974488599e-08 \| \| ppl_simple_reg \| vhp \| 0.0011888503795489669 \| 5.93119386937957e-10 \| \| ppl_simple_reg \| functorch vhp \| 0.0026826143730431795 \| 1.6787025103326414e-08 \| \| ppl_simple_reg \| jvp \| 0.001067900680936873 \| 7.409912128331086e-10 \| \| ppl_simple_reg \| functorch jvp \| 0.002065300941467285 \| 9.710328185974504e-08 \| \| ppl_simple_reg \| hvp \| 0.001212477684020996 \| 1.974137298077494e-09 \| \| 
ppl_simple_reg \| functorch hvp \| 0.00482442369684577 \| 2.327668653379078e-07 \| \| ppl_simple_reg \| jacobian \| 0.0009108781814575195 \| 3.489469158068914e-09 \| \| ppl_simple_reg \| functorch jacobian \| 0.0019866942893713713 \| 1.938326299466553e-08 \| \| ppl_simple_reg \| hessian \| 0.005053090862929821 \| 3.370298600202659e-07 \| \| ppl_simple_reg \| functorch hessian \| 0.006374978926032782 \| 7.556796077778927e-08 \| \| ppl_simple_reg \| hessian_fwdrev \| 0.0036706924438476562 \| 1.996075527088692e-09 \| \| ppl_simple_reg \| functorch hessian_fwdrev \| 0.0058908225037157536 \| 7.548283775804521e-08 \| \| ppl_simple_reg \| hessian_revrev \| 0.0015769004821777344 \| 1.5754418214442012e-08 \| \| ppl_simple_reg \| functorch hessian_revrev \| 0.0041002752259373665 \| 6.713568723171193e-08 \| \| ppl_simple_reg \| jacfwd \| 0.0018048763740807772 \| 2.7375660849315864e-08 \| \| ppl_simple_reg \| functorch jacfwd \| 0.002047991845756769 \| 2.432247070416338e-09 \| \| ppl_simple_reg \| jacrev \| 0.0009733677143231034 \| 1.0078769818733235e-08 \| \| ppl_simple_reg \| functorch jacrev \| 0.0021971464157104492 \| 1.2729884701911942e-08 \| \| ppl_robust_reg \| vjp \| 0.005820560269057751 \| 8.582588151284654e-08 \| \| ppl_robust_reg \| functorch vjp \| 0.00796132069081068 \| 9.663100541956737e-09 \| \| ppl_robust_reg \| vhp \| 0.009825301356613636 \| 2.0081762386325863e-07 \| \| ppl_robust_reg \| functorch vhp \| 0.014890861697494984 \| 4.558066279969353e-07 \| \| ppl_robust_reg \| jvp \| 0.008297419175505638 \| 2.9454400873873965e-07 \| \| ppl_robust_reg \| functorch jvp \| 0.008052706718444824 \| 7.120377176761394e-08 \| \| ppl_robust_reg \| hvp \| 0.015414690598845482 \| 7.42123745567369e-07 \| \| ppl_robust_reg \| functorch hvp \| 0.02699306048452854 \| 1.4650488537881756e-06 \| \| ppl_robust_reg \| jacobian \| 0.006207776255905628 \| 1.7068457225377642e-07 \| \| ppl_robust_reg \| functorch jacobian \| 0.009173822589218616 \| 1.2214455580306094e-07 \| \| 
ppl_robust_reg \| hessian \| 0.04670915752649307 \| 1.4299343092716299e-05 \| \| ppl_robust_reg \| functorch hessian \| 0.02337808534502983 \| 3.0397418413485866e-06 \| \| ppl_robust_reg \| hessian_fwdrev \| 0.024229884147644043 \| 2.0425247839739313e-06 \| \| ppl_robust_reg \| functorch hessian_fwdrev \| 0.022021746262907982 \| 3.512146236062108e-07 \| \| ppl_robust_reg \| hessian_revrev \| 0.012355780228972435 \| 7.090877147675201e-07 \| \| ppl_robust_reg \| functorch hessian_revrev \| 0.013960313983261585 \| 6.326549737423193e-07 \| \| ppl_robust_reg \| jacfwd \| 0.008112502284348011 \| 2.88503088086145e-08 \| \| ppl_robust_reg \| functorch jacfwd \| 0.008947920985519886 \| 4.2070990247111695e-08 \| \| ppl_robust_reg \| jacrev \| 0.00635871896520257 \| 1.3403841592207755e-07 \| \| ppl_robust_reg \| functorch jacrev \| 0.009123563766479492 \| 2.677554675756255e-07 \| \| wav2letter \| vjp \| 0.02078995667397976 \| 2.1110793113621185e-06 \| \| wav2letter \| functorch vjp \| 0.019202351570129395 \| 9.210506135559626e-09 \| \| wav2letter \| vhp \| 0.05997290462255478 \| 8.558587616391833e-09 \| \| wav2letter \| functorch vhp \| 0.06035261228680611 \| 1.6448565842708263e-09 \| \| wav2letter \| jvp \| 0.04507789760828018 \| 1.5771547401399744e-09 \| \| wav2letter \| functorch jvp \| 0.013057494536042213 \| 3.804750292601966e-09 \| \| deepspeech \| vjp \| 0.3648746609687805 \| 1.5359055396402255e-05 \| \| transformer \| vjp \| 0.05496881157159805 \| 1.242562319703211e-08 \| \| transformer \| functorch vjp \| 0.057835936546325684 \| 2.6113376350167528e-08 \| \| transformer \| vhp \| 0.18313491344451904 \| 7.226336151688884e-08 \| \| transformer \| jvp \| 0.13924935460090637 \| 1.6989159234981344e-07 \| \| multiheadattn \| vjp \| 0.0014708995586261153 \| 3.710916729460223e-08 \| \| multiheadattn \| functorch vjp \| 0.002404856728389859 \| 2.1910574687922235e-08 \| \| multiheadattn \| vhp \| 0.003382015274837613 \| 5.3098595742540056e-08 \| \| multiheadattn \| functorch 
vhp \| 0.005340623669326305 \| 5.897558708056749e-08 \| \| multiheadattn \| jvp \| 0.0027526854537427425 \| 3.508620949332908e-08 \| \| multiheadattn \| functorch jvp \| 0.0022981404326856136 \| 1.327894807445773e-07 \| ``` </details> <details> <summary> Stdout </summary> ``` Found functorch: 0.2.0a0+386a541 Results for model resnet18 on task vjp: 0.03826599195599556s (var: 4.3332115637895186e-06) Results for model resnet18 on task vjp using Functorch: 0.037201929837465286s (var: 6.139693198292662e-09) Results for model resnet18 on task vhp: 0.2202976644039154s (var: 2.8687209052691287e-08) Results for model resnet18 on task vhp using Functorch: 0.22117868065834045s (var: 4.108771278765744e-08) Results for model resnet18 on task jvp: 0.18679651618003845s (var: 1.832455254202614e-08) Results for model resnet18 on task jvp using Functorch: 0.05305683612823486s (var: 1.6690266946284282e-08) Results for model fcn_resnet on task vjp: 0.6071907877922058s (var: 7.436695454998699e-07) Results for model fcn_resnet on task vjp using Functorch: 0.6115708947181702s (var: 1.121692207561864e-06) Results for model fcn_resnet on task vhp: 3.419469118118286s (var: 0.020633839070796967) Failed model using Functorch: fcn_resnet, task: vhp, Error message: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 47.46 GiB total capacity; 45.62 GiB already allocated; 5.31 MiB free; 46.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Results for model fcn_resnet on task jvp: 2.5421929359436035s (var: 3.1765587209520163e-06) Results for model fcn_resnet on task jvp using Functorch: 0.7628333568572998s (var: 1.4555752159139956e-07) Results for model detr on task vjp: 0.19494840502738953s (var: 1.9122715457342565e-05) Failed model using Functorch: detr, task: vjp, Error message: Cannot access data pointer of Tensor that doesn't have storage Results for model detr on task vhp: 1.1664292812347412s (var: 0.000948643428273499) Failed model using Functorch: detr, task: vhp, Error message: Cannot access data pointer of Tensor that doesn't have storage Results for model detr on task jvp: 0.9990308880805969s (var: 1.0214127541985363e-05) Failed model using Functorch: detr, task: jvp, Error message: Trying to use forward AD with _cdist_forward that does not support it because it has not been implemented yet. Please file an issue to PyTorch at https://github.com/pytorch/pytorch/issues/new?template=feature-request.yml so that we can prioritize its implementation. 
Results for model ppl_simple_reg on task vjp: 0.0007535457843914628s (var: 6.024204690646684e-09) Results for model ppl_simple_reg on task vjp using Functorch: 0.0016954183811321855s (var: 1.160151974488599e-08) Results for model ppl_simple_reg on task vhp: 0.0011888503795489669s (var: 5.93119386937957e-10) Results for model ppl_simple_reg on task vhp using Functorch: 0.0026826143730431795s (var: 1.6787025103326414e-08) Results for model ppl_simple_reg on task jvp: 0.001067900680936873s (var: 7.409912128331086e-10) Results for model ppl_simple_reg on task jvp using Functorch: 0.002065300941467285s (var: 9.710328185974504e-08) Results for model ppl_simple_reg on task hvp: 0.001212477684020996s (var: 1.974137298077494e-09) Results for model ppl_simple_reg on task hvp using Functorch: 0.00482442369684577s (var: 2.327668653379078e-07) Results for model ppl_simple_reg on task jacobian: 0.0009108781814575195s (var: 3.489469158068914e-09) Results for model ppl_simple_reg on task jacobian using Functorch: 0.0019866942893713713s (var: 1.938326299466553e-08) Results for model ppl_simple_reg on task hessian: 0.005053090862929821s (var: 3.370298600202659e-07) Results for model ppl_simple_reg on task hessian using Functorch: 0.006374978926032782s (var: 7.556796077778927e-08) Results for model ppl_simple_reg on task hessian_fwdrev: 0.0036706924438476562s (var: 1.996075527088692e-09) Results for model ppl_simple_reg on task hessian_fwdrev using Functorch: 0.0058908225037157536s (var: 7.548283775804521e-08) Results for model ppl_simple_reg on task hessian_revrev: 0.0015769004821777344s (var: 1.5754418214442012e-08) Results for model ppl_simple_reg on task hessian_revrev using Functorch: 0.0041002752259373665s (var: 6.713568723171193e-08) Results for model ppl_simple_reg on task jacfwd: 0.0018048763740807772s (var: 2.7375660849315864e-08) Results for model ppl_simple_reg on task jacfwd using Functorch: 0.002047991845756769s (var: 2.432247070416338e-09) Results for model 
ppl_simple_reg on task jacrev: 0.0009733677143231034s (var: 1.0078769818733235e-08) Results for model ppl_simple_reg on task jacrev using Functorch: 0.0021971464157104492s (var: 1.2729884701911942e-08) Results for model ppl_robust_reg on task vjp: 0.005820560269057751s (var: 8.582588151284654e-08) Results for model ppl_robust_reg on task vjp using Functorch: 0.00796132069081068s (var: 9.663100541956737e-09) Results for model ppl_robust_reg on task vhp: 0.009825301356613636s (var: 2.0081762386325863e-07) Results for model ppl_robust_reg on task vhp using Functorch: 0.014890861697494984s (var: 4.558066279969353e-07) Results for model ppl_robust_reg on task jvp: 0.008297419175505638s (var: 2.9454400873873965e-07) Results for model ppl_robust_reg on task jvp using Functorch: 0.008052706718444824s (var: 7.120377176761394e-08) Results for model ppl_robust_reg on task hvp: 0.015414690598845482s (var: 7.42123745567369e-07) Results for model ppl_robust_reg on task hvp using Functorch: 0.02699306048452854s (var: 1.4650488537881756e-06) Results for model ppl_robust_reg on task jacobian: 0.006207776255905628s (var: 1.7068457225377642e-07) Results for model ppl_robust_reg on task jacobian using Functorch: 0.009173822589218616s (var: 1.2214455580306094e-07) Results for model ppl_robust_reg on task hessian: 0.04670915752649307s (var: 1.4299343092716299e-05) Results for model ppl_robust_reg on task hessian using Functorch: 0.02337808534502983s (var: 3.0397418413485866e-06) Results for model ppl_robust_reg on task hessian_fwdrev: 0.024229884147644043s (var: 2.0425247839739313e-06) Results for model ppl_robust_reg on task hessian_fwdrev using Functorch: 0.022021746262907982s (var: 3.512146236062108e-07) Results for model ppl_robust_reg on task hessian_revrev: 0.012355780228972435s (var: 7.090877147675201e-07) Results for model ppl_robust_reg on task hessian_revrev using Functorch: 0.013960313983261585s (var: 6.326549737423193e-07) Results for model ppl_robust_reg on task jacfwd: 
0.008112502284348011s (var: 2.88503088086145e-08) Results for model ppl_robust_reg on task jacfwd using Functorch: 0.008947920985519886s (var: 4.2070990247111695e-08) Results for model ppl_robust_reg on task jacrev: 0.00635871896520257s (var: 1.3403841592207755e-07) Results for model ppl_robust_reg on task jacrev using Functorch: 0.009123563766479492s (var: 2.677554675756255e-07) Results for model wav2letter on task vjp: 0.02078995667397976s (var: 2.1110793113621185e-06) Results for model wav2letter on task vjp using Functorch: 0.019202351570129395s (var: 9.210506135559626e-09) Results for model wav2letter on task vhp: 0.05997290462255478s (var: 8.558587616391833e-09) Results for model wav2letter on task vhp using Functorch: 0.06035261228680611s (var: 1.6448565842708263e-09) Results for model wav2letter on task jvp: 0.04507789760828018s (var: 1.5771547401399744e-09) Results for model wav2letter on task jvp using Functorch: 0.013057494536042213s (var: 3.804750292601966e-09) Results for model deepspeech on task vjp: 0.3648746609687805s (var: 1.5359055396402255e-05) Failed model using Functorch: deepspeech, task: vjp, Error message: Cannot access storage of TensorWrapper Results for model transformer on task vjp: 0.05496881157159805s (var: 1.242562319703211e-08) Results for model transformer on task vjp using Functorch: 0.057835936546325684s (var: 2.6113376350167528e-08) Results for model transformer on task vhp: 0.18313491344451904s (var: 7.226336151688884e-08) Failed model using Functorch: transformer, task: vhp, Error message: bad optional access Results for model transformer on task jvp: 0.13924935460090637s (var: 1.6989159234981344e-07) Failed model using Functorch: transformer, task: jvp, Error message: Trying to use forward AD with embedding that does not support it because it has not been implemented yet. Please file an issue to PyTorch at https://github.com/pytorch/pytorch/issues/new?template=feature-request.yml so that we can prioritize its implementation. 
Results for model multiheadattn on task vjp: 0.0014708995586261153s (var: 3.710916729460223e-08) Results for model multiheadattn on task vjp using Functorch: 0.002404856728389859s (var: 2.1910574687922235e-08) Results for model multiheadattn on task vhp: 0.003382015274837613s (var: 5.3098595742540056e-08) Results for model multiheadattn on task vhp using Functorch: 0.005340623669326305s (var: 5.897558708056749e-08) Results for model multiheadattn on task jvp: 0.0027526854537427425s (var: 3.508620949332908e-08) Results for model multiheadattn on task jvp using Functorch: 0.0022981404326856136s (var: 1.327894807445773e-07) ``` </details> All functorch errors are reported in its repository. cc @zou3519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/75689 Approved by: https://github.com/zou3519	2022-04-22 14:04:26 +00:00
Mike Iovine	b6a4234090	[SR] Fix broken unit test build (#76111 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76111 https://github.com/pytorch/pytorch/pull/68640 broke our build by porting `cat` structured kernels, not sure how CI didn't catch this ghstack-source-id: 154335722 Test Plan: CI Reviewed By: navahgar, ajyu Differential Revision: D35780296 fbshipit-source-id: 0a262eb06a8d619227e5db10b6a775bf0b2e17c1 (cherry picked from commit aea6fbf9365391011df5211164e3978075d7a5cb)	2022-04-20 18:36:31 +00:00
mikeiovine	98b4a4100d	[SR] Add a copy variant for fused_split_and_squeeze Pull Request resolved: https://github.com/pytorch/pytorch/pull/75660 The outputs of `split_and_squeeze` are passed to `VarStack` in models we care about. `VarStack` has a [fast path](https://www.internalfb.com/code/fbsource/[893193f5277184fd17f4ea3f28fe415a4df37707]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=296-298) for when all of its inputs have the same strides. Hitting the slow path adds a ton of extra overhead - so much that it's worth it to copy in `split_and_squeeze` and force all of `VarStack`'s inputs to be contiguous so we can take advantage of the fast path. Differential Revision: [D35513777](https://our.internmc.facebook.com/intern/diff/D35513777/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35513777/)! Approved by: https://github.com/hlu1	2022-04-13 20:02:01 +00:00
Yulv-git	ac2d2e3a3d	Fix some typos. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/75561 Approved by: https://github.com/albanD	2022-04-11 21:55:59 +00:00
Nikita Shulga	80ea6955af	Add cuda-11.3+clang9 build workflow (take 2) To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic) Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu` Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293 Approved by: https://github.com/atalman, https://github.com/seemethere	2022-04-11 17:13:01 +00:00
PyTorch MergeBot	8fe43d76d5	Revert "Add cuda-11.3+clang9 build workflow" This reverts commit `709fcc862e`. Reverted https://github.com/pytorch/pytorch/pull/75293 on behalf of https://github.com/janeyx99	2022-04-11 15:24:59 +00:00
Nikita Shulga	709fcc862e	Add cuda-11.3+clang9 build workflow To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic) Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu` Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293 Approved by: https://github.com/atalman, https://github.com/seemethere	2022-04-11 14:10:57 +00:00
Mike Iovine	2f98fa9147	[SR] Do not manage tensors that escape scope via container (#74966 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966 It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this: ``` def forward(self, cond: bool, a, b): lst = [] if cond: res = a + b # res should not be managed!!! lst.append(res) return lst ``` The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`. The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`. This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy. ghstack-source-id: 153288188 Test Plan: New unit tests cover the problematic case Reviewed By: navahgar Differential Revision: D35257087 fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7 (cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)	2022-04-07 19:57:57 +00:00
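The escape-through-a-container case described in the commit above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Static Runtime code: `run`, `pool`, and `manage_res` are hypothetical names, and the "managed" storage is assumed to be a single buffer that is overwritten on every run.

```python
# Toy illustration (hypothetical names, not Static Runtime APIs).
# "Managed" storage is modeled as one shared buffer reused on every run.

def run(pool, cond, a, b, manage_res):
    lst = []
    if cond:
        if manage_res:
            res = pool                # managed: planner-owned, reused next run
        else:
            res = [0.0] * len(a)      # unmanaged: fresh storage for this run
        for i in range(len(a)):
            res[i] = a[i] + b[i]
        lst.append(res)               # res escapes the if-block through lst
    return lst

# If res is managed, the second run clobbers the first run's result.
pool = [0.0, 0.0]
managed_first = run(pool, True, [1.0, 2.0], [3.0, 4.0], manage_res=True)
managed_second = run(pool, True, [10.0, 20.0], [30.0, 40.0], manage_res=True)

# If res is not managed, each returned list keeps its own values.
safe_first = run(pool, True, [1.0, 2.0], [3.0, 4.0], manage_res=False)
safe_second = run(pool, True, [10.0, 20.0], [30.0, 40.0], manage_res=False)
```

In this sketch `managed_first[0]` and `managed_second[0]` are the same object, which is the kind of aliasing hazard that marking such values as external avoids.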
Mike Iovine	4055d1f653	[SR] Fix StaticRuntime move ctor (#74927 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927 The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated. Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue. ghstack-source-id: 153168602 Test Plan: New unit test/CI Reviewed By: navahgar Differential Revision: D35228467 fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3 (cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)	2022-04-07 02:16:37 +00:00
Don Jang	85e163c56b	[Static Runtime] Fix a bug that `aten::full_like` reuses a tensor that does not match arguments (#74255 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255 This change fixes a bug that `aten::full_like` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full_like` are dynamically changed. Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path. Reviewed By: mikeiovine Differential Revision: D34863639 fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608 (cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)	2022-04-05 22:30:41 +00:00
Raghavan Raman	60bda4d06b	[Static Runtime] Fix handling relu in quantized linear relu dynamic op Summary: The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter. At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear if that semantics is being followed in this code. So, we are handling this in SR's out-variant implementation, until the owner fixes that issue. This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model. Test Plan: ``` buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16 ``` Reviewed By: mikeiovine Differential Revision: D35366309 fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9 (cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)	2022-04-04 22:16:22 +00:00
Max Podkorytov	11c412a8ec	[static-runtime] optimize empty if blocks at runtime (#74987 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74987 Add specializations to `prim::If` operator at runtime to save resources when some of subblocks are empty Test Plan: `buck build //caffe2:torch-cpp-cpu` `buck test //caffe2/benchmarks/static_runtime/...` Add unit test: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.EmptyIfBlock` Reviewed By: mikeiovine Differential Revision: D35262952 fbshipit-source-id: 324f88471f33f035f4d8a9b212716530d8e59df2 (cherry picked from commit 2db1b1a6833b1376fa376f54791effc8e12fb77f)	2022-04-01 05:43:33 +00:00
Norman Ponte	2e8b9c7785	[TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035 - modify `--ai_pep_format` to `--report_aibench` to better reflect underlying framework name change Reviewed By: tgalkovskyi Differential Revision: D35257017 fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f (cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275)	2022-04-01 00:35:42 +00:00
jjsjann123	873ced7cd0	Nvfuser code bump 030122 (#73627 ) Summary: Things changed in this PR that require review: test/forward_backward_compatibility/check_forward_backward_compatibility.py Our previous function overload extension names were wrong and have been updated in this PR, hence the compatibility list was updated. nvfuser code updates with bug fixes towards failures we encountered in OpInfoTests as well as failures reported by AOTAutograd team. Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627 Reviewed By: Chillee Differential Revision: D34765458 Pulled By: davidberard98 fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7 (cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)	2022-03-31 08:18:22 +00:00
Mike Iovine	2ca66ffb7d	[SR] Force split_and_squeeze usage via graph transformation (#74274 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74274 Reviewed By: navahgar Differential Revision: D34913889 fbshipit-source-id: 655d3f1e5f4c027cb94758b74826a4b4882e9458 (cherry picked from commit bc94d30b69888ca6633a27090a3b87a08919231a)	2022-03-29 19:13:40 +00:00
Elias Ellison	6694fdaccd	Clean up profiling mode and profiling executor strategy (#73875 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73875 Previously we had a few settings: - getExecutor - which toggled between Profiling Executor and Legacy - getGraphOptimize - if true, overrides PE/Legacy to run with simple executor (no optimizations) and then... - getProfilingMode - which would set PE to 0 specializations. The last mode is redundant with getGraphOptimize, we should just remove it and use getGraphOptimize in these cases. It would lead to potentially invalid combinations of logic - what does it mean if getProfilingMode is true but getExecutor is set to false? This would lead to a bug in specialize_autograd_zero in this case, see: https://github.com/pytorch/pytorch/blob/master/torch%2Fcsrc%2Fjit%2Fpasses%2Fspecialize_autogradzero.cpp#L93. The tests here are failing but get fixed with the PR above it, so I'll squash for landing. Test Plan: Imported from OSS Reviewed By: cpuhrsch Differential Revision: D34938130 Pulled By: eellison fbshipit-source-id: 1a9c0ae7f6d1cfddc2ed3499a5af611053ae5e1b (cherry picked from commit cf69ce3d155ba7d334022c42fb2cee54bb088c23)	2022-03-29 18:38:51 +00:00
Mike Iovine	3f37337ed0	[SR] Native implementation for reshape_as (#74585 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74585 Native static runtime for `aten::reshape_as` ghstack-source-id: 152340038 Test Plan: New unit test Reviewed By: hlu1 Differential Revision: D35060895 fbshipit-source-id: c4e6f8a04c7df3821c7e654bfaf584e5a72ea701 (cherry picked from commit 6fa596cd866a024b6653239e0e30ddad42de242f)	2022-03-28 17:02:14 +00:00
Mike Iovine	9f2344aa40	[SR] Native implementation for select (#74568 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74568 Native static runtime implementation for `aten::select(Tensor, int, int)` overload ghstack-source-id: 152340037 Test Plan: New unit test Reviewed By: hlu1 Differential Revision: D35053900 fbshipit-source-id: c315d4202a4dfca3360325547af805aea33ecc9f (cherry picked from commit 8683f214dbd8c081365bad727007bbff969b64d0)	2022-03-28 17:02:14 +00:00
Mike Iovine	facdbe6d72	[SR] Native implementation for IntImplicit (#74562 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74562 Add a native implementation for `aten::IntImplicit`, which is similar to `aten::Int` except for a few extra checks it must do ghstack-source-id: 152340039 Test Plan: New unit tests Reviewed By: hlu1 Differential Revision: D35052997 fbshipit-source-id: cb2f0faf7c62382e3f13750d8e1280c49c6b9e42 (cherry picked from commit 359c7493f8deaeccebc27e1b6e6e9777850010c1)	2022-03-28 17:02:14 +00:00
Mike Iovine	f5a9c36d0b	[SR] Eliminate extra permute ops before `aten::sum` (#74481 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481 This diff fixes an interesting performance issue related to `permute_copy`. We see this pattern frequently: ``` y = torch.permute(x, (0, 2, 1)) z = torch.sum(y, dim=-1) ``` With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589 But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597 But the permute is actually unnecessary, we can just statically turn the graph into this to ensure that the fast kernel is hit with copy variants on: ``` z = torch.sum(x, dim=1) ``` ghstack-source-id: 152003888 Reviewed By: navahgar Differential Revision: D34992319 fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f (cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)	2022-03-23 23:00:14 +00:00
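The graph rewrite in the commit above rests on the identity that summing the last dimension of `permute(x, (0, 2, 1))` equals summing dimension 1 of `x`. The following is a small sketch that checks the identity on nested lists instead of tensors; `permute_021` and `sum_dim` are hypothetical helpers written for this illustration and are not PyTorch functions.

```python
# Plain-Python check of the identity behind the rewrite (no torch required).
# permute_021 and sum_dim are illustrative helpers, not library APIs.

def permute_021(x):
    # y[b][j][i] = x[b][i][j], i.e. swap the last two dimensions
    return [
        [[x[b][i][j] for i in range(len(x[b]))] for j in range(len(x[b][0]))]
        for b in range(len(x))
    ]

def sum_dim(x, dim):
    # Reduce a 3-D nested list over dim 1 or over the last dim (2 or -1)
    if dim in (2, -1):
        return [[sum(row) for row in mat] for mat in x]
    if dim == 1:
        return [[sum(col) for col in zip(*mat)] for mat in x]
    raise ValueError("only dim 1, 2, or -1 are supported in this sketch")

x = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [10, 11, 12]]]               # shape (2, 2, 3)

with_permute = sum_dim(permute_021(x), -1)    # y = permute(x); z = sum(y, -1)
without_permute = sum_dim(x, 1)               # z = sum(x, 1)
```

Both expressions reduce over the same elements, so the permute can be dropped without changing the result; only the memory layout seen by the reduction kernel differs.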
Don Jang	6294a2eb7f	[Static Runtime] Add out variant wrapper for aten::index_select (#74321 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74321 This change adds out variant wrapper for aten::index_select. Test Plan: Added a unittest Reviewed By: mikeiovine Differential Revision: D34928012 fbshipit-source-id: d808363d740d79fa25abee4dd33920fbb6ec7283 (cherry picked from commit ba9b3c0cd4ba240c4a2174f3376580a1880b2b4a)	2022-03-16 23:43:21 +00:00
Mike Iovine	f14a0be302	[SR] Avoid allocating rstd/mean in layer_norm (#73606 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606 The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely. ghstack-source-id: 151394116 Test Plan: Existing unit tests Reviewed By: hlu1 Differential Revision: D34562131 fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de (cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)	2022-03-15 22:07:11 +00:00
Don Jang	381c0c080f	[Static Runtime] Fix a bug that `aten::full` reuses a tensor that does not match requested one (#73990 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990 This change fixes a bug that `aten::full` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full` are dynamically changed. This fix is applied to multiple other out variant wrappers added to Static Runtime, and their fixes are following. Test Plan: - Added a unittest. Reviewed By: mikeiovine Differential Revision: D34768718 fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9 (cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)	2022-03-15 16:13:52 +00:00
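The class of bug fixed in the commit above (an out variant reusing a cached output that no longer matches the requested arguments) can be sketched as follows. This is a toy model in plain Python, not the Static Runtime implementation: `ToyTensor` and `full_cached` are hypothetical, and checking shape and dtype is an assumption about which properties matter.

```python
from dataclasses import dataclass, field

# Toy model (hypothetical): reuse a cached output only when it still matches
# the arguments of the current call; otherwise allocate a fresh one.

@dataclass
class ToyTensor:
    shape: tuple
    dtype: str
    data: list = field(default_factory=list)

def numel(shape):
    n = 1
    for d in shape:
        n *= d
    return n

def full_cached(prev, shape, fill, dtype):
    shape = tuple(shape)
    if prev is not None and prev.shape == shape and prev.dtype == dtype:
        # Safe to reuse: overwrite the existing storage in place.
        for i in range(len(prev.data)):
            prev.data[i] = fill
        return prev
    # Arguments changed between calls: do not reuse the stale output.
    return ToyTensor(shape, dtype, [fill] * numel(shape))

t1 = full_cached(None, (2, 3), 1.0, "float32")
t2 = full_cached(t1, (2, 3), 5.0, "float32")   # same arguments: reused
t3 = full_cached(t2, (4,), 7, "int64")         # changed arguments: fresh
```

Skipping the match check would hand back `t2` for the third call, with the wrong shape and dtype for the request.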
Don Jang	1b80f609b0	[Static Runtime] Add out variant wrapper for aten::ones_like (#73945 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945 This change adds an out variant wrapper for aten::ones_like. Test Plan: - Added a unittest. - Checked that the op execution got switched to its added out variant (P485330978). Reviewed By: hlu1 Differential Revision: D34727057 fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5 (cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)	2022-03-14 20:29:58 +00:00

719 Commits