pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	d28e9e145b	Revert "[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147 )" This reverts commit `49c41b87a2`. Reverted https://github.com/pytorch/pytorch/pull/79147 on behalf of https://github.com/janeyx99 due to Broke 11.3 builds on trunk `49c41b87a2`	2022-06-10 20:55:10 +00:00
jjsjann123	49c41b87a2	[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147 ) Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Bug fixes and minor refactor Squashed commits to WAR github API Commits that's actually in this PR from the devel branch: ``` 4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725) 02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753) 8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746) ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738) 02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745) 465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744) 26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742) 856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736) 1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732) de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733) fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728) b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729) 5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727) ``` RUN_TORCHBENCH: nvfuser Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147 Approved by: https://github.com/davidberard98	2022-06-10 19:37:42 +00:00
Akshay Parashar	28f87b9cf9	[Static Runtime] Fix aten::clone out variant (#78297 ) (#78322 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/78297 Clone following expand/expand_as crashes due to the memoryOverlap check in the native copy_ method. Refer to T118519310 for more details. Crashing test case: a = tensor(3,1) // strides = (1,1) b = tensor(3,2) // strides = (2,1) temp = a.expand_as(b) // creates temp with shape (3,2) and strides (1,0) temp.clone() // crashes on copy_ due to memoryOverlap Fix: Disable the out variant for the expanded tensor. - Calls native clone instead of out variant for clone dealing with expanded tensors - Added test case for both clone variants (out and native clones) - Increased the tensor size for memory planner test case to trigger dynamic allocation Test Plan: buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest Differential Revision: D36672180 Pull Request resolved: https://github.com/pytorch/pytorch/pull/78322 Approved by: https://github.com/mikeiovine	2022-06-02 21:06:59 +00:00
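Note on 28f87b9cf9: the overlap behind the crash can be illustrated without PyTorch. The sketch below is a plain-Python model, not the ATen memoryOverlap check itself; `has_internal_overlap` is a hypothetical helper name. It enumerates every index of a tensor with the given shape and strides and reports whether two indices land on the same storage offset, which is what happens for the expanded tensor with strides (1,0).

```python
from itertools import product


def has_internal_overlap(shape, strides):
    """Return True if two distinct indices map to the same storage offset.

    Hypothetical helper for illustration only; it brute-forces all indices,
    which is fine for tiny shapes like the ones in the commit message.
    """
    seen = set()
    for index in product(*(range(n) for n in shape)):
        offset = sum(i * s for i, s in zip(index, strides))
        if offset in seen:
            return True
        seen.add(offset)
    return False


# Shape (3, 2) with contiguous strides (2, 1): every element has its own slot.
contiguous = has_internal_overlap((3, 2), (2, 1))
# Shape (3, 2) with strides (1, 0), as produced by expand_as in the test case:
# both columns of a row share one slot, so writing into it is ambiguous.
expanded = has_internal_overlap((3, 2), (1, 0))
print(contiguous, expanded)
```

An out variant writes into a preallocated destination, so a destination or source with self-overlapping storage is exactly the case the commit routes to the native clone instead.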
Max Podkorytov	ebfc70f37a	[static-runtime] out variant for aten::mean (#78161 ) Summary: As subject Test Plan: Added unit tests Differential Revision: D36614633 Pull Request resolved: https://github.com/pytorch/pytorch/pull/78161 Approved by: https://github.com/mikeiovine	2022-06-02 20:56:42 +00:00
Max Podkorytov	2679755bdc	[static-runtime] out variant for aten::max (#78271 ) Summary: Previously the op was auto-generated but it only covered the pointwise overload of aten::max. This adds support for reduction, overall and along a dim Test Plan: Added a unit test Differential Revision: D36656378 Pull Request resolved: https://github.com/pytorch/pytorch/pull/78271 Approved by: https://github.com/mikeiovine	2022-05-26 23:29:27 +00:00
Hui Guo	d12bf9fd75	[static_runtime] Add auto-generated view ops (#77106 ) Summary: This includes the generated view ops from D36258767. Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest Differential Revision: D36258968 Pull Request resolved: https://github.com/pytorch/pytorch/pull/77106 Approved by: https://github.com/alanwaketan, https://github.com/tenpercent	2022-05-26 03:13:59 +00:00
mikeiovine	56c23f5633	[SR] Out variant for embedding_bag_byte_unpack Pull Request resolved: https://github.com/pytorch/pytorch/pull/77661 Add an out variant and wrapper in static runtime. I just added the declaration with the others in `qembeddingbag.h` for now (rather than properly adding the out variant to the torch library). This can be fixed in a followup. Differential Revision: [D36449840](https://our.internmc.facebook.com/intern/diff/D36449840/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36449840/)! Approved by: https://github.com/tenpercent	2022-05-25 23:24:11 +00:00
mikeiovine	2ae3c59e4b	[SR] Remove linear/relu fusion Pull Request resolved: https://github.com/pytorch/pytorch/pull/77620 Apparently, this is not implemented in fbgemm, so it's strictly worse than using NNC. Differential Revision: [D36431811](https://our.internmc.facebook.com/intern/diff/D36431811/) Approved by: https://github.com/hlu1	2022-05-23 21:46:27 +00:00
Hao Lu	c60d2ef4eb	[StaticRuntime] Replace Permute with copy version only when it's followed by reshape or flatten (#77832 ) Reviewed By: mikeiovine Differential Revision: D36466622 Pull Request resolved: https://github.com/pytorch/pytorch/pull/77832 Approved by: https://github.com/mikeiovine	2022-05-20 03:14:01 +00:00
jjsjann123	a2802ad0b9	Upstream master bump 0513 (#77471 ) Updating nvfuser code base. This should fix the indexing issue observed in https://github.com/pytorch/vision/issues/6015. Running tests locally as well. Will update the description here at a later point @bypass-github-export-checks Pull Request resolved: https://github.com/pytorch/pytorch/pull/77471 Approved by: https://github.com/seemethere, https://github.com/eellison	2022-05-18 11:48:50 -07:00
mikeiovine	02713221e3	[SR] Fuse clamp/nan_to_num Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094 Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor. Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)! Approved by: https://github.com/navahgar	2022-05-10 23:33:59 +00:00
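Note on 02713221e3: a scalar sketch of why the fused kernel can skip comparisons. It assumes, which the message does not state, that `clamp` runs first with finite bounds and that NaN passes through `clamp` unchanged; under those assumptions no infinity can reach `nan_to_num`, so only the NaN check remains. The function names and default replacement values are illustrative and are not the NNC kernel.

```python
import math
import sys


def clamp(x, lo, hi):
    # NaN is passed through unchanged (assumption stated in the note above).
    if math.isnan(x):
        return x
    return min(max(x, lo), hi)


def nan_to_num(x, nan=0.0, posinf=sys.float_info.max, neginf=-sys.float_info.max):
    # Three checks per element: NaN, +inf, -inf.
    if math.isnan(x):
        return nan
    if x == math.inf:
        return posinf
    if x == -math.inf:
        return neginf
    return x


def unfused(x, lo, hi, nan=0.0):
    return nan_to_num(clamp(x, lo, hi), nan)


def fused(x, lo, hi, nan=0.0):
    # With finite lo/hi the clamped value is never infinite,
    # so the two infinity comparisons are dropped.
    return nan if math.isnan(x) else min(max(x, lo), hi)


samples = [-math.inf, -3.0, 0.5, 7.0, math.inf, math.nan]
results = [(unfused(x, -1.0, 1.0), fused(x, -1.0, 1.0)) for x in samples]
print(results)
```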
Mike Iovine	849984a2cd	[SR] Sigmoid out variant calls fast_sigmoid (#75661 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75661 `fast_sigmoid` is a variant of sigmoid in NNC that is implemented in terms of `fast_tanh` (which is a fast rational function approximation). ghstack-source-id: 155604086 Reviewed By: navahgar, hlu1 Differential Revision: D35481390 fbshipit-source-id: 1d64b5c375539f3b2461a1f3d9b86cd696eae7a1 (cherry picked from commit 8106c2512b8d7b373cb6545a43c3e8fc04805c4b)	2022-05-06 00:14:30 +00:00
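Note on 849984a2cd: sigmoid can be written in terms of tanh through the identity sigmoid(x) = 0.5 * (1 + tanh(x / 2)). The sketch below checks that identity numerically. It uses `math.tanh` as a stand-in; the rational approximation used by NNC's `fast_tanh` is not reproduced here, and `sigmoid_via_tanh` is a hypothetical name.

```python
import math


def sigmoid_reference(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_via_tanh(x, tanh=math.tanh):
    # Any tanh implementation can be plugged in; an approximate tanh
    # yields an approximate sigmoid with the same structure.
    return 0.5 * (1.0 + tanh(0.5 * x))


points = [-20.0, -3.0, -0.5, 0.0, 0.5, 3.0, 20.0]
max_error = max(abs(sigmoid_reference(x) - sigmoid_via_tanh(x)) for x in points)
print(max_error)
```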
Mike Iovine	1fed6b7559	[SR] Eliminate extra permutes around softmax calls (#76391 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76391 I've seen this pattern in many important internal models: ``` x = torch.permute(a, [0, 2, 1]) y = torch.softmax(x, 2) z = torch.permute(y, [0, 2, 1]) ``` This is equivalent to ``` z = torch.softmax(a, 1) ``` The `permute` ops can degrade performance, especially if copy variants are on. Add another pattern to our `EliminateExtraPermuteOpsPass` to handle this. ghstack-source-id: 155466506 Test Plan: New unit tests Reviewed By: navahgar, huiguoo Differential Revision: D35938289 fbshipit-source-id: 398b5528077b0b3f1c6fc5544e483803e96d68e9 (cherry picked from commit d742abd094d1fef23ca6a34703d97a6da2d14bd1)	2022-05-04 23:08:49 +00:00
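Note on 1fed6b7559: the permute/softmax/permute pattern equals a single softmax over dimension 1 of the original input. The sketch below checks this on nested lists in plain Python, so it runs without PyTorch; all helper names are hypothetical and it only covers 3-D inputs with the `[0, 2, 1]` permutation from the message.

```python
import math


def permute_021(t):
    """Swap the last two dimensions of a 3-D nested list."""
    return [[list(col) for col in zip(*mat)] for mat in t]


def softmax_1d(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_dim2(t):
    """Softmax along the last dimension."""
    return [[softmax_1d(row) for row in mat] for mat in t]


def softmax_dim1(t):
    """Softmax along the middle dimension, computed directly by indexing."""
    out = []
    for mat in t:
        rows, cols = len(mat), len(mat[0])
        res = [[0.0] * cols for _ in range(rows)]
        for j in range(cols):
            column = softmax_1d([mat[i][j] for i in range(rows)])
            for i in range(rows):
                res[i][j] = column[i]
        out.append(res)
    return out


a = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[0.0, -1.0, 2.0], [3.0, 0.5, -2.0]]]  # shape (2, 2, 3)

with_permutes = permute_021(softmax_dim2(permute_021(a)))
direct = softmax_dim1(a)
same = all(
    math.isclose(p, d)
    for mp, md in zip(with_permutes, direct)
    for rp, rd in zip(mp, md)
    for p, d in zip(rp, rd)
)
print(same)
```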
Mike Iovine	cac2733af1	[SR] Codegen for aten::clamp (#76340 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76340 NNC kernel for `clamp` scalar case ghstack-source-id: 155466507 Reviewed By: navahgar, huiguoo Differential Revision: D35904019 fbshipit-source-id: e4115757f7e2cbdf364b88be3f599dfc3028750f (cherry picked from commit bdc4b918bc5a14490f46c79793f764b28c18388f)	2022-05-04 23:08:49 +00:00
Wang, Eikan	429a80dded	[NNC] Lowering function generates the output buffer with the specified stride (#76529 ) Summary: Pass stride information to lowering function to generate the output buffer with proper memory layout. Pull Request resolved: https://github.com/pytorch/pytorch/pull/76529 Reviewed By: ZolotukhinM Differential Revision: D36116712 Pulled By: IvanKobzarev fbshipit-source-id: d3901f756b3710ecce172d6db3ecb0b7c12fb929 (cherry picked from commit b6cd53c91c01db36ea0e99167dc0ce0ae1d3aa23)	2022-05-04 20:04:22 +00:00
Hui Guo	bcddd4ab3e	[Static Runtime] Add auto generated unstructured ops (#76398 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76398 This diff adds the large files that include the newly generated ops from D34913736. Refer to the base diff for more details. Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest Reviewed By: mikeiovine, tenpercent Differential Revision: D35945633 fbshipit-source-id: 53497bd5c490a57ea1521837762f740deb42bfd8 (cherry picked from commit e0fbdcb0bf09f5c192430f95f450c0a946c80074)	2022-05-04 19:34:19 +00:00
Mike Iovine	fc64dbdc01	[SR] Fuse quantized linear/relu (#75775 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775 fbgemm kernels already implement the fused kernel, no reason not to use it ghstack-source-id: 155450342 Test Plan: New unit tests Reviewed By: navahgar Differential Revision: D35633297 fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48 (cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)	2022-05-04 19:01:14 +00:00
Michael Suo	fb0f285638	[lint] upgrade mypy to latest version Fixes https://github.com/pytorch/pytorch/issues/75927. Had to fix some bugs and add some ignores. To check if clean: ``` lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753 Approved by: https://github.com/malfet	2022-05-03 20:51:34 +00:00
PyTorch MergeBot	3d7428d9ac	Revert "[lint] upgrade mypy to latest version" This reverts commit `9bf18aab94`. Reverted https://github.com/pytorch/pytorch/pull/76753 on behalf of https://github.com/suo	2022-05-03 20:01:18 +00:00
Michael Suo	9bf18aab94	[lint] upgrade mypy to latest version Fixes https://github.com/pytorch/pytorch/issues/75927. Had to fix some bugs and add some ignores. To check if clean: ``` lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753 Approved by: https://github.com/malfet	2022-05-03 19:43:28 +00:00
Mike Iovine	b02b3f25db	[SR] Quick hack to eliminate no-op slice (#75774 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774 `list[0:]` is a no-op. This should really be eliminated on the modeling side, implement as a graph pass for now until we can get this into prod models. Test Plan: New unit tests Reviewed By: navahgar Differential Revision: D35632947 fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d (cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)	2022-05-03 19:29:46 +00:00
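Note on b02b3f25db: a toy version of a pass that drops `list[0:]` slices. The `Node` record and `eliminate_noop_slices` are hypothetical and model a graph as a flat list of nodes; this is not the TorchScript graph API or the pass added in the commit. A slice also produces a fresh list, so the rewrite is only safe when nothing depends on the copy being a distinct object, which this sketch simply assumes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    op: str
    inputs: tuple  # value names or constants
    output: str


def eliminate_noop_slices(nodes):
    """Remove slice(x, start=0, end=None, step=1) and rewire its users to x."""
    replaced = {}
    kept = []
    for node in nodes:
        # Redirect inputs that referred to an already-removed slice.
        inputs = tuple(replaced.get(i, i) for i in node.inputs)
        if node.op == "slice" and inputs[1:] == (0, None, 1):
            replaced[node.output] = inputs[0]
            continue
        kept.append(Node(node.op, inputs, node.output))
    return kept


graph = [
    Node("slice", ("lst", 0, None, 1), "s"),   # lst[0:]  -> removed
    Node("slice", ("lst", 1, None, 1), "t"),   # lst[1:]  -> kept
    Node("len", ("s",), "n"),
]
optimized = eliminate_noop_slices(graph)
print(optimized)
```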
Mike Iovine	3fa77fa51a	[SR] Fix quantized linear tests not managing outputs (#75776 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75776 The output was returned directly instead of a clone, so the output of the relevant op would not be managed. ghstack-source-id: 154935103 Test Plan: CI Reviewed By: navahgar Differential Revision: D35633469 fbshipit-source-id: 7b08b7368e0349a12abf8802a4c625ffecdc5abb (cherry picked from commit 24bed9ba4da39cff7f3b40f5e49dfded2552b373)	2022-04-27 16:38:54 +00:00
Aaron Enye Shi	09a5b075fe	[libkineto] Re-enable user-annotations in PyTorch (#75601 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75601 User annotations were previously pushed down to the GPU timelines but were disabled during a refactoring some time back. This patch re-enables them in PyTorch Profiler. Test Plan: CI Tests Reviewed By: chaekit Differential Revision: D34591916 Pulled By: aaronenyeshi fbshipit-source-id: 3f4d5327b391725f4ce4e3eb16740bac2cd1c618 (cherry picked from commit 4bc07174dfef8fb2ffbefba224773a4618ed203a)	2022-04-26 23:54:22 +00:00
zengk95	1d55518198	Revert "[nnc] Strides to Tensor (#72962 )" This reverts commit `939060925f`. Fixes https://github.com/pytorch/vision/issues/5873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/76332 Approved by: https://github.com/seemethere	2022-04-25 19:50:00 +00:00
Ivan Kobzarev	939060925f	[nnc] Strides to Tensor (#72962 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72962 Test Plan: Imported from OSS Reviewed By: ZolotukhinM, cpuhrsch Differential Revision: D34589306 Pulled By: IvanKobzarev fbshipit-source-id: ecee5249760ecc0c8b2edb1842b90218899bc944 (cherry picked from commit 9e310c4c67389da30da89126d838ffe3864aba6f)	2022-04-23 19:35:15 +00:00
Ansha Yu	ee636e2fd1	[sr] remove max_indices argument of embedding_bag when unnecessary (#75993 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993 Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0 {F723683014} More details in https://fb.quip.com/MKumAjz1YD4 (`1f47a80e88`)a#temp:C:FPD3 (`ecd5567980`)e5a0871ae5d481286b511ef7 The last 3 outputs of embedding_bag are unused in the graph: P495814049. * max_indices output isn't necessary for the main output, so remove it when it's not used in the graph. * offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph. * bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now. Test Plan: `./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only` Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604` Before: I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78 0.11156 ms. 99.0457%. aten::embedding_bag (52 nodes, out variant) After: I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2 0.0789221 ms. 98.7096%. 
static_runtime::embedding_bag (52 nodes, out variant) * Ads prod canary: https://www.internalfb.com/intern/ads/canary/443002539593035806/ * 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594` https://www.internalfb.com/intern/servicelab/602875732/ * 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594` https://www.internalfb.com/intern/servicelab/1002874745/ Reviewed By: mikeiovine Differential Revision: D35726594 fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a (cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)	2022-04-22 15:36:35 +00:00
vfdev-5	6593d293f7	Added functorch to functional_autograd_benchmark Description: - Following https://github.com/pytorch/functorch/issues/497 adding an option to run benchmarks with functorch and compare to original functional autograd results. Running the benchmark we get below table: <details> <summary> Table </summary> ``` \| model \| task \| mean \| var \| \| -- \| -- \| -- \| -- \| \| resnet18 \| vjp \| 0.03826599195599556 \| 4.3332115637895186e-06 \| \| resnet18 \| functorch vjp \| 0.037201929837465286 \| 6.139693198292662e-09 \| \| resnet18 \| vhp \| 0.2202976644039154 \| 2.8687209052691287e-08 \| \| resnet18 \| functorch vhp \| 0.22117868065834045 \| 4.108771278765744e-08 \| \| resnet18 \| jvp \| 0.18679651618003845 \| 1.832455254202614e-08 \| \| resnet18 \| functorch jvp \| 0.05305683612823486 \| 1.6690266946284282e-08 \| \| fcn_resnet \| vjp \| 0.6071907877922058 \| 7.436695454998699e-07 \| \| fcn_resnet \| functorch vjp \| 0.6115708947181702 \| 1.121692207561864e-06 \| \| fcn_resnet \| vhp \| 3.419469118118286 \| 0.020633839070796967 \| \| fcn_resnet \| jvp \| 2.5421929359436035 \| 3.1765587209520163e-06 \| \| fcn_resnet \| functorch jvp \| 0.7628333568572998 \| 1.4555752159139956e-07 \| \| detr \| vjp \| 0.19494840502738953 \| 1.9122715457342565e-05 \| \| detr \| vhp \| 1.1664292812347412 \| 0.000948643428273499 \| \| detr \| jvp \| 0.9990308880805969 \| 1.0214127541985363e-05 \| \| ppl_simple_reg \| vjp \| 0.0007535457843914628 \| 6.024204690646684e-09 \| \| ppl_simple_reg \| functorch vjp \| 0.0016954183811321855 \| 1.160151974488599e-08 \| \| ppl_simple_reg \| vhp \| 0.0011888503795489669 \| 5.93119386937957e-10 \| \| ppl_simple_reg \| functorch vhp \| 0.0026826143730431795 \| 1.6787025103326414e-08 \| \| ppl_simple_reg \| jvp \| 0.001067900680936873 \| 7.409912128331086e-10 \| \| ppl_simple_reg \| functorch jvp \| 0.002065300941467285 \| 9.710328185974504e-08 \| \| ppl_simple_reg \| hvp \| 0.001212477684020996 \| 1.974137298077494e-09 \| \| 
ppl_simple_reg \| functorch hvp \| 0.00482442369684577 \| 2.327668653379078e-07 \| \| ppl_simple_reg \| jacobian \| 0.0009108781814575195 \| 3.489469158068914e-09 \| \| ppl_simple_reg \| functorch jacobian \| 0.0019866942893713713 \| 1.938326299466553e-08 \| \| ppl_simple_reg \| hessian \| 0.005053090862929821 \| 3.370298600202659e-07 \| \| ppl_simple_reg \| functorch hessian \| 0.006374978926032782 \| 7.556796077778927e-08 \| \| ppl_simple_reg \| hessian_fwdrev \| 0.0036706924438476562 \| 1.996075527088692e-09 \| \| ppl_simple_reg \| functorch hessian_fwdrev \| 0.0058908225037157536 \| 7.548283775804521e-08 \| \| ppl_simple_reg \| hessian_revrev \| 0.0015769004821777344 \| 1.5754418214442012e-08 \| \| ppl_simple_reg \| functorch hessian_revrev \| 0.0041002752259373665 \| 6.713568723171193e-08 \| \| ppl_simple_reg \| jacfwd \| 0.0018048763740807772 \| 2.7375660849315864e-08 \| \| ppl_simple_reg \| functorch jacfwd \| 0.002047991845756769 \| 2.432247070416338e-09 \| \| ppl_simple_reg \| jacrev \| 0.0009733677143231034 \| 1.0078769818733235e-08 \| \| ppl_simple_reg \| functorch jacrev \| 0.0021971464157104492 \| 1.2729884701911942e-08 \| \| ppl_robust_reg \| vjp \| 0.005820560269057751 \| 8.582588151284654e-08 \| \| ppl_robust_reg \| functorch vjp \| 0.00796132069081068 \| 9.663100541956737e-09 \| \| ppl_robust_reg \| vhp \| 0.009825301356613636 \| 2.0081762386325863e-07 \| \| ppl_robust_reg \| functorch vhp \| 0.014890861697494984 \| 4.558066279969353e-07 \| \| ppl_robust_reg \| jvp \| 0.008297419175505638 \| 2.9454400873873965e-07 \| \| ppl_robust_reg \| functorch jvp \| 0.008052706718444824 \| 7.120377176761394e-08 \| \| ppl_robust_reg \| hvp \| 0.015414690598845482 \| 7.42123745567369e-07 \| \| ppl_robust_reg \| functorch hvp \| 0.02699306048452854 \| 1.4650488537881756e-06 \| \| ppl_robust_reg \| jacobian \| 0.006207776255905628 \| 1.7068457225377642e-07 \| \| ppl_robust_reg \| functorch jacobian \| 0.009173822589218616 \| 1.2214455580306094e-07 \| \| 
ppl_robust_reg \| hessian \| 0.04670915752649307 \| 1.4299343092716299e-05 \| \| ppl_robust_reg \| functorch hessian \| 0.02337808534502983 \| 3.0397418413485866e-06 \| \| ppl_robust_reg \| hessian_fwdrev \| 0.024229884147644043 \| 2.0425247839739313e-06 \| \| ppl_robust_reg \| functorch hessian_fwdrev \| 0.022021746262907982 \| 3.512146236062108e-07 \| \| ppl_robust_reg \| hessian_revrev \| 0.012355780228972435 \| 7.090877147675201e-07 \| \| ppl_robust_reg \| functorch hessian_revrev \| 0.013960313983261585 \| 6.326549737423193e-07 \| \| ppl_robust_reg \| jacfwd \| 0.008112502284348011 \| 2.88503088086145e-08 \| \| ppl_robust_reg \| functorch jacfwd \| 0.008947920985519886 \| 4.2070990247111695e-08 \| \| ppl_robust_reg \| jacrev \| 0.00635871896520257 \| 1.3403841592207755e-07 \| \| ppl_robust_reg \| functorch jacrev \| 0.009123563766479492 \| 2.677554675756255e-07 \| \| wav2letter \| vjp \| 0.02078995667397976 \| 2.1110793113621185e-06 \| \| wav2letter \| functorch vjp \| 0.019202351570129395 \| 9.210506135559626e-09 \| \| wav2letter \| vhp \| 0.05997290462255478 \| 8.558587616391833e-09 \| \| wav2letter \| functorch vhp \| 0.06035261228680611 \| 1.6448565842708263e-09 \| \| wav2letter \| jvp \| 0.04507789760828018 \| 1.5771547401399744e-09 \| \| wav2letter \| functorch jvp \| 0.013057494536042213 \| 3.804750292601966e-09 \| \| deepspeech \| vjp \| 0.3648746609687805 \| 1.5359055396402255e-05 \| \| transformer \| vjp \| 0.05496881157159805 \| 1.242562319703211e-08 \| \| transformer \| functorch vjp \| 0.057835936546325684 \| 2.6113376350167528e-08 \| \| transformer \| vhp \| 0.18313491344451904 \| 7.226336151688884e-08 \| \| transformer \| jvp \| 0.13924935460090637 \| 1.6989159234981344e-07 \| \| multiheadattn \| vjp \| 0.0014708995586261153 \| 3.710916729460223e-08 \| \| multiheadattn \| functorch vjp \| 0.002404856728389859 \| 2.1910574687922235e-08 \| \| multiheadattn \| vhp \| 0.003382015274837613 \| 5.3098595742540056e-08 \| \| multiheadattn \| functorch 
vhp \| 0.005340623669326305 \| 5.897558708056749e-08 \| \| multiheadattn \| jvp \| 0.0027526854537427425 \| 3.508620949332908e-08 \| \| multiheadattn \| functorch jvp \| 0.0022981404326856136 \| 1.327894807445773e-07 \| ``` </details> <details> <summary> Stdout </summary> ``` Found functorch: 0.2.0a0+386a541 Results for model resnet18 on task vjp: 0.03826599195599556s (var: 4.3332115637895186e-06) Results for model resnet18 on task vjp using Functorch: 0.037201929837465286s (var: 6.139693198292662e-09) Results for model resnet18 on task vhp: 0.2202976644039154s (var: 2.8687209052691287e-08) Results for model resnet18 on task vhp using Functorch: 0.22117868065834045s (var: 4.108771278765744e-08) Results for model resnet18 on task jvp: 0.18679651618003845s (var: 1.832455254202614e-08) Results for model resnet18 on task jvp using Functorch: 0.05305683612823486s (var: 1.6690266946284282e-08) Results for model fcn_resnet on task vjp: 0.6071907877922058s (var: 7.436695454998699e-07) Results for model fcn_resnet on task vjp using Functorch: 0.6115708947181702s (var: 1.121692207561864e-06) Results for model fcn_resnet on task vhp: 3.419469118118286s (var: 0.020633839070796967) Failed model using Functorch: fcn_resnet, task: vhp, Error message: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 47.46 GiB total capacity; 45.62 GiB already allocated; 5.31 MiB free; 46.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Results for model fcn_resnet on task jvp: 2.5421929359436035s (var: 3.1765587209520163e-06) Results for model fcn_resnet on task jvp using Functorch: 0.7628333568572998s (var: 1.4555752159139956e-07) Results for model detr on task vjp: 0.19494840502738953s (var: 1.9122715457342565e-05) Failed model using Functorch: detr, task: vjp, Error message: Cannot access data pointer of Tensor that doesn't have storage Results for model detr on task vhp: 1.1664292812347412s (var: 0.000948643428273499) Failed model using Functorch: detr, task: vhp, Error message: Cannot access data pointer of Tensor that doesn't have storage Results for model detr on task jvp: 0.9990308880805969s (var: 1.0214127541985363e-05) Failed model using Functorch: detr, task: jvp, Error message: Trying to use forward AD with _cdist_forward that does not support it because it has not been implemented yet. Please file an issue to PyTorch at https://github.com/pytorch/pytorch/issues/new?template=feature-request.yml so that we can prioritize its implementation. 
Results for model ppl_simple_reg on task vjp: 0.0007535457843914628s (var: 6.024204690646684e-09) Results for model ppl_simple_reg on task vjp using Functorch: 0.0016954183811321855s (var: 1.160151974488599e-08) Results for model ppl_simple_reg on task vhp: 0.0011888503795489669s (var: 5.93119386937957e-10) Results for model ppl_simple_reg on task vhp using Functorch: 0.0026826143730431795s (var: 1.6787025103326414e-08) Results for model ppl_simple_reg on task jvp: 0.001067900680936873s (var: 7.409912128331086e-10) Results for model ppl_simple_reg on task jvp using Functorch: 0.002065300941467285s (var: 9.710328185974504e-08) Results for model ppl_simple_reg on task hvp: 0.001212477684020996s (var: 1.974137298077494e-09) Results for model ppl_simple_reg on task hvp using Functorch: 0.00482442369684577s (var: 2.327668653379078e-07) Results for model ppl_simple_reg on task jacobian: 0.0009108781814575195s (var: 3.489469158068914e-09) Results for model ppl_simple_reg on task jacobian using Functorch: 0.0019866942893713713s (var: 1.938326299466553e-08) Results for model ppl_simple_reg on task hessian: 0.005053090862929821s (var: 3.370298600202659e-07) Results for model ppl_simple_reg on task hessian using Functorch: 0.006374978926032782s (var: 7.556796077778927e-08) Results for model ppl_simple_reg on task hessian_fwdrev: 0.0036706924438476562s (var: 1.996075527088692e-09) Results for model ppl_simple_reg on task hessian_fwdrev using Functorch: 0.0058908225037157536s (var: 7.548283775804521e-08) Results for model ppl_simple_reg on task hessian_revrev: 0.0015769004821777344s (var: 1.5754418214442012e-08) Results for model ppl_simple_reg on task hessian_revrev using Functorch: 0.0041002752259373665s (var: 6.713568723171193e-08) Results for model ppl_simple_reg on task jacfwd: 0.0018048763740807772s (var: 2.7375660849315864e-08) Results for model ppl_simple_reg on task jacfwd using Functorch: 0.002047991845756769s (var: 2.432247070416338e-09) Results for model 
ppl_simple_reg on task jacrev: 0.0009733677143231034s (var: 1.0078769818733235e-08) Results for model ppl_simple_reg on task jacrev using Functorch: 0.0021971464157104492s (var: 1.2729884701911942e-08) Results for model ppl_robust_reg on task vjp: 0.005820560269057751s (var: 8.582588151284654e-08) Results for model ppl_robust_reg on task vjp using Functorch: 0.00796132069081068s (var: 9.663100541956737e-09) Results for model ppl_robust_reg on task vhp: 0.009825301356613636s (var: 2.0081762386325863e-07) Results for model ppl_robust_reg on task vhp using Functorch: 0.014890861697494984s (var: 4.558066279969353e-07) Results for model ppl_robust_reg on task jvp: 0.008297419175505638s (var: 2.9454400873873965e-07) Results for model ppl_robust_reg on task jvp using Functorch: 0.008052706718444824s (var: 7.120377176761394e-08) Results for model ppl_robust_reg on task hvp: 0.015414690598845482s (var: 7.42123745567369e-07) Results for model ppl_robust_reg on task hvp using Functorch: 0.02699306048452854s (var: 1.4650488537881756e-06) Results for model ppl_robust_reg on task jacobian: 0.006207776255905628s (var: 1.7068457225377642e-07) Results for model ppl_robust_reg on task jacobian using Functorch: 0.009173822589218616s (var: 1.2214455580306094e-07) Results for model ppl_robust_reg on task hessian: 0.04670915752649307s (var: 1.4299343092716299e-05) Results for model ppl_robust_reg on task hessian using Functorch: 0.02337808534502983s (var: 3.0397418413485866e-06) Results for model ppl_robust_reg on task hessian_fwdrev: 0.024229884147644043s (var: 2.0425247839739313e-06) Results for model ppl_robust_reg on task hessian_fwdrev using Functorch: 0.022021746262907982s (var: 3.512146236062108e-07) Results for model ppl_robust_reg on task hessian_revrev: 0.012355780228972435s (var: 7.090877147675201e-07) Results for model ppl_robust_reg on task hessian_revrev using Functorch: 0.013960313983261585s (var: 6.326549737423193e-07) Results for model ppl_robust_reg on task jacfwd: 
0.008112502284348011s (var: 2.88503088086145e-08) Results for model ppl_robust_reg on task jacfwd using Functorch: 0.008947920985519886s (var: 4.2070990247111695e-08) Results for model ppl_robust_reg on task jacrev: 0.00635871896520257s (var: 1.3403841592207755e-07) Results for model ppl_robust_reg on task jacrev using Functorch: 0.009123563766479492s (var: 2.677554675756255e-07) Results for model wav2letter on task vjp: 0.02078995667397976s (var: 2.1110793113621185e-06) Results for model wav2letter on task vjp using Functorch: 0.019202351570129395s (var: 9.210506135559626e-09) Results for model wav2letter on task vhp: 0.05997290462255478s (var: 8.558587616391833e-09) Results for model wav2letter on task vhp using Functorch: 0.06035261228680611s (var: 1.6448565842708263e-09) Results for model wav2letter on task jvp: 0.04507789760828018s (var: 1.5771547401399744e-09) Results for model wav2letter on task jvp using Functorch: 0.013057494536042213s (var: 3.804750292601966e-09) Results for model deepspeech on task vjp: 0.3648746609687805s (var: 1.5359055396402255e-05) Failed model using Functorch: deepspeech, task: vjp, Error message: Cannot access storage of TensorWrapper Results for model transformer on task vjp: 0.05496881157159805s (var: 1.242562319703211e-08) Results for model transformer on task vjp using Functorch: 0.057835936546325684s (var: 2.6113376350167528e-08) Results for model transformer on task vhp: 0.18313491344451904s (var: 7.226336151688884e-08) Failed model using Functorch: transformer, task: vhp, Error message: bad optional access Results for model transformer on task jvp: 0.13924935460090637s (var: 1.6989159234981344e-07) Failed model using Functorch: transformer, task: jvp, Error message: Trying to use forward AD with embedding that does not support it because it has not been implemented yet. Please file an issue to PyTorch at https://github.com/pytorch/pytorch/issues/new?template=feature-request.yml so that we can prioritize its implementation. 
Results for model multiheadattn on task vjp: 0.0014708995586261153s (var: 3.710916729460223e-08) Results for model multiheadattn on task vjp using Functorch: 0.002404856728389859s (var: 2.1910574687922235e-08) Results for model multiheadattn on task vhp: 0.003382015274837613s (var: 5.3098595742540056e-08) Results for model multiheadattn on task vhp using Functorch: 0.005340623669326305s (var: 5.897558708056749e-08) Results for model multiheadattn on task jvp: 0.0027526854537427425s (var: 3.508620949332908e-08) Results for model multiheadattn on task jvp using Functorch: 0.0022981404326856136s (var: 1.327894807445773e-07) ``` </details> All functorch errors are reported in its repository. cc @zou3519 Pull Request resolved: https://github.com/pytorch/pytorch/pull/75689 Approved by: https://github.com/zou3519	2022-04-22 14:04:26 +00:00
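Note on 6593d293f7: the reported means are easier to compare as ratios. The sketch below copies four (baseline, functorch) mean timings in seconds from the table in the commit message and prints baseline divided by functorch, so values above 1 mean functorch was faster on that run. The choice of rows and the `timings` and `speedups` names are illustrative; variances are ignored.

```python
# (model, task) -> (baseline mean in s, functorch mean in s), copied from the table above.
timings = {
    ("resnet18", "jvp"): (0.18679651618003845, 0.05305683612823486),
    ("fcn_resnet", "jvp"): (2.5421929359436035, 0.7628333568572998),
    ("wav2letter", "jvp"): (0.04507789760828018, 0.013057494536042213),
    ("ppl_simple_reg", "vjp"): (0.0007535457843914628, 0.0016954183811321855),
}

# Ratio > 1: functorch faster; ratio < 1: functorch slower.
speedups = {key: base / fx for key, (base, fx) in timings.items()}

for (model, task), ratio in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{model:15s} {task:4s} {ratio:.2f}x")
```

On these rows the three `jvp` entries come out well above 1, while the small `ppl_simple_reg` `vjp` case falls below 1.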
Mike Iovine	b6a4234090	[SR] Fix broken unit test build (#76111 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76111 https://github.com/pytorch/pytorch/pull/68640 broke our build by porting `cat` structured kernels, not sure how CI didn't catch this ghstack-source-id: 154335722 Test Plan: CI Reviewed By: navahgar, ajyu Differential Revision: D35780296 fbshipit-source-id: 0a262eb06a8d619227e5db10b6a775bf0b2e17c1 (cherry picked from commit aea6fbf9365391011df5211164e3978075d7a5cb)	2022-04-20 18:36:31 +00:00
mikeiovine	98b4a4100d	[SR] Add a copy variant for fused_split_and_squeeze Pull Request resolved: https://github.com/pytorch/pytorch/pull/75660 The outputs of `split_and_squeeze` are passed to `VarStack` in models we care about. `VarStack` has a [fast path](https://www.internalfb.com/code/fbsource/[893193f5277184fd17f4ea3f28fe415a4df37707]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=296-298) for when all of its inputs have the same strides. Hitting the slow path adds a ton of extra overhead - so much that it's worth it to copy in `split_and_squeeze` and force all of `VarStack`'s inputs to be contiguous so we can take advantage of the fast path. Differential Revision: [D35513777](https://our.internmc.facebook.com/intern/diff/D35513777/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35513777/)! Approved by: https://github.com/hlu1	2022-04-13 20:02:01 +00:00
Yulv-git	ac2d2e3a3d	Fix some typos. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/75561 Approved by: https://github.com/albanD	2022-04-11 21:55:59 +00:00
Nikita Shulga	80ea6955af	Add cuda-11.3+clang9 build workflow (take 2) To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic) Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu` Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293 Approved by: https://github.com/atalman, https://github.com/seemethere	2022-04-11 17:13:01 +00:00
PyTorch MergeBot	8fe43d76d5	Revert "Add cuda-11.3+clang9 build workflow" This reverts commit `709fcc862e`. Reverted https://github.com/pytorch/pytorch/pull/75293 on behalf of https://github.com/janeyx99	2022-04-11 15:24:59 +00:00
Nikita Shulga	709fcc862e	Add cuda-11.3+clang9 build workflow To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic) Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu` Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293 Approved by: https://github.com/atalman, https://github.com/seemethere	2022-04-11 14:10:57 +00:00
Mike Iovine	2f98fa9147	[SR] Do not manage tensors that escape scope via container (#74966 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966 It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this: ``` def forward(self, cond: bool, a, b): lst = [] if cond: res = a + b # res should not be managed!!! lst.append(res) return lst ``` The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`. The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`. This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy. ghstack-source-id: 153288188 Test Plan: New unit tests cover the problematic case Reviewed By: navahgar Differential Revision: D35257087 fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7 (cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)	2022-04-07 19:57:57 +00:00
Mike Iovine	4055d1f653	[SR] Fix StaticRuntime move ctor (#74927 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927 The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated. Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue. ghstack-source-id: 153168602 Test Plan: New unit test/CI Reviewed By: navahgar Differential Revision: D35228467 fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3 (cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)	2022-04-07 02:16:37 +00:00
Don Jang	85e163c56b	[Static Runtime] Fix a bug that `aten::full_like` reuses a tensor that does not match arguments (#74255 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255 This change fixes a bug that `aten::full_like` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full_like` are dynamically changed. Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path. Reviewed By: mikeiovine Differential Revision: D34863639 fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608 (cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)	2022-04-05 22:30:41 +00:00
Raghavan Raman	60bda4d06b	[Static Runtime] Fix handling relu in quantized linear relu dynamic op Summary: The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter. At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear if that semantics is being followed in this code. So, we are handling this in SR's out-variant implementation, until the owner fixes that issue. This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model. Test Plan: ``` buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16 ``` Reviewed By: mikeiovine Differential Revision: D35366309 fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9 (cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)	2022-04-04 22:16:22 +00:00
Max Podkorytov	11c412a8ec	[static-runtime] optimize empty if blocks at runtime (#74987 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74987 Add specializations to `prim::If` operator at runtime to save resources when some of subblocks are empty Test Plan: `buck build //caffe2:torch-cpp-cpu` `buck test //caffe2/benchmarks/static_runtime/...` Add unit test: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.EmptyIfBlock` Reviewed By: mikeiovine Differential Revision: D35262952 fbshipit-source-id: 324f88471f33f035f4d8a9b212716530d8e59df2 (cherry picked from commit 2db1b1a6833b1376fa376f54791effc8e12fb77f)	2022-04-01 05:43:33 +00:00
Norman Ponte	2e8b9c7785	[TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035 - modify `--ai_pep_format` to `--report_aibench` to better reflect underlying framework name change Reviewed By: tgalkovskyi Differential Revision: D35257017 fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f (cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275)	2022-04-01 00:35:42 +00:00
jjsjann123	873ced7cd0	Nvfuser code bump 030122 (#73627 ) Summary: Things changed in this PR that require review: test/forward_backward_compatibility/check_forward_backward_compatibility.py Our previous function overload extension names were wrong and have been updated in this PR, hence the updated compatibility list. nvfuser code updates with bug fixes towards failures we encountered in OpInfoTests as well as failures reported by AOTAutograd team. Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627 Reviewed By: Chillee Differential Revision: D34765458 Pulled By: davidberard98 fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7 (cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)	2022-03-31 08:18:22 +00:00
Mike Iovine	2ca66ffb7d	[SR] Force split_and_squeeze usage via graph transformation (#74274 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74274 Reviewed By: navahgar Differential Revision: D34913889 fbshipit-source-id: 655d3f1e5f4c027cb94758b74826a4b4882e9458 (cherry picked from commit bc94d30b69888ca6633a27090a3b87a08919231a)	2022-03-29 19:13:40 +00:00
Elias Ellison	6694fdaccd	Clean up profiling mode and profiling executor strategy (#73875 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73875 Previously we had a few settings: - getExecutor - which toggled between Profiling Executor and Legacy - getGraphOptimize - if true, overrides PE/Legacy to run with simple executor (no optimizations) and then... - getProfilingMode - which would set PE to 0 specializations. The last mode is redundant with getGraphOptimize, we should just remove it and use getGraphOptimize in these cases. It would lead to potentially invalid combinations of logic - what does it mean if getProfilingMode is true but getExecutor is set to false? This would lead to a bug in specialize_autograd_zero in this case, see: https://github.com/pytorch/pytorch/blob/master/torch%2Fcsrc%2Fjit%2Fpasses%2Fspecialize_autogradzero.cpp#L93. The tests here are failing but get fixed with the PR above it, so I'll squash for landing. Test Plan: Imported from OSS Reviewed By: cpuhrsch Differential Revision: D34938130 Pulled By: eellison fbshipit-source-id: 1a9c0ae7f6d1cfddc2ed3499a5af611053ae5e1b (cherry picked from commit cf69ce3d155ba7d334022c42fb2cee54bb088c23)	2022-03-29 18:38:51 +00:00
Mike Iovine	3f37337ed0	[SR] Native implementation for reshape_as (#74585 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74585 Native static runtime for `aten::reshape_as` ghstack-source-id: 152340038 Test Plan: New unit test Reviewed By: hlu1 Differential Revision: D35060895 fbshipit-source-id: c4e6f8a04c7df3821c7e654bfaf584e5a72ea701 (cherry picked from commit 6fa596cd866a024b6653239e0e30ddad42de242f)	2022-03-28 17:02:14 +00:00
Mike Iovine	9f2344aa40	[SR] Native implementation for select (#74568 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74568 Native static runtime implementation for `aten::select(Tensor, int, int)` overload ghstack-source-id: 152340037 Test Plan: New unit test Reviewed By: hlu1 Differential Revision: D35053900 fbshipit-source-id: c315d4202a4dfca3360325547af805aea33ecc9f (cherry picked from commit 8683f214dbd8c081365bad727007bbff969b64d0)	2022-03-28 17:02:14 +00:00
Mike Iovine	facdbe6d72	[SR] Native implementation for IntImplicit (#74562 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74562 Add a native implementation for `aten::IntImplicit`, which is similar to `aten::Int` except for a few extra checks it must do ghstack-source-id: 152340039 Test Plan: New unit tests Reviewed By: hlu1 Differential Revision: D35052997 fbshipit-source-id: cb2f0faf7c62382e3f13750d8e1280c49c6b9e42 (cherry picked from commit 359c7493f8deaeccebc27e1b6e6e9777850010c1)	2022-03-28 17:02:14 +00:00
Mike Iovine	f5a9c36d0b	[SR] Eliminate extra permute ops before `aten::sum` (#74481 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481 This diff fixes an interesting performance issue related to `permute_copy`. We see this pattern frequently: ``` y = torch.permute(x, (0, 2, 1)) z = torch.sum(y, dim=-1) ``` With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589 But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597 But the permute is actually unnecessary, we can just statically turn the graph into this to ensure that the fast kernel is hit with copy variants on: ``` z = torch.sum(x, dim=1) ``` ghstack-source-id: 152003888 Reviewed By: navahgar Differential Revision: D34992319 fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f (cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)	2022-03-23 23:00:14 +00:00
Don Jang	6294a2eb7f	[Static Runtime] Add out variant wrapper for aten::index_select (#74321 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74321 This change adds out variant wrapper for aten::index_select. Test Plan: Added a unittest Reviewed By: mikeiovine Differential Revision: D34928012 fbshipit-source-id: d808363d740d79fa25abee4dd33920fbb6ec7283 (cherry picked from commit ba9b3c0cd4ba240c4a2174f3376580a1880b2b4a)	2022-03-16 23:43:21 +00:00
Mike Iovine	f14a0be302	[SR] Avoid allocating rstd/mean in layer_norm (#73606 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606 The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely. ghstack-source-id: 151394116 Test Plan: Existing unit tests Reviewed By: hlu1 Differential Revision: D34562131 fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de (cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)	2022-03-15 22:07:11 +00:00
Don Jang	381c0c080f	[Static Runtime] Fix a bug that `aten::full` reuses a tensor that does not match requested one (#73990 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990 This change fixes a bug that `aten::full` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full` are dynamically changed. This fix is applied to multiple other out variant wrappers added to Static Runtime, and their fixes are following. Test Plan: - Added a unittest. Reviewed By: mikeiovine Differential Revision: D34768718 fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9 (cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)	2022-03-15 16:13:52 +00:00
Don Jang	1b80f609b0	[Static Runtime] Add out variant wrapper for aten::ones_like (#73945 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945 This change adds an out variant wrapper for aten::ones_like. Test Plan: - Added a unittest. - Checked that the op execution got switched to its added out variant (P485330978). Reviewed By: hlu1 Differential Revision: D34727057 fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5 (cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)	2022-03-14 20:29:58 +00:00
Don Jang	60f22a40ef	[Static Runtime] Add out variant wrapper for aten::zeros (#73946 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73946 This change adds an out variant wrapper for aten::zeros. Test Plan: - Added a unittest. - Confirmed that the added out variant gets executed by the unittest (P485324923). Reviewed By: mikeiovine Differential Revision: D34725843 fbshipit-source-id: 3ac02ba1914c4a51969381e610d4243df65071ed (cherry picked from commit 368836d51709b7f96c79114984a95606b29766b1)	2022-03-11 00:52:30 +00:00
Don Jang	87564a1bd7	[Static Runtime] Add native op support for `aten::len` (#73899 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73899 This change adds native op wrappers to Static Runtime as appears in JIT (https://www.internalfb.com/code/fbsource/[429d233b9beb5e6f60df7304b792e2ff332f6ecd]/fbcode/caffe2/torch/csrc/jit/runtime/register_prim_ops.cpp?lines=613 , search for "aten::len" in that file). Test Plan: Added unittests, "StaticRuntime.LenWith*", and confirmed they are passing with `V0307 17:39:39.817956 3516654 impl.cpp:1792] Switch to native impl for node: %2 : int = aten::len(%input.1)` per added unittest: P485159811 Reviewed By: mikeiovine Differential Revision: D34705231 fbshipit-source-id: 916b1f8bdbc92def07bc3f98ce1db22f0f5ce206 (cherry picked from commit 66d2bb9a0a294b55e1bc87ae33f5553b1460e74b)	2022-03-10 02:57:51 +00:00
Mike Iovine	97b20b9b50	[SR][easy] Stack/concat out variants do not segfault on empty inputs (#73704 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73704 Empty inputs are invalid for these ops. But while looking for optimizations, I noticed that these ops just segfault when that happens, which is not helpful for users. Added a check/error message. ghstack-source-id: 150812721 Test Plan: New unit tests Reviewed By: hlu1 Differential Revision: D34596954 fbshipit-source-id: 6b22a3a255273920210dcd41f54a9d238bbbcc14 (cherry picked from commit 9e950bfffef36c320638662bdb72f19eb805a228)	2022-03-09 00:55:57 +00:00
Sergii Dymchenko	5b011fc6eb	Fix Undefined variable in QInterpolateBenchmark Pull Request resolved: https://github.com/pytorch/pytorch/pull/73130 Approved by: https://github.com/malfet	2022-03-09 00:14:15 +00:00
Don Jang	71961d37bb	[Static Runtime] Add out variant wrapper for aten::ones (#73851 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73851 This change adds an out variant wrapper for aten::ones Test Plan: Added a unittest Reviewed By: mikeiovine Differential Revision: D34557095 fbshipit-source-id: 0d2ac8d0ad6f73067e28c2cebd3b4a018a9b17ae (cherry picked from commit cc1dda957b8c3acd71de3aa6054c11a9aab5cfa6)	2022-03-07 20:33:22 +00:00
Mike Iovine	818bf361b6	[SR] Fix a kwargs API default value bug (#73681 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681 Static runtime is rejecting legal calls made with the kwargs API when there are parameters with default values. ghstack-source-id: 150433627 Test Plan: Added unit test to cover this case Reviewed By: navahgar, d1jang Differential Revision: D34588804 fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755 (cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)	2022-03-03 22:31:37 +00:00
Don Jang	bbc59ff2bf	[Static Runtime] Introduce StaticNodeInfo to store ProcessedNode's data independent from runtime instances (#73536 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536 Currently the `ProcessedNode` class assumes 2 distinct roles that are not too obvious: 1) a "template" that contains the metadata of an actual node executable by the runtime, owned by `StaticModule` 2) fully instantiated ones that are owned by `StaticRuntime`. We currently merge these two use cases into one class, which can be error-prone if illegal copying happens uncontrollably. Currently, we only copy objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created. To address this issue, this change introduces `StaticNodeInfo`, a separate class, to distinguish the aforementioned two use cases in the code more clearly. With this, `StaticNodeInfo` is for (1) and `ProcessedNode` is now for (2). Test Plan: Existing tests Reviewed By: mikeiovine Differential Revision: D33985600 fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c (cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)	2022-03-02 22:33:32 +00:00
Don Jang	539acb29cd	[Static Runtime] Fix a broken test & Add an out variant wrapper for `mse_loss` (#73574 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73574 T113070663 identified a test breakage in `StaticRuntime.autogen__convert_indices_from_csr_to_coo` using `mode/opt`. This change fixes it by using a correct test value. By generating the out variants for Static Runtime, this change also includes an out variant wrapper for `mse_loss`. Generating out variants for Static Runtime ``` [djang@devbig024.ftw2 ~/fbsource/fbcode/caffe2] buck run //caffe2/torch/fb/jit:gen_static_runtime_ops Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds. Changed value //fbcode.sanitizer='address-undefined-dev' (was 'thread') ... and 13 more. See logs for all changes Parsing buck files: finished in 0.8 sec Downloaded 14/25 artifacts, 159.45 Kbytes, 30.0% cache miss (for updated rules) Building: finished in 1.6 sec (100%) 52/52 jobs, 35/52 updated Total time: 2.5 sec BUILD SUCCEEDED total grouped native ops: 1501 structured grouped native ops: 540 generated grouped native ops: 137 ``` Test Plan: Ran the broken test in `mode/opt` and confirmed that it passes now. ``` [djang@devbig024.ftw2 ~/fbsource/fbcode/caffe2] buck test mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --exact 'caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.autogen__convert_indices_from_csr_to_coo' --run-disabled Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds. Changed value //project.buck_out='buck-out/opt' (was 'buck-out/dev') ... and 307 more. 
See logs for all changes DEBUG: /data/users/djang/fbsource/tools/build_defs/fbcode_macros/build_defs/lib/cpp_common.bzl:287:14: Using disallowed linker flag 'arvr/third-party/toolchains/platform009/build/mesa/lib/libGL.so' in library rule 'fbsource//third-party/toolchains:opengl' DEBUG: /data/users/djang/fbsource/tools/build_defs/fbcode_macros/build_defs/lib/cpp_common.bzl:287:14: Using disallowed linker flag 'arvr/third-party/freeglut/3.0.0/libs/x64-linux/libglut.a' in library rule 'fbsource//third-party/toolchains:GLUT' I0301 08:28:08.884272 2239319 configeratorc.cpp:70] Attempting to get config buck/detectors/bypass_dirty_builds, timeout=10000 I0301 08:30:14.751745 2261718 configeratorc.cpp:70] Attempting to get config buck/detectors/bypass_dirty_builds, timeout=10000 Parsing buck files: finished in 10.1 sec Creating action graph: finished in 6.1 sec [RE] Metadata: Session ID=[https://fburl.com/b/reSessionID-fa0ba93b-33a1-4e6f-88f8-9f508d2c27c3] [RE] Waiting on 0 remote actions. Completed 247 actions remotely, action cache hit rate: 0.00%. Downloaded 13000/17457 artifacts, 463.99 Mbytes, 2.6% cache miss (for updated rules) Building: finished in 04:16.6 min (100%) 28628/28628 jobs, 28628/28628 updated Total time: 04:32.9 min More details at https://www.internalfb.com/intern/buck/build/c774ff43-5311-49ce-a677-30e3f6afdad1 BUILD SUCCEEDED Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details. 
Running with tpx session id: 16d9b24c-4a63-4671-84b5-690fac0ee086 Trace available for this run at /tmp/tpx-20220301-083049.472831-16d9b24c-4a63-4671-84b5-690fac0ee086/trace.log RemoteExecution session id: reSessionID-16d9b24c-4a63-4671-84b5-690fac0ee086-tpx Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4503599719295685 ✓ ListingSuccess: caffe2/benchmarks/static_runtime:static_runtime_cpptest : 285 tests discovered (0.425) ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.autogen__convert_indices_from_csr_to_coo (0.105) Summary Pass: 1 ListingSuccess: 1 If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4503599719295685 ``` Reviewed By: mikeiovine Differential Revision: D34552645 fbshipit-source-id: 36f15b0f29edcb7deb71ba8a6f66ce2532bf7c82 (cherry picked from commit 2329afd8bfc89671cfbd864414e528241e7045fc)	2022-03-02 04:36:31 +00:00
Raghavan Raman	cfd92f2d59	[Static Runtime] Add test that runs NNC fused kernels in parallel (#73256 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73256 This adds a test that executes multiple Static Runtime instances in parallel when each instance includes a fusion. ghstack-source-id: 149787403 Test Plan: ``` buck run mode/dev-asan //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=CpuFusion.ParallelRuntimes ``` The above test results in an error: P482317015 (when parts of the fix in D34287960 (`6d33852685`) are backed out) Reviewed By: mikeiovine Differential Revision: D34404127 fbshipit-source-id: 95a267e27d74584df90841fe496f909171136981 (cherry picked from commit 57d3ad9a46a24559f6d4f4097bd1b8e0b1f6b077)	2022-02-28 17:44:45 +00:00
Don Jang	fe7e1bd1ce	[Static Runtime] Add auto-generated out variant dispatchers (#72603 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72603 This change adds out variant dispatchers generated by the previous diff. The number of the out variant dispatchers generated by this diff is 133, which increases the out variant coverage by 309% (current: 43, this diff: 133 + 43 = 176). This number is expected to increase a lot as we develop this script further to cover more ops. Test Plan: Unittest Confirmed ``` buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest ``` is passing. Reviewed By: swolchok Differential Revision: D33373928 fbshipit-source-id: 4d94d788282f3f313bb36f2f9452edecd9862246 (cherry picked from commit e4ce8b386d1fcc47b86cb9c9016a70e7a31b452c)	2022-02-28 08:39:10 +00:00
Mike Iovine	d398d4d32c	[SR] Disable aten::where out variant (#73367 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73367 The op is currently bugged w.r.t. a `condition` input that is not the same shape as the others: ``` def forward(self, cond_1d, x, y): shape = [-1] + [1] * (x.dim() - 1) cond = cond_1d.view(shape) return torch.where(cond, x, y).clone() Condition: 01 00 [ CPUBoolType{2} ] A: 06 -9 08 -8 [ CPULongType{2,2} ] B: -4 05 -5 -2 [ CPULongType{2,2} ] Actual: 06 05 -5 -2 [ CPULongType{2,2} ] Expected: 06 -9 -5 -2 [ CPULongType{2,2} ] ``` ghstack-source-id: 149963254 Test Plan: Unit tests exercise broadcasting Reviewed By: d1jang Differential Revision: D34454770 fbshipit-source-id: 6ad4c4ca6893d2b87852a17d437437d99ca94ab4 (cherry picked from commit 7135bc40e9fd930c08f5291b7d6b4902ec30005b)	2022-02-26 01:08:45 +00:00
Raghavan Raman	4838c6dca0	[Static Runtime] Enable all tests to run with TensorExpr fuser (#73263 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73263 ghstack-source-id: 149784887 Test Plan: ``` buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest ``` Reviewed By: d1jang Differential Revision: D34405943 fbshipit-source-id: 63e345be4bf7a57bf4e1446074e5112d4ed68515 (cherry picked from commit 69a28a6b4ea53bcd88a51c4c36d5205577d84da3)	2022-02-24 00:34:34 +00:00
Raghavan Raman	02afdd54b9	[Static Runtime] Handle fallback graphs that are generated as part of the TE Fuser (#72945 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72945 ghstack-source-id: 149429754 Test Plan: ``` buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=CpuFusion.FallbackGraph ``` Reviewed By: mikeiovine Differential Revision: D34283840 fbshipit-source-id: 868bd340a50fe691797164524f2400d07998d304 (cherry picked from commit `80f60f2cc0`)	2022-02-18 18:34:50 +00:00
Mike Iovine	d1c5f9e439	[JIT][SR] Introduce prim::IfThenElse (#72587 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587 This pattern frequently appears in a few graphs: ``` %result = prim::If(%condition) block0(): -> (%a) block1(): -> (%b) ``` This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation. This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator: ``` %result = prim::IfThenElse(%condition, %a, %b) ``` Test Plan: New unit tests Reviewed By: eellison Differential Revision: D34091789 fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c (cherry picked from commit `0f1b335e5b`)	2022-02-17 18:22:48 +00:00
Sergii Dymchenko	486572223b	Fix command example (#72847 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72847 Reviewed By: malfet Differential Revision: D34260868 Pulled By: kit1980 fbshipit-source-id: 1b225f3c2c7a822e44df4bbd91766e6533eab6d7 (cherry picked from commit `c9e874c4d8`)	2022-02-16 21:45:45 +00:00
Mike Iovine	d2c0c0b638	[SR] Apply all graph passes to sub-blocks (#72598 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72598 Apply all optimizations to sub-blocks by replacing loops over `graph->nodes()` with loops over nodes in `DepthFirstGraphNodeIterator` ghstack-source-id: 149155700 Test Plan: Existing unit tests Reviewed By: d1jang Differential Revision: D34111430 fbshipit-source-id: 015076030368bb67df24ed5892475534b8f8f272 (cherry picked from commit `a4314520de`)	2022-02-15 20:19:42 +00:00
jiej	2d110d514f	Nvfuser code bump 2_1_2022 (#72127 ) Summary: Things changed in this PR that require review: 1. aten/src/ATen/core/interned_strings.h 2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation 3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry 4. torch/jit/_script.py : throws when a scripted model sees autocast as a decorator, since it's not supported nvfuser code update: 1. codegen improvements and performance tuning 2. integration bug fixes for shape expression logic 3. kernel segmentation update to address perf regression from horizontal fusion 4. scalar cpu tensor promotion to support inter-device operation between cpu scalar tensor and cuda tensor Things reverted from local changes: aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439) Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127 Reviewed By: HamidShojanazeri Differential Revision: D34113233 Pulled By: jbschlosser fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74 (cherry picked from commit `e009bc5c4e`)	2022-02-15 00:43:16 +00:00
Mikhail Zolotukhin	1855b14922	[TensorExpr] Delet `DimArg` class. (#72390 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72390 This class didn't add much value and only caused more boilerplate code. This change removes the class and updates all the use cases with uses of `ExprHandle`. A side effect of this change is different names in loop variables, which caused massive mechanical changes in our tests. Test Plan: Imported from OSS Reviewed By: navahgar Differential Revision: D34030296 Pulled By: ZolotukhinM fbshipit-source-id: 2ba4e313506a43ab129a10d99e72b638b7d40108 (cherry picked from commit `c2ec46a058`)	2022-02-11 01:21:59 +00:00
Mike Iovine	c975b928ab	[SR][easy] CPU fuser uses native control flow (#72544 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72544 Now that static runtime supports control flow, there's no need to fall back to the JIT. We get better performance with the native control flow since we avoid heap allocation/ref count bumps during stack construction. I've left the old `prim::TensorExprDynamicGroup` around in case we need to support it in the future. I've also added native support for a few scalar ops that are used inside the control flow sub-blocks. ghstack-source-id: 148825816 Test Plan: New unit tests Reviewed By: d1jang Differential Revision: D34083080 fbshipit-source-id: a7ffc0fda39ab3df3ba47e44a03d857131dc1e50 (cherry picked from commit `2ef39e0e54`)	2022-02-10 18:40:39 +00:00
Don Jang	84729cef70	[Static Runtime] Fix a bug in aten::slice to honor optional arguments (#72530 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72530 This bug was revealed from a failed attempt to run a feed/story model. Test Plan: - This fix was tested to successfully run the failed model: P479037453 - Added a unittest Reviewed By: mikeiovine Differential Revision: D34055801 fbshipit-source-id: 4a3e06bbb3b9fa78b0514c9c67aa4a0b79f46a8d (cherry picked from commit `bfa2bfb81c`)	2022-02-09 17:05:45 +00:00
Mike Iovine	6c0521b919	[SR] Add native implementations for converted prim ops (#71474 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71474 The PyTorch edge team is working on promoting some prim ops to interpreter instructions (see D33398092). Since the JIT fallback ops will be unavailable soon, we need to implement these ops in static runtime. Ops not included in this diff: * `aten::__is__` and `aten::__isnot__`: disabled in static runtime for unrelated reasons * `prim::NumToTensor` and `aten::__get__.Dict` already exist ghstack-source-id: 148641179 Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: d1jang Differential Revision: D33657816 fbshipit-source-id: 6d15244ae1024a56d3b25e51a433fa104ce8ee5e (cherry picked from commit `33f8f861ff`)	2022-02-08 23:25:34 +00:00
Raghavan Raman	4eb277ac61	[bench] Adding a cpp benchmark to compare performance of nnc with static and symbolic shapes (#72197 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72197 Test Plan: Imported from OSS Reviewed By: huiguoo Differential Revision: D33951742 Pulled By: navahgar fbshipit-source-id: 0412d61da158e98429f377469e1c331587390b14 (cherry picked from commit `c043fdfc79`)	2022-02-07 07:01:19 +00:00
Raghavan Raman	237e960ec9	[bench] Fix build issues with TensorExpr cpp benchmarks (#72196 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72196 Test Plan: Imported from OSS Reviewed By: dagitses Differential Revision: D33951743 Pulled By: navahgar fbshipit-source-id: f1b36bb3ba9cd649f0dbf0911f5a9e4791089e65 (cherry picked from commit `fbe5cadb5f`)	2022-02-07 07:01:19 +00:00
Raghavan Raman	38f696c0cd	[nnc] Add a API to unroll loops by a given factor (#72071 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72071 Reviewed By: ngimel Differential Revision: D33946250 Pulled By: navahgar fbshipit-source-id: 3f3f92054174620025a9d71154d006f1738953e2 (cherry picked from commit `d8b53598e9`)	2022-02-03 18:41:21 +00:00
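The effect of unrolling a loop by a given factor can be illustrated outside NNC. The following is a minimal Python sketch, not the NNC API: `unrolled_sum` is a hypothetical helper whose main loop covers `factor` elements per trip, with a tail loop for the leftover iterations.

```python
def unrolled_sum(values, factor):
    """Sum `values` with the loop body repeated `factor` times per trip.

    Hypothetical illustration only; it does not call into NNC.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    total = 0
    n = len(values)
    main_end = n - (n % factor)  # largest multiple of `factor` that is <= n
    # Main loop: each trip covers `factor` consecutive elements.
    for base in range(0, main_end, factor):
        for offset in range(factor):  # stands in for `factor` copies of the body
            total += values[base + offset]
    # Tail loop: the n % factor iterations that do not fill a whole trip.
    for i in range(main_end, n):
        total += values[i]
    return total
```

When the trip count is not a multiple of the factor, the tail loop is what keeps the result identical to the original loop.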
Mike Iovine	cff5e22a72	[SR] Relax aten::__is__ constraint for SR enablement (#71807 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807 There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in ``` def forward(x): y = x.detach() # Should be false, but we get True # after our EliminateNoOps pass return x is y ``` Test Plan: New unit test covers this case Reviewed By: d1jang Differential Revision: D33783668 fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4 (cherry picked from commit `019588f4ca`)	2022-02-03 12:18:46 +00:00
Mike Iovine	2d5296b0e7	[SR] Implement prim::Loop (#69838 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838 Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction. ghstack-source-id: 148186483 Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime/...` Reviewed By: d1jang Differential Revision: D33049595 fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532 (cherry picked from commit `8baf1753af`)	2022-02-02 19:30:50 +00:00
Mike Iovine	2aa699505d	[SR] Implement prim::If (#69837 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837 Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction. ghstack-source-id: 148186475 Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime/...` Accuracy test at top of stack Reviewed By: d1jang Differential Revision: D33045908 fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55 (cherry picked from commit `de3b12dc08`)	2022-02-02 19:30:50 +00:00
Mike Iovine	d2599701fd	[SR] Force sub-blocks to return at least one output (#69836 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836 It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned. Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`. ghstack-source-id: 148186453 Test Plan: Sub-blocks with no return values tested at top of stack Reviewed By: d1jang Differential Revision: D33050420 fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149 (cherry picked from commit `c97fd07bf5`)	2022-02-02 19:30:50 +00:00
Mike Iovine	238dded10f	[SR] Graph pass to create owned refs of special IValues (#69835 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835 `StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`. Note that this also applies to returned values in sub-blocks that are from outer scopes. ghstack-source-id: 148186452 Test Plan: `buck test caffe2/benchmarks/static_runtime/...` Added a new unit test with a graph that simply returns a constant. Tests with sub-blocks at top of stack. Reviewed By: d1jang Differential Revision: D33047519 fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b (cherry picked from commit `a8f89a12ee`)	2022-02-02 19:30:50 +00:00
Mike Iovine	4b789df68b	[SR] Add BlockRunner and handle sub-blocks (#69834 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834 * Modify the `StaticModule` constructor to handle index initialization for sub-blocks. * Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes! ghstack-source-id: 148086245 Test Plan: `buck test caffe2/benchmarks/static_runtime/...` Reviewed By: d1jang Differential Revision: D33028039 fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6 (cherry picked from commit `3a9feffd92`)	2022-02-01 17:20:55 +00:00
Mike Iovine	7e6312a5df	[SR] Reverse iteration order in resetMemory (#71705 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71705 This fixes a crash in `resetMemory` caused by trying to access a `TensorImpl` via a borrowed `IValue` after it had already been destroyed. We need to clean up all borrows before we destroy the owning `IValue`, not after. ghstack-source-id: 147688982 Test Plan: New unit test covers this case ICE w/ inline_cvr v0 [finishes successfully](https://www.internalfb.com/intern/unidash/dashboard/ads_infra_cost_estimation/a_metrics/?e[select_ESTIMATION_RUN_ID]=ICE_mikeiovine_16431103211c65), didn't see any nnpi errors Reviewed By: ajyu Differential Revision: D33725435 fbshipit-source-id: f8dd109382b5cf54df6f194f8dcb5c0812b174bb (cherry picked from commit `31339d9d38`)	2022-01-26 17:35:03 +00:00
Scott Wolchok	3a77fb244b	[PyTorch][Static Runtime] Delete cleanup_activations option (#71501 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501 This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it. ghstack-source-id: 147385569 Test Plan: CI, rerun broken test from task Reviewed By: mikeiovine Differential Revision: D33669290 fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a (cherry picked from commit `5e4fefb109`)	2022-01-21 18:15:43 +00:00
Mike Iovine	ffdc0e23af	[SR] Add various missing native ops (#71113 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71113 This diff adds a variety of missing ~~out variants~~/native ops. Most of these are trivial, so I included them all in one diff. Native ops * `aten::mul` (list variant) * `aten::sub` (int variant) * `aten::add` (list variant) * `aten::Int` Out variants * ~~`aten::gt`~~ (codegen will handle) * ~~`aten::eq`~~ (codegen will handle) ghstack-source-id: 146927552 Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D33510756 fbshipit-source-id: df385958b9561955b2e866dab2e4c050abd26766	2022-01-12 18:40:31 -08:00
Scott Wolchok	10b40acbdb	[PyTorch][Static Runtime] Fast aliasing in select_tensor by manual borrowing (#68122 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122 See code comments for details; in brief, we repurpose support for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor` output a borrowed IValue that we have to clean up manually. If we have any other ops that always create a new reference to an existing Tensor, we can easily apply this same optimization. ghstack-source-id: 146482212 Test Plan: See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421 (local is neutral: P467267554) --do_profile output for local_ro (updated Dec 10): ``` swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt First iter time: 0.989023 ms Number of operators: 2037 Total number of managed tensors: 1597 Total number of managed output tensors: 0 Total number of unmanaged values: 2568 Number of unmanaged values requiring cleanup: 2568 Number of unmanaged values not requiring cleanup: 0 Total memory managed: 50368 bytes Total number of reused tensors: 1010 Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%) swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt First iter time: 0.994703 ms Number of operators: 2551 Total number of managed tensors: 1146 Total number of managed output tensors: 0 Total number of unmanaged values: 4047 Number of unmanaged values requiring cleanup: 3533 Number of unmanaged values not requiring cleanup: 514 Total memory managed: 50048 bytes Total number of reused tensors: 559 Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%) ``` for local: (also Dec 10): ``` ==> Stable.local.profile.txt <== First iter time: 9.0909 ms Number of operators: 1766 Total number of managed tensors: 1894 Total number of managed output tensors: 0 Total number of unmanaged values: 2014 Number of 
unmanaged values requiring cleanup: 2014 Number of unmanaged values not requiring cleanup: 0 Total memory managed: 4541440 bytes Total number of reused tensors: 847 Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%) ==> TMCOFastAliasing.local.profile.txt <== First iter time: 7.5512 ms Number of operators: 2378 Total number of managed tensors: 1629 Total number of managed output tensors: 0 Total number of unmanaged values: 3503 Number of unmanaged values requiring cleanup: 2891 Number of unmanaged values not requiring cleanup: 612 Total memory managed: 3949312 bytes Total number of reused tensors: 586 Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%) ``` Reviewed By: hlu1 Differential Revision: D32318674 fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783	2022-01-04 22:36:13 -08:00
Scott Wolchok	4d8fc8693c	[PyTorch][Static Runtime] Support memory planning for torch.to() w/o requiring copying (#67223 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223 ghstack-source-id: 146482215 Test Plan: See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421 (local is neutral: P467267554) Reviewed By: hlu1 Differential Revision: D31776259 fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3	2022-01-04 22:36:10 -08:00
Scott Wolchok	99a10c371f	[PyTorch][Static Runtime] Fix dtype changing between iterations for to() (#67394 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67394 ghstack-source-id: 146464294 Test Plan: Added new test, which failed but now passes. Checked perf on ctr_mobile_feed local net (still not on recordio inputs yet), looks neutral ``` Stable, local ======================================== I1027 13:40:23.411118 2156917 PyTorchPredictorBenchLib.cpp:131] PyTorch predictor: number of prediction threads 1 I1027 13:40:48.708222 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16975. Iters per second: 162.081 I1027 13:41:13.915948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.1487. Iters per second: 162.636 I1027 13:41:38.984462 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11408. Iters per second: 163.557 I1027 13:42:04.138948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.13566. Iters per second: 162.982 I1027 13:42:29.342630 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.14269. Iters per second: 162.795 I1027 13:42:29.342669 2156917 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14218, standard deviation: 0.0202164 0 FixToDtypeChanges, local ======================================== I1027 13:44:59.632668 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11023. Iters per second: 163.66 I1027 13:45:24.894635 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16308. Iters per second: 162.257 I1027 13:45:50.275280 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.17868. Iters per second: 161.847 I1027 13:46:15.637431 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. 
Milliseconds per iter: 6.18688. Iters per second: 161.632 I1027 13:46:40.670816 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.10549. Iters per second: 163.787 I1027 13:46:40.670863 2176333 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14887, standard deviation: 0.03843706 ``` Reviewed By: hlu1 Differential Revision: D31972722 fbshipit-source-id: 7a445b325a29020b31dd2bd61e4171ecc2793b15	2022-01-04 22:34:49 -08:00
Peter Bell	fa09099ba3	Codegen: TraceType only includes operators being registered (#68691 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691 TraceType is a sharded file, so by only including specific operator headers, we ensure that changing one (non-method) operator only needs one shard to be re-compiled. This also changes all the included autograd and jit headers from including `ATen/ATen.h` to just including `ATen/core/Tensor.h`. Test Plan: Imported from OSS Reviewed By: gchanan Differential Revision: D33336948 Pulled By: albanD fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113	2022-01-02 13:09:19 -08:00
Mike Iovine	6a84449290	[SR] Fast path for VarStack on scalars (#70210 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70210 Add a fast-path for `VarStack` nodes for when the inputs are scalars. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarStack` Reviewed By: hlu1 Differential Revision: D33177498 fbshipit-source-id: 922ab76a6808fbfdb8eb6091163a380344e38de6	2021-12-23 10:31:17 -08:00
soulitzer	21c6de9fdc	Extend autograd functional benchmarking to run vectorized tasks (#67045 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67045 To run: `python benchmarks/functional_autograd_benchmark/functional_autograd_benchmark.py --gpu -1 --model-filter=ppl_robust_reg --num-iter 100` ``` Results for model ppl_robust_reg on task vjp: 0.0012262486852705479s (var: 2.2107682351446556e-10) Results for model ppl_robust_reg on task vhp: 0.002099371049553156s (var: 6.906406557760647e-10) Results for model ppl_robust_reg on task jvp: 0.001860950025729835s (var: 1.1251884146634694e-10) Results for model ppl_robust_reg on task hvp: 0.003481731517240405s (var: 2.2713633751614282e-10) Results for model ppl_robust_reg on task jacobian: 0.0012128615053370595s (var: 1.3687526667638394e-09) Results for model ppl_robust_reg on task hessian: 0.009885427542030811s (var: 9.366265096844018e-09) Results for model ppl_robust_reg on task hessian_fwdrev: 0.005268776323646307s (var: 2.4293791422991262e-09) Results for model ppl_robust_reg on task hessian_revrev: 0.002561321249231696s (var: 7.557877101938004e-10) Results for model ppl_robust_reg on task jacfwd: 0.002619938924908638s (var: 5.109343503839625e-10) Results for model ppl_robust_reg on task jacrev: 0.0013469004770740867s (var: 3.1857563254078514e-09) ``` Notes: - We go through batched fallback for both - ppl_robust_reg takes 3 tensor inputs and returns a single scalar output - this means that jacobian is equivalent to doing vjp and vmap would not help us - we expect jacfwd to be slower than jacrev Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D33265947 Pulled By: soulitzer fbshipit-source-id: 14f537a1376dea7e5afbe0c8e97f94731479b018	2021-12-21 17:20:29 -08:00
Raghavan Raman	91da2d5fa1	[StaticRuntime] Refactor StaticModule to pass in sample inputs (#69473 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69473 This diff refactors StaticModule and its uses to pass in sample inputs. These inputs need to be passed into the constructor because they are needed to perform TensorExpr fusion before other optimizations are performed on the input graph. ghstack-source-id: 146059041 Test Plan: buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test Reviewed By: donaldong Differential Revision: D32320084 fbshipit-source-id: b8bd46d442be4cc90ca60f521e0416fdb88eea60	2021-12-21 11:20:25 -08:00
Donald Dong	24f16de987	[Static Runtime] Support native op split_with_sizes (#69999 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69999 This adds support for the split_with_sizes operator in static runtime by adding native operators. Those operators will have less overhead compared to their JIT fallbacks (no dispatching, no stack construction at runtime). split_with_sizes can be called directly from the C++ API, or in `torch.split` when `split_sizes` is a list. This diff adds support for both use cases. Test Plan: - Added unit tests. Made sure the operators are used - Benchmark ``` ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \ --scripted_model=/data/users/dxd/305797439_0.predictor.precompute.remote_request_only \ --method_name=user.forward --pt_cleanup_activations=1 \ --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=500 \ --num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \ --input_type="recordio" --pt_inputs=/data/users/dxd/305797439_0_user.inputs.recordio \ --recordio_use_ivalue_format=1 --do_profile=1 --do_benchmark=1 ``` #### Before ``` Static runtime ms per iter: 3.62073. Iters per second: 276.187 0.0471904 ms. 1.31501%. aten::split_with_sizes (5 nodes) ``` #### After ``` Static runtime ms per iter: 3.44374. Iters per second: 290.382 0.0432057 ms. 1.34276%. aten::split_with_sizes (5 nodes, native) ``` Reviewed By: swolchok Differential Revision: D33141006 fbshipit-source-id: feae34c4c873fc22d48a8ff3bf4d71c0e00bb365	2021-12-20 18:32:54 -08:00
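For readers unfamiliar with the operator, the splitting behaviour can be sketched on a plain Python list. This is an illustrative analogue along a single dimension, not the static runtime implementation; the function below is a hypothetical stand-in that only borrows the operator's name.

```python
def split_with_sizes(items, sizes):
    """Split `items` into consecutive chunks whose lengths are given by `sizes`.

    Hypothetical list-based stand-in for the tensor operator.
    """
    if any(size < 0 for size in sizes):
        raise ValueError("sizes must be non-negative")
    if sum(sizes) != len(items):
        raise ValueError("sizes must sum to the length of the input")
    chunks = []
    start = 0
    for size in sizes:
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

The sum check reflects the requirement that the requested sizes cover the split dimension exactly.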
Mike Iovine	65f54bc000	[SR] Optimize VarStack (#68750 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68750 There was some room for optimization in static runtime's `prim::VarStack`: * Avoid refcount bumps - constructing the `std::vector<at::Tensor>` can be avoided by writing a custom version of `stack_out` that takes a `std::vector<at::Tensor>` * Skip the memory overlap check * Avoid device dispatcher overhead in a few places (e.g. `tensor.unsqueeze -> at::native::unsqueeze`) Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack` Reviewed By: swolchok Differential Revision: D32596934 fbshipit-source-id: e8f0ccea37c48924cb4fccbfdac4e1e11da95ee0	2021-12-20 11:46:11 -08:00
Nikita Shulga	26e32988bd	Revert D32596264: Codegen: TraceType only includes operators being registered Test Plan: revert-hammer Differential Revision: D32596264 (`e66a8ab4f5`) Original commit changeset: 2f28b62d7b99 Original Phabricator Diff: D32596264 (`e66a8ab4f5`) fbshipit-source-id: 7d18c4e77ce30dd7817a95f9c39b565cb246cd12	2021-12-17 11:20:12 -08:00
Peter Bell	e66a8ab4f5	Codegen: TraceType only includes operators being registered (#68691 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691 TraceType is a sharded file, so by only including specific operator headers, we ensure that changing one (non-method) operator only needs one shard to be re-compiled. This also changes all the included autograd and jit headers from including `ATen/ATen.h` to just including `ATen/core/Tensor.h`. Test Plan: Imported from OSS Reviewed By: jbschlosser, malfet Differential Revision: D32596264 Pulled By: albanD fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621	2021-12-17 10:35:05 -08:00
Scott Wolchok	66406ee0f7	[PyTorch][Static Runtime] Fix to() w/dtype bool (#69935 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69935 Didn't realize that `AT_DISPATCH_ALL_TYPES` should really be called `AT_DISPATCH_MOST_TYPES`. ghstack-source-id: 145661358 Test Plan: Added test for dtype bool. Ran CMF local_ro net: before: ``` I1215 12:33:49.300174 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.966491. Iters per second: 1034.67 I1215 12:33:49.825570 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.94867. Iters per second: 1054.11 I1215 12:33:50.349246 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947926. Iters per second: 1054.93 I1215 12:33:50.870433 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.943779. Iters per second: 1059.57 I1215 12:33:51.393702 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947185. Iters per second: 1055.76 I1215 12:33:51.915666 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.945672. Iters per second: 1057.45 I1215 12:33:52.438475 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948407. Iters per second: 1054.4 I1215 12:33:52.965337 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95472. Iters per second: 1047.43 I1215 12:33:53.494563 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.967083. Iters per second: 1034.04 I1215 12:33:54.017879 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948945. 
Iters per second: 1053.8 I1215 12:33:54.017930 1606538 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.951888, standard deviation: 0.0083367 ``` after: ``` I1215 12:32:35.820874 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.999845. Iters per second: 1000.15 I1215 12:32:36.343147 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944363. Iters per second: 1058.91 I1215 12:32:36.863806 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.942542. Iters per second: 1060.96 I1215 12:32:37.385459 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944677. Iters per second: 1058.56 I1215 12:32:37.905436 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941135. Iters per second: 1062.55 I1215 12:32:38.424907 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.939748. Iters per second: 1064.11 I1215 12:32:38.944643 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941764. Iters per second: 1061.84 I1215 12:32:39.463791 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.938946. Iters per second: 1065.02 I1215 12:32:39.987567 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95437. Iters per second: 1047.81 I1215 12:32:40.511204 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.959139. Iters per second: 1042.6 I1215 12:32:40.511242 1594955 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.950653, standard deviation: 0.0184761 ``` Reviewed By: hlu1 Differential Revision: D33106675 fbshipit-source-id: 5bb581f8d0ed22ef08df1936dc8d67045e44e862	2021-12-15 15:26:56 -08:00
Mike Iovine	873585da2b	[SR] Improve set_inputs (#69087 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69087 This diff includes a variety of improvements to `set_inputs` to unify behavior with `torch::jit::Module`: 1. Eliminate code duplication between rvalue/lvalue overloads 2. Add type checks 3. Make input length check a `TORCH_CHECK` instead of a debug check - we have to fail when the wrong number of inputs are passed. 4. `schema` now always includes `self`, even if we release `module_`. This is consistent with `torch::jit::Module`. ghstack-source-id: 145599837 Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D32711705 fbshipit-source-id: fe97c10b4f03801ba59868b452e7d02b26b3106b	2021-12-15 09:31:19 -08:00
Mike Iovine	102684b252	[SR] Fix stack/concat bug (#68777 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68777 Fixed some cases where negative dimensions were not handled correctly * `_stack_cpu` calls `maybe_wrap_dim`, but `_stack_cpu_out` does not. This is only problematic when `_stack_cpu_out` forwards to the serial kernel: [ref](https://www.internalfb.com/code/fbsource/[1b5af978b48f2e5d308d42b588bde3275869a57b]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=1541-1547). * concat also needs to wrap its dim Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Added new tests to cover this case Reviewed By: hlu1 Differential Revision: D32604623 fbshipit-source-id: 00aaa42817cd2d3e7606ce75ab5a9744645118cf	2021-12-14 16:26:27 -08:00
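The dimension wrapping mentioned above can be sketched in a few lines. This is a simplified Python illustration that assumes at least one dimension; `wrap_dim` is a hypothetical name and not the ATen helper itself.

```python
def wrap_dim(dim, ndim):
    """Map a possibly negative dimension index into the range [0, ndim).

    Hypothetical helper for illustration; assumes ndim >= 1.
    """
    if ndim < 1:
        raise ValueError("ndim must be >= 1 in this sketch")
    if not -ndim <= dim < ndim:
        raise IndexError(
            f"dimension {dim} out of range for {ndim} dimensions"
        )
    return dim + ndim if dim < 0 else dim
```

Stacking tensors that each have `n` dimensions yields `n + 1` output dimensions, so under this sketch a stack would wrap against `n + 1`: for 2-D inputs, `wrap_dim(-1, 3)` selects the new last dimension.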
Donald Dong	f7294cd865	[Static Runtime] Skip ReplaceWithCopy when inputs have writters (#69819 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69819 We should skip ReplaceWithCopy if the inputs to the operator can be updated during inference. For a set of tensors that share data, ReplaceWithCopy should not happen to any of them if there exist updates to any of them. Currently, the check in place has missed some cases (suppose there exist updates, and uses <= 1). This diff addresses the missing cases by querying AliasDB. Test Plan: - Added test cases, including one that was problematic before this diff - CI Reviewed By: mikeiovine Differential Revision: D33052562 fbshipit-source-id: 61f87e471805f41d071a28212f2f457e8c6785e7	2021-12-14 09:39:49 -08:00
Richard Barnes	29d759948e	use irange for loops 2 (#66746 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746 Modified loops in files under fbsource/fbcode/caffe2/ from the format `for(TYPE var=x0;var<x_max;var++)` to the format `for(const auto var: irange(x_max))` This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand. Test Plan: Sandcastle Reviewed By: malfet Differential Revision: D31705361 fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268	2021-12-10 04:26:23 -08:00
Mike Iovine	f87f1d08e8	[SR] assignStorageToManagedTensors returns a vector (#69568 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69568 Non-empty vectors should never be passed to `assignStorageToManagedTensors` and `assignStorageToManagedOutputTensors`. Presumably, this out-variant convention was adopted to avoid move-assigning the corresponding attributes in `MemoryPlanner`. But the cost of a vector move-assign is not high, and this function type signature is safer. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: donaldong Differential Revision: D32729289 fbshipit-source-id: 88f19de8eb89d8a4f1dd8bbd4d9e7f686e41888b	2021-12-09 17:01:48 -08:00
Don Jang	9aa1b3e396	[Static Runtime] [Code Cleanup] Encapsulate function objects within ProcessedFunction (#69595 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595 This change encapsulates the `function` object in `ProcessedFunction` objects instead of exposing it unnecessarily just to execute it. Test Plan: Existing tests Reviewed By: mikeiovine Differential Revision: D32908341 fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364	2021-12-09 15:11:03 -08:00
Mike Iovine	1c43b1602c	[SR] Scope exit guard for memory planner deallocation (#68795 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795 This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor. Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions potentially escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, 2) even if it wasn't, `std::terminate` will be called if an exception escapes and the stack is already unwinding. To get around this, we wrap the deallocation stuff in a try/catch. If deallocation throws, then we simply reset all of the memory planner stuff and carry on. There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here. Test Plan: New unit tests `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D32609915 fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e	2021-12-08 16:41:52 -08:00
Mike Iovine	008469c5e2	[SR] Simplify memory re-use algorithm (#68302 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302 Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis. Test Plan: ## Re-use metrics `inline_cvr` (294738512_58) Before * `local` ``` Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 4601984 bytes Total number of reused tensors: 1183 ``` * `local_ro` ``` Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 29696 bytes Total number of reused tensors: 959 ``` After * `local` ``` Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 4520000 bytes Total number of reused tensors: 1198 ``` * `local_ro` ``` Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 29120 bytes Total number of reused tensors: 963 ``` Reviewed By: hlu1 Differential Revision: D32370424 fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf	2021-12-07 13:25:42 -08:00
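The general idea behind memory re-use, letting tensors with non-overlapping lifetimes share storage, can be sketched as follows. This is a generic greedy illustration written for this log and is not the algorithm from the commit; `assign_slots` and the `(first_def, last_use)` lifetime encoding are assumptions.

```python
def assign_slots(lifetimes):
    """Greedily map tensors to storage slots so live tensors never share one.

    `lifetimes` maps a tensor name to (first_def, last_use) node indices.
    Hypothetical sketch; it ignores sizes, aliasing, and graph outputs.
    """
    order = sorted(lifetimes, key=lambda name: (lifetimes[name][0], name))
    free = []      # slots whose previous tensor is dead
    active = []    # (last_use, slot) pairs for tensors that are still live
    slots = {}
    next_slot = 0
    for name in order:
        first_def, last_use = lifetimes[name]
        # Release slots whose tensor died strictly before this definition,
        # so a node's output never shares storage with that node's inputs.
        still_active = []
        for end, slot in active:
            if end < first_def:
                free.append(slot)
            else:
                still_active.append((end, slot))
        active = still_active
        if free:
            slot = free.pop()
        else:
            slot = next_slot
            next_slot += 1
        slots[name] = slot
        active.append((last_use, slot))
    return slots
```

With a chain of four tensors where each is last used by the node that defines the next one, this sketch needs only two slots.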
Don Jang	9663e08674	[Static Runtime] Fix a bug that aten::embedding_bag keeps cannot handle resized input tensors (#69219 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69219 This change fixes a bug where the `aten::embedding_bag` implementation does not adjust the size of a managed output tensor according to a given input after memory planning starts. Test Plan: Enhanced `StaticRuntime.EmbeddingBag` to trigger the existing bug that's fixed by this change. Reviewed By: mikeiovine Differential Revision: D32544399 fbshipit-source-id: 0a9f1d453e96f0cfa8443c8d0b28bbc520e38b29	2021-12-03 19:01:45 -08:00
Scott Wolchok	b22e4d4aea	[PyTorch][SR] Add more to() tests & extend debug logging in testStaticRuntime (#67219 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67219 I found that these specific test cases were causing different failures when developing D31776259. I also found that it was difficult to debug testStaticRuntime failures, so I added more verbose logs gated behind -v 2. ghstack-source-id: 144507287 Test Plan: Used during development of D31776259 Reviewed By: hlu1 Differential Revision: D31847566 fbshipit-source-id: ea9147fb246c345d18bbc8d7f3bfba48d3a0fab3	2021-12-02 10:34:54 -08:00
Hao Lu	ed3b73fd4d	[Static Runtime] Skip ProcessedNode:: verify_no_memory_overlap() for out variants (#68639 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639 Fix all problems related to `ProcessedNode:: verify_no_memory_overlap()` - Only enable this check for native and fallback ops that are not inplace or view ops - Enable ProcessedNode:: verify_no_memory_overlap() in debug mode and enforce it - Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas fb::expand_dims's schema was not correct after this check is re-enabled. It's fixed in D32556204 (`39ab417107`) Reviewed By: mikeiovine Differential Revision: D32553708 fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13	2021-12-02 05:03:12 -08:00
Mike Iovine	ee4cfaa286	[SR] Add utility class to determine tensor ranges (#68284 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284 Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`. Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: swolchok Differential Revision: D32397207 fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6	2021-11-19 13:10:55 -08:00
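The idea behind `availableTensors` can be illustrated with a minimal sketch. It assumes a linear list of nodes and ignores aliasing entirely, which the real `ManagedTensorRanges` does not; `Node`, `lastUse`, and `availableAfter` are hypothetical names for illustration, not Static Runtime APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node: just the names of the values it reads.
struct Node {
  std::vector<std::string> inputs;
};

// For each value, record the index of the last node that reads it.
std::map<std::string, std::size_t> lastUse(const std::vector<Node>& nodes) {
  std::map<std::string, std::size_t> last;
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    for (const std::string& value : nodes[i].inputs) {
      last[value] = i;  // a later read overwrites an earlier one
    }
  }
  return last;
}

// Values whose last read happens at node `idx`; their storage could be
// handed out again once that node has run.
std::vector<std::string> availableAfter(
    const std::vector<Node>& nodes,
    std::size_t idx) {
  assert(idx < nodes.size());
  std::vector<std::string> out;
  const std::map<std::string, std::size_t> last = lastUse(nodes);
  for (const auto& entry : last) {
    if (entry.second == idx) {
      out.push_back(entry.first);
    }
  }
  return out;
}
```

A value read by several nodes only becomes available after the last of them, which is why the sketch tracks the maximum reader index rather than the first.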
Ben Koopman	c2c859bdf2	[quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560 Test Plan: Imported from OSS Reviewed By: HDCharles Differential Revision: D31618282 Pulled By: b-koopman fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3	2021-11-19 06:29:19 -08:00
CodemodService FBSourceClangFormatLinterBot	143491e0ad	[AutoAccept][Codemod][FBSourceClangFormatLinter] Daily `arc lint --take CLANGFORMAT` Reviewed By: zertosh Differential Revision: D32484422 fbshipit-source-id: 5c836dc7d06f12e64cc4bb1e85d8fa4b62a29b85	2021-11-17 07:27:04 -08:00
jjsjann123	0dc3f829d9	Nvfuser code bump 11 5 (#67943 ) Summary: nvfuser code update: 1. Tuning heuristics on schedulers for reduction/normalization kernels; 2. bfloat16 on IO tensor support; 3. Refactored memory format support, now we can support dimension collapsing with non-coherent input tensors with different memory format. e.g. channels last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last; 4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`. Things that are reverted from our local branch: 1. changes on some entries in autodiff 2. aten::gelu with approximation 3. native_dropout(_backward) Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943 Reviewed By: ngimel Differential Revision: D32288709 Pulled By: dzhulgakov fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1	2021-11-17 01:22:17 -08:00
Don Jang	aa9ee8d02a	[Static Runtime] Avoid copying function objects per StaticRuntime instance (#68368 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368 Currently, each instance of `StaticRuntime` has its own copy of `std::function` object wrapped in `ProcessedNode::Function` object, in order to invoke actual operation implementation. However, all instances of `StaticRuntime` derived from the same `StaticModule` object invoke exactly the same op implementation, and this is avoidable. This change adds `StaticModule::functions_` member variable to keep a list of unique instances of `ProcessedFunction` objects. A newly constructed `StaticRuntime` takes `ProcessedFunction`'s pointers instead of the whole function object. This can save a substantial amount of memory per `StaticRuntime` instance. This comes with a sacrifice in execution time. Now that a `ProcessedNode` instance keeps the function object's pointer, executing a node now involves an extra pointer dereference. However, this cost proved to be negligible in local performance tests. Thanks to hlu1 for proposing this non-intrusive improvement idea :D Test Plan: This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) (patched D32181666 to print the memory turnover from instantiating a StaticRuntime instance) for CMF/local ( & 8% for CMF/local_ro). No noticeable latency regression was observed. ==AFTER * CMF/local memory turnover: 393608 latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087 * CMF/local_ro memory turnover: 387288 latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101 ==BEFORE * CMF/local memory turnover: 459888 latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18 * CMF/local_ro memory turnover: 420832 latency: PyTorch run finished. Milliseconds per iter: 7.43756. Iters per second: 134.453 ==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr: ==AFTER Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 1496896 bytes Total number of reused tensors: 1183 Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%) Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 39040 bytes Total number of reused tensors: 959 Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%) Total number of managed tensors: 1293 Total number of managed output tensors: 0 Total number of unmanaged values: 14 Total memory managed: 5293824 bytes Total number of reused tensors: 771 Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%) ==BEFORE Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 1496896 bytes Total number of reused tensors: 1183 Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%) Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 39040 bytes Total number of reused tensors: 959 Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%) Total number of managed tensors: 1293 Total number of managed output tensors: 0 Total number of unmanaged values: 14 Total memory managed: 5293824 bytes Total number of reused tensors: 771 Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%) Reviewed By: swolchok Differential Revision: D32337548 fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a	2021-11-16 20:28:48 -08:00
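The trade-off described in this commit (one shared function object per op, reached through a pointer) can be sketched roughly as follows. `OpFn`, `Module`, `RuntimeNode`, and `makeRuntimeNodes` are hypothetical stand-ins, not the actual `StaticModule`/`ProcessedNode` types, and the ops are plain `int -> int` functions for brevity.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical op signature; real ops operate on a ProcessedNode.
using OpFn = std::function<int(int)>;

// The module owns exactly one function object per op.
struct Module {
  std::vector<OpFn> functions;
};

// A runtime node stores only a pointer to the shared function object,
// so creating many runtimes does not copy any std::function.
struct RuntimeNode {
  const OpFn* fn;

  int run(int input) const {
    assert(fn != nullptr);
    return (*fn)(input);  // the extra pointer dereference mentioned above
  }
};

// Build the per-runtime node list; the module must outlive the result.
std::vector<RuntimeNode> makeRuntimeNodes(const Module& module) {
  std::vector<RuntimeNode> nodes;
  nodes.reserve(module.functions.size());
  for (const OpFn& fn : module.functions) {
    nodes.push_back(RuntimeNode{&fn});
  }
  return nodes;
}
```

Two runtimes built from the same module end up with identical `fn` pointers, which is where the per-instance memory saving comes from in this simplified picture.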
Michael Suo	5c3529a86d	[lint] small pass to make lint clean (#68367 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367 - bmm_test.py was using syntax not allowed in 3.6 - Some suppressions were not placed on the correct line. With this file, ``` lintrunner --paths-cmd='git grep -Il .' ``` passes successfully. Test Plan: Imported from OSS Reviewed By: janeyx99, mrshenli Differential Revision: D32436644 Pulled By: suo fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2	2021-11-16 10:27:00 -08:00
Scott Wolchok	639258499f	[PyTorch][Static Runtime] Add & use "small array" for ProcessedNodeInputs (#67935 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935 Rationale should be documented in code comments. In short, we can avoid heap-allocating arrays of input indexes for operators with 5 or fewer inputs, at the cost of a tag bit check on access. ghstack-source-id: 143429112 Test Plan: Patched d1jang's D32181666, which prints static runtime memory usage. Previous diff, local: ``` I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208 ``` This diff, local: ``` I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064 ``` 4.5% savings (16144 bytes) Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff. Reviewed By: hlu1 Differential Revision: D32216573 fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c	2021-11-16 10:21:12 -08:00
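A simplified sketch of the "small array" access path: up to 5 indexes live inline and larger lists go to the heap. This is an assumption-laden illustration only. The commit describes a tag bit inside a compact representation, whereas this hypothetical `SmallInputs` class keeps a separate inline buffer and heap pointer, so it shows the branch on access but not the actual memory layout or savings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

class SmallInputs {
 public:
  static constexpr std::size_t kInlineCapacity = 5;

  explicit SmallInputs(std::size_t size) : size_(size) {
    if (size_ > kInlineCapacity) {
      // Only lists with more than 5 entries pay for a heap allocation.
      heap_ = std::make_unique<std::uint16_t[]>(size_);
    }
  }

  bool isInline() const { return heap_ == nullptr; }
  std::size_t size() const { return size_; }

  // Every access checks which storage is active; this plays the role of
  // the tag bit check described in the commit message.
  std::uint16_t& operator[](std::size_t i) {
    assert(i < size_);
    return isInline() ? inline_[i] : heap_[i];
  }

 private:
  std::size_t size_;
  std::uint16_t inline_[kInlineCapacity] = {};
  std::unique_ptr<std::uint16_t[]> heap_;
};
```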
Scott Wolchok	6acde23bec	[PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934 This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode. ghstack-source-id: 143429113 Test Plan: Patched d1jang's diff to measure memory turnover around SR startup. Previous diff, CMF local: ``` I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120 ``` This diff, CMF local: ``` I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208 72912 bytes (17%) savings ``` Perf looks neutral; see next diff (D32216573) test plan for details. Reviewed By: hlu1 Differential Revision: D32190751 fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc	2021-11-16 10:19:50 -08:00
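The 2-byte offset scheme can be sketched as below, assuming all node outputs are allocated sequentially into one shared array before anything is read. `NodeOutputs` and `SharedValues` are hypothetical names, and `int` stands in for the real value type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// What a node has to remember: 2 bytes of offset plus 2 bytes of count.
struct NodeOutputs {
  std::uint16_t offset;  // index of the node's first output in the shared array
  std::uint16_t count;   // number of outputs the node produces
};

class SharedValues {
 public:
  // Reserve `count` consecutive slots and return their location.
  NodeOutputs allocate(std::uint16_t count) {
    // The representation supports at most 2**16 - 1 values in total.
    assert(values_.size() + count <= std::numeric_limits<std::uint16_t>::max());
    NodeOutputs out{static_cast<std::uint16_t>(values_.size()), count};
    values_.resize(values_.size() + count);
    return out;
  }

  // Access the i-th output of a node through its offset.
  int& output(const NodeOutputs& node, std::uint16_t i) {
    assert(i < node.count);
    return values_[static_cast<std::size_t>(node.offset) + i];
  }

 private:
  std::vector<int> values_;
};
```

In this sketch a reference returned by `output` is invalidated by a later `allocate`, so all allocation should happen up front.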
Don Jang	9cb65df79f	[Static Runtime] Fallback to disabling manage_output_tensors instead of crashing when wrong API is used (#67939 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939 With `manage_output_tensor` enabled, a client of `StaticRuntime` is required to call it via `PyTorchPredictor::predict_managed_result`. If the client uses `PyTorchPredictor::operator()` the client will experience a crash (intended behavior not to leak memory of managed output tensors). This mistake can cause a catastrophic failure in production if that happens (by gatekeeper, config changes, etc). Considering the complexity in how `PyTorchPredictor` is used in different settings, the chance that this bug hits production is non-zero. This change introduces `StaticRuntime::disableManageOutputTensor` to disable the `manage_output_tensor` feature, instead of crashing, when a client mistakenly uses `PyTorchPredictor::operator()`. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can get non-managed output tensors to pass to the client safely. A slight perf degradation is expected by forcefully disabling `manage_output_tensors`, but its robustness value outweighs a catastrophic failure of crashes at a high rate. Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code. Reviewed By: swolchok Differential Revision: D32219731 fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf	2021-11-11 17:31:07 -08:00
Hao Lu	47bc47f2b9	[SR] Add runtime check to correct bad schema alias info (#67825 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825 The comment explains how it works. Test Plan: A small regression to local and local_ro if we only enable it for fallback ops. ``` ## local_ro # before I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247 I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319 I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759 I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619 # after # # only enable for fall back ops: 0.7% I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481 I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262 I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51 I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424 # enable for all (no clone): 4.7% I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005 I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14 I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. 
Iters per second: 781.506 I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765 # local # before I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233 I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641 I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735 I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206 # after # # only enable for fall back ops: 0.1% I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349 I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601 I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659 I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658 # enable for all (no clone): 0.9% I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783 I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611 I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. 
Iters per second: 50.3037 I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498 ``` Reviewed By: d1jang Differential Revision: D32124812 fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa	2021-11-10 19:35:11 -08:00
Mike Iovine	ecd5b1a8d4	[SR] Native implementation for aten::split (#67476 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476 Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: d1jang Differential Revision: D31994040 fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a	2021-11-10 10:23:03 -08:00
Hao Lu	1b2a366932	[SR] Enforce checks for resizing of the internal buffer in MemoryPlanner in unit tests (#67941 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67941 I just found out that due to the round up of the Tensor storage sizes to multiples of 64 bytes, resizing is not actually triggered for a lot of our unit tests (23 OSS, 16 internal). Now they should be all fixed. Also moved a bunch of tests to `test_static_module.cc` so that `test_static_runtime.cc` now only contains operator tests. From now on, by default if `args2` is passed to `test_static_runtime`, at the end of the second iteration, it would check that the managed buffer's size is bigger than the previous size and enforce that. You can bypass the check for ops with constant output sizes, such as `aten::sum` without `dim` passed in. Test Plan: Facebook ``` buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators ``` Reviewed By: swolchok Differential Revision: D32196204 fbshipit-source-id: 8425d9efe6b9a1c1e3807e576b1143efd7561c71	2021-11-09 16:07:40 -08:00
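Why rounding hides resizes can be seen with a few lines of arithmetic. The sketch below assumes storage sizes are rounded up to the next multiple of 64 bytes, as the commit message states; `kAlignment` and `roundUp` are hypothetical names, not the identifiers used in `MemoryPlanner`.

```cpp
#include <cassert>
#include <cstddef>

// Assumed alignment from the commit message: multiples of 64 bytes.
constexpr std::size_t kAlignment = 64;

// Round a byte count up to the next multiple of kAlignment.
constexpr std::size_t roundUp(std::size_t nbytes) {
  return (nbytes + kAlignment - 1) / kAlignment * kAlignment;
}

// True if going from `before` to `after` bytes changes the rounded size,
// i.e. the managed buffer would actually have to grow.
constexpr bool triggersResize(std::size_t before, std::size_t after) {
  return roundUp(after) > roundUp(before);
}
```

For example, a float tensor growing from 10 elements (40 bytes) to 16 elements (64 bytes) still rounds to 64 bytes, so a test with those two input sizes would not exercise the resize path; growing to 17 elements (68 bytes) would.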
David Berard	b546cdf401	[SR] Out variant for prim::NumToTensor (#67856 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67856 Returns a tensor constructed from scalar input Test Plan: ``` buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest ``` Ran ``` buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=NumToTensorScalar --v=1 ``` and the output contains `Switch to out variant for node: %2 : Tensor = prim::NumToTensor(%0)`. Reviewed By: mikeiovine Differential Revision: D32014194 fbshipit-source-id: e7df65ea1bf05d59c1fc99b721aee420e484f542	2021-11-08 09:02:58 -08:00
Mike Iovine	5bc89275dd	[SR] Eliminate no-ops (#67437 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437 Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this. Test Plan: `buck test caffe2/test:jit -- test_freezing` Reviewed By: hlu1 Differential Revision: D31980843 fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0	2021-11-08 08:42:33 -08:00
Bert Maher	4b084bc832	Benchmarks for various fusers (#67622 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67622 Test Plan: Imported from OSS Reviewed By: eellison Differential Revision: D32171063 Pulled By: bertmaher fbshipit-source-id: 40d3a7adcc52aba3b051e382ec5ec4ee7e43d81b	2021-11-04 18:57:17 -07:00
Hao Lu	938bab0bfd	[PyTorch] Add int version of vectorized PrefixSum to Benchmark (#67865 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67865 - Add int version of vectorized PrefixSum - Use unaligned load/store instructions - Add exclusive scan version. "exclusive" means that the i-th input element is not included in the i-th sum. For details see https://en.cppreference.com/w/cpp/algorithm/exclusive_scan Test Plan: ``` buck build mode/opt-clang //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench OMP_NUM_THREADS=1 numactl -m 0 -C 5 \ ./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench ``` For full benchmark results, see P465274613 ``` PrefixSumBench/LocalInt/64 57 ns 56 ns 12414048 GB/s=9.06239G/s PrefixSumBench/LocalInt/256 221 ns 221 ns 3160853 GB/s=9.28635G/s PrefixSumBench/LocalInt/1024 818 ns 817 ns 857922 GB/s=10.0235G/s PrefixSumBench/LocalInt/4096 3211 ns 3210 ns 217614 GB/s=10.2093G/s PrefixSumBench/LocalInt/16384 12806 ns 12804 ns 54805 GB/s=10.2364G/s PrefixSumBench/LocalInt/65536 51115 ns 51079 ns 13741 GB/s=10.2643G/s PrefixSumBench/LocalInt/262144 205974 ns 205912 ns 3401 GB/s=10.1847G/s PrefixSumBench/LocalInt/1048576 829523 ns 828859 ns 845 GB/s=10.1207G/s PrefixSumBench/LocalIntAVX2/64 45 ns 45 ns 15568113 GB/s=11.3549G/s PrefixSumBench/LocalIntAVX2/256 208 ns 208 ns 3371174 GB/s=9.86913G/s PrefixSumBench/LocalIntAVX2/1024 893 ns 892 ns 783154 GB/s=9.18629G/s PrefixSumBench/LocalIntAVX2/4096 3618 ns 3613 ns 193834 GB/s=9.06838G/s PrefixSumBench/LocalIntAVX2/16384 14416 ns 14411 ns 48564 GB/s=9.09543G/s PrefixSumBench/LocalIntAVX2/65536 57650 ns 57617 ns 12156 GB/s=9.09952G/s PrefixSumBench/LocalIntAVX2/262144 230855 ns 230612 ns 3035 GB/s=9.09386G/s PrefixSumBench/LocalIntAVX2/1048576 924265 ns 923777 ns 758 GB/s=9.08077G/s PrefixSumBench/LocalIntAVX512/64 23 ns 23 ns 24876551 GB/s=22.0697G/s PrefixSumBench/LocalIntAVX512/256 95 ns 95 ns 7387386 GB/s=21.556G/s 
PrefixSumBench/LocalIntAVX512/1024 435 ns 435 ns 1609682 GB/s=18.8425G/s PrefixSumBench/LocalIntAVX512/4096 1815 ns 1815 ns 385462 GB/s=18.0561G/s PrefixSumBench/LocalIntAVX512/16384 7479 ns 7476 ns 93660 GB/s=17.5335G/s PrefixSumBench/LocalIntAVX512/65536 30171 ns 29879 ns 23430 GB/s=17.5468G/s PrefixSumBench/LocalIntAVX512/262144 125805 ns 125631 ns 5570 GB/s=16.6929G/s PrefixSumBench/LocalIntAVX512/1048576 504216 ns 503983 ns 1384 GB/s=16.6446G/s PrefixSumBench/ExclusiveScanIntAVX512/64 23 ns 23 ns 30058295 PrefixSumBench/ExclusiveScanIntAVX512/256 101 ns 101 ns 7398498 PrefixSumBench/ExclusiveScanIntAVX512/1024 435 ns 434 ns 1403877 PrefixSumBench/ExclusiveScanIntAVX512/4096 1979 ns 1978 ns 354016 PrefixSumBench/ExclusiveScanIntAVX512/16384 7828 ns 7819 ns 89551 PrefixSumBench/ExclusiveScanIntAVX512/65536 31206 ns 31192 ns 22408 PrefixSumBench/ExclusiveScanIntAVX512/262144 130106 ns 130023 ns 5388 PrefixSumBench/ExclusiveScanIntAVX512/1048576 525515 ns 524976 ns 1244 ``` Reviewed By: navahgar, swolchok Differential Revision: D32011740 fbshipit-source-id: 7962de710bd588291dd6bf0c719f579c55f7c063	2021-11-04 14:00:19 -07:00
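The difference between the inclusive and exclusive variants is easiest to see in scalar form. The following is a plain-loop sketch, not the vectorized AVX2/AVX512 kernels benchmarked above; `inclusiveScan` and `exclusiveScan` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + ... + in[i].
std::vector<int> inclusiveScan(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  int running = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    running += in[i];
    out[i] = running;
  }
  return out;
}

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], so the i-th input
// element is not included in the i-th sum and out[0] is 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  int running = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = running;
    running += in[i];
  }
  return out;
}
```

For the input `{1, 2, 3, 4}` the inclusive result is `{1, 3, 6, 10}` and the exclusive result is `{0, 1, 3, 6}`.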
Bin Wen	1baed45c6b	[fbcode][static runtime] out-variant for quantized::linear_dynamic_fp16 (#67663 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67663 Mostly follows the example of quantized::linear (D28428734 (`4d7abdbdad`)) to enable an out variant for quantized::linear_dynamic_fp16. The reason: in the MP tab ctr PyTorch model migration, we observed that the quantized::linear_dynamic_fp16 operator has the highest cost but does not have an out variant enabled yet https://fburl.com/phabricator/b5juus2d Test Plan: buck build mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench sudo watch -n 20 /usr/local/fbprojects/dynamoserver/bin/turboDriver disable MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/home/bwen/models/991103061_4/991103061_4.predictor --pt_inputs=/home/bwen/models/991103061_4/pt_inputs --method_name=forward --pt_cleanup_activations=1 --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=1000 --num_threads=1 --repetitions=3 --do_profile=1 --do_benchmark=1 --set_compatibility=1 --compare_results=1 --pt_enable_static_runtime 2>&1 | pastry before: P465201159 0.929067 ms. 31.808%. quantized::linear_dynamic_fp16 (16 nodes) 0.921679 ms. 31.7324%. quantized::linear_dynamic_fp16 (16 nodes) 0.919127 ms. 31.7404%. quantized::linear_dynamic_fp16 (16 nodes) after: P465203015 0.90898 ms. 31.0205%. quantized::linear_dynamic_fp16 (16 nodes, out variant) 0.9127 ms. 30.62%. quantized::linear_dynamic_fp16 (16 nodes, out variant) 0.879148 ms. 31.0161%. quantized::linear_dynamic_fp16 (16 nodes, out variant) The unit test logic refers to https://fburl.com/code/vv0rry13 buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest Reviewed By: hlu1 Differential Revision: D32001168 fbshipit-source-id: 873d9f77434b9c4bafb298c871173f9a560dd2a3	2021-11-03 22:39:04 -07:00
Hao Lu	89b02fc70b	[StaticRuntime][Easy] Correct typos in test_static_runtime (#67739 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67739 Test Plan: ``` buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest ``` Reviewed By: mikeiovine Differential Revision: D32125879 fbshipit-source-id: bd989e5088edff87624b858bd9045dfe9da3fbe7	2021-11-03 13:24:46 -07:00
Shashank Chaudhry	89c4e8c22b	[NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746 Test Plan: Visual inspection. Sandcastle. Reviewed By: zertosh Differential Revision: D31986646 fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8	2021-11-03 12:23:14 -07:00
Scott Wolchok	82f7f8d471	[PyTorch] Adopt IValue::toTupleRef() where obvious (#65505 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65505 Generated with `fastmod -m 'toTuple(\s)->' 'toTupleRef()${1}.'` , followed by `fastmod '(std::move$.)toTupleRef\($.' '${1}toTuple()->'` to unbreak 2 callsites. ghstack-source-id: 142065835 Test Plan: CI Reviewed By: gchanan Differential Revision: D31131025 fbshipit-source-id: 54457ae5bbeb38db9c7f196d469b98521c3d3f34	2021-11-02 10:22:18 -07:00
Mike Iovine	39ad7b670e	[SR] Native implementation for aten::squeeze (#67441 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441 Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31992093 fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e	2021-11-01 08:22:57 -07:00
Mike Iovine	0d7cf825fc	[SR] Drop support for aten::__is__ and aten::__isnot__ (#67550 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67550 `aten::__is__` and `aten::__isnot__` are extremely problematic for a large number of SR graph optimizations. Some examples: - Removing ops that are no-ops in the forward pass like `aten::detach`. This would normally be trivial, but `is` introduces corner cases like this: ``` def forward(x): y = x.detach() return x is y ``` We get `False` before optimizations. But after optimizations, the test becomes `x is x`, and we get `True`. - `ReplaceWithCopy`: the pass that replaces ops like `aten::to` with an out variant that copies its input. The following graph returns `True` before optimizations, but `False` afterwards ``` def forward(x): y = x.to(x.dtype) return x is y ``` - And many more, `FuseListUnpack` can break too Since the ops are not used by 99.99% of users, rejecting them so we don't have to think about this is not a big deal. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: d1jang Differential Revision: D32022584 fbshipit-source-id: d135938edb2299c9b8f9511afac2bf568578879e	2021-11-01 04:45:14 -07:00
Mike Iovine	354363b57a	[SR] Native implementation for aten::size (#67346 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346 Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: d1jang Differential Revision: D31965159 fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5	2021-10-28 14:18:17 -07:00
Mike Iovine	afb8434440	[SR] Native implementation for aten::view (#67341 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341 Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31962589 fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435	2021-10-28 13:37:46 -07:00
Bin Wen	6900aacf54	[fbcode] Fix operator_benchmark with jit mode (#67382 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382 Two simple updates: * Fix running the benchmark with --use_jit. Previously it failed with the error torch.jit.frontend.UnsupportedNodeError: import statements aren't supported: File "/proc/self/fd/3/bmm_test.py", line 9 def __invoke_main(): import ctypes ~~~~~~ <--- HERE import ctypes.util import errno * Add matmul to the bmm benchmark, as in D31837588 Test Plan: buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test -- --forward_only=True --mkl_num_threads=1 --omp_num_threads=1 --use_jit=True Reviewed By: ShijunK Differential Revision: D31960528 fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f	2021-10-28 08:48:10 -07:00
Mike Iovine	7da9c4ed2e	[SR] NNC out variant for aten::where (#67255 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67255 Add an out variant for `aten::where`. Since this op can be implemented quite trivially in NNC with `ifThenElse`, I added an NNC kernel as well. Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: navahgar Differential Revision: D31923886 fbshipit-source-id: b4379ee3aaf31a000e626b4caeafd3e3f3d60837	2021-10-28 06:48:22 -07:00
Hao Lu	9ebc6357b3	[SR] Vectorize int version of fmod (#67313 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67313 Reviewed By: swolchok Differential Revision: D31889868 fbshipit-source-id: a0af399431a0d672fa56cf2f2ba6d548c47bcedd	2021-10-27 17:02:53 -07:00
Mike Iovine	a0495b3cdb	[SR] Remove unused operator() overload (#67001 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001 The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere. This new overload will probably take a forwarding reference, so to avoid problems with overloading on forwarding reference and simplify the interface, it's best to remove this unused one. Test Plan: `buck test caffe2/benchmarks/static_runtime/...` `buck test caffe2/test:static_runtime` Reviewed By: hlu1 Differential Revision: D31821990 fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a	2021-10-25 08:18:58 -07:00
Mike Iovine	f2582a59d0	[SR] Add rvalue overload for operator() (#66648 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648 Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes: 1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`. 2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`. Old: ``` operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs) ``` New: ``` template <class IValueList> operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs) ``` The implementations use perfect forwarding to invoke the correct overload of `set_inputs`. Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass. Reviewed By: hlu1 Differential Revision: D31659973 fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed	2021-10-22 10:51:47 -07:00
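The forwarding pattern in this commit can be sketched in isolation. The example assumes a hypothetical `Runtime` class with `std::string` standing in for `IValue` and no `kwargs`; only the overload-selection mechanism mirrors what the commit message describes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class Runtime {
 public:
  // Lvalue overload: the caller keeps its vector, so the inputs are copied.
  void set_inputs(const std::vector<std::string>& args) {
    inputs_ = args;
    last_call_copied_ = true;
  }

  // Rvalue overload: the caller gave up its vector, so the inputs are moved.
  void set_inputs(std::vector<std::string>&& args) {
    inputs_ = std::move(args);
    last_call_copied_ = false;
  }

  // Forwarding reference: std::forward preserves the value category of
  // `args`, which selects the matching set_inputs overload.
  template <class ArgList>
  std::size_t operator()(ArgList&& args) {
    set_inputs(std::forward<ArgList>(args));
    return inputs_.size();
  }

  bool lastCallCopied() const { return last_call_copied_; }

 private:
  std::vector<std::string> inputs_;
  bool last_call_copied_ = false;
};
```

Calling `rt(args)` with a named vector takes the copying path, while `rt(std::move(args))` takes the moving path and leaves `args` in a valid but unspecified state.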
Aditya Pillai	40a8a50913	Add static_runtime::fused_equally_split (#2 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch-canary/pull/2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/66881 Adds `static_runtime::fused_equally_split` operator and removes `is_fused` logic from original operator. Modifies `FuseUnpackListV2` to map `fb::equally_split` to this new operator. Test Plan: ``` adityapillai@5960 /data/sandcastle/boxes/fbsource/fbcode 1m 13s ❯ buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators ``` and sandcastle strange_what_could_go_wrong Reviewed By: mikeiovine Differential Revision: D31742293 fbshipit-source-id: 60b35589c8817719b005d49811f575b6590d1c39	2021-10-22 10:26:49 -07:00
Don Jang	18bbc4c2b7	[Static Runtime] Fix a bug in aten::index (#66940 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66940 `aten::index`'s schema is as follows: ``` aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor ``` The current implementation assumes `indices`' elements are all tensors by doing `elem.toTensor`, which is incorrect. This change creates an empty optional value if an element from `indices` is not a tensor. Test Plan: Fixed `StaticRuntime, IndividualOps_Index` to correctly test `aten::index` with `indices` that contain `None`. Reviewed By: hlu1 Differential Revision: D31712145 fbshipit-source-id: be1c29674bcd55b67b0dcc2a988bc37fd43745f3	2021-10-20 15:51:21 -07:00
lezcano	0974215c4d	Prefer mT and mH over transpose(-2, -1) and transpose(-2, -1).conj() (#64181 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64181 This PR replaces all the calls to: - `transpose(-2, -1)` or `transpose(-1, -2)` by `mT()` in C++ and `mT` in Python - `conj().transpose(-2, -1)` or `transpose(-2, -1).conj()` or `conj().transpose(-1, -2)` or `transpose(-1, -2).conj()` by `mH()` in C++ and `mH` in Python. It also simplifies two pieces of code, and fixes one bug where a pair of parentheses were missing in the function `make_symmetric_matrices`. Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D31692896 Pulled By: anjali411 fbshipit-source-id: e9112c42343663d442dc5bd53ff2b492094b434a	2021-10-18 13:02:25 -07:00
Xue Li	2f099c7555	Revert D30652629: use irange for loops Test Plan: revert-hammer Differential Revision: D30652629 (`687c2267d4`) Original commit changeset: 0ae6c4bbbb55 fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3	2021-10-15 15:23:10 -07:00
Richard Barnes	687c2267d4	use irange for loops (#66234 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234 Modified loops in files under fbsource/fbcode/caffe2/ from the format `for(TYPE var=x0;var<x_max;x++)` to the format `for(const auto var: irange(xmax))` This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand. bypass_size_limit allow-large-files Test Plan: Sandcastle Reviewed By: ngimel Differential Revision: D30652629 fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e	2021-10-15 13:50:33 -07:00
Vasiliy Kuznetsov	d802877dfa	speed up quantized interpolate for channels last (#66525 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525 This should solve https://github.com/pytorch/pytorch/issues/60015 There were two `q_zero_point()` accesses inside a for loop which was expensive. Moving them to before the loop sped things up 10x for a microbenchmark. Test Plan: ``` // comment out benchmarks unrelated to original issue, for simplicity cd benchmarks/operator_benchmark python -m pt.qinterpolate_test // before: 2994 us // after: 324 us // full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453 ``` Imported from OSS Reviewed By: jerryzh168 Differential Revision: D31592422 fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459	2021-10-14 08:11:26 -07:00
Hao Lu	6634570aef	[SR] Fix bug in ValueGroup (#66470 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66470 Reviewed By: d1jang Differential Revision: D31566348 fbshipit-source-id: e0f634af77d893bbc8d66f214b2b8bdd6ab58cc3	2021-10-13 19:26:38 -07:00
Scott Wolchok	d30397d42a	[PyTorch][Static Runtime] Don't use vector in ProcessedNode (#65429 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429 The sizes of these arrays can't change, so there's no need to waste an extra pointer on them. ghstack-source-id: 140532722 Test Plan: CI I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%. Reviewed By: hlu1 Differential Revision: D31096042 fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395	2021-10-13 19:13:20 -07:00
Mike Iovine	37db650c9c	[Static Runtime] Clone test does not use uninitialized memory (#66557 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66557 The test was previously using `at::empty_strided` to initialize one of its inputs. The contents of the tensor returned by this function are random, uninitialized memory. If we happened to get a NaN, this test would fail since `use_equalnan` was not set. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31611961 fbshipit-source-id: 79a9476d0d6ce7a9f1412eefcef19bc2618c54b8	2021-10-13 14:02:34 -07:00
Don Jang	736fa09a9a	[Static Runtime] Manage output tensors (#65515 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515 This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows: - At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff. - At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor` to manage. This set consists of tensor objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple). - The 2nd+ iterations preallocate a slab of memory and all output tensors identified during the 1st iteration. Note that these preallocated tensors are NOT deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them. This mandates that SR cannot be reentered until `deallocateOutputTensors` is called by the client. - In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory. - Nit: I plan to use camelCase for function names, and so all newly introduced functions use camelCase despite inconsistencies with snake_case. We can gradually fix the inconsistencies. This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed. Test Plan: - Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths. - Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing testcases (e.g., `StaticRuntime.EmbeddingBag`). Reviewed By: hlu1 Differential Revision: D31049221 fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383	2021-10-11 09:50:54 -07:00
Don Jang	416f593080	[Static Runtime] Group graph nodes into input aliases & output aliases (#65517 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517 This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows: - input_aliases: values that are either inputs or contain aliases of inputs or constants. - output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases. - Values that don't show up in input_aliases and output_aliases are internally created and consumed within the graph. `output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance. Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before. Reviewed By: hlu1 Differential Revision: D30940955 fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc	2021-10-07 14:35:12 -07:00
Mike Iovine	d5f64afc38	[Static Runtime] Support aten::to.prim_dtype overload (#64928 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64928 Added support for this overload of `aten::to`: ``` aten::to.prim_dtype(Tensor(a) self, int? dtype, bool non_blocking=False, bool copy=False) -> Tensor(a|b) ``` Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_to` Reviewed By: hlu1 Differential Revision: D30901398 fbshipit-source-id: 38ce807c30185e92dd472b404b362f22ac7e4efb	2021-10-07 10:22:44 -07:00
Mike Iovine	6d7fab5929	[Static Runtime][easy] Clone scripts do not use aten::add (#66161 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66161 `aten::add` is not guaranteed to be bit exact with the JIT interpreter. This was causing non-deterministic test failures on master. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31406764 fbshipit-source-id: d968cb1bdb8f33934682ef3712a1341a3aacf18e	2021-10-06 12:37:39 -07:00
Alexandr Guzhva	b8e1999253	[quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183 Add a GPU benchmark for fakeQuant, similar to #65241 ghstack-source-id: 139810414 Test Plan: https://pxl.cl/1QjJM Reviewed By: b-koopman Differential Revision: D31288158 fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d	2021-10-06 08:07:42 -07:00
Mike Iovine	ed50fa2513	[Static Runtime] Test isOptimizableContainerType and getAlwaysAliveValues (#65849 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849 Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31282901 fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9	2021-10-04 20:46:07 -07:00
Nikita Shulga	4c4525fa5c	Compile without -Wno-unused-variable (take 2) (#66041 ) Summary: Delete `-Wno-unused-variable` from top level `CMakeLists.txt` Still suppress those warnings for tests and `torch_python` Delete number of unused variables from caffe2 code Use `(void)var;` to suppress unused variable in range loops Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants Do not delete `caffe2::OperatorBase::Output` calls as they have side effects Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041 Reviewed By: ngimel Differential Revision: D31360142 Pulled By: malfet fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8	2021-10-04 20:39:39 -07:00
Don Jang	89ed9bdaee	[Static Runtime] Fix bug of creating output aliases in aten::embedding_bag (#65516 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65516 This change fixes a bug where Static Runtime's `aten::embedding_bag` out variant implementation creates aliases in its managed output tensors. Managed output tensors should never alias each other since writing to one can unintentionally overwrite another's contents, and this exact problem was causing the bug at T97393697, causing SR to return wrong return values. This bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (`3fb33b38b9`)), but wasn't found so far since our testing didn't include running the model in the debug mode. Fortunately this bug is not hitting production since the aliased outputs are not used in production. This change fixes the root cause from `_embedding_bag_cpu_impl_out` by replacing alias creation with copying. Note that this change also includes a fundamental change in Static Runtime's unit testing: `testStaticRuntime` exercises the given graph 3 times: 1. profile run 2. run using the profile to allocate managed tensors 3. reuse the managed tensors -- newly added Adding step 3 reveals this bug with a new unittest `EmbeddingBagWithManagedOutput`. Test Plan: - Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248). - Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future. Reviewed By: hlu1 Differential Revision: D31104345 fbshipit-source-id: 7bddf9cd82b400d18d8ce1bf15e29b815ef9ba8f	2021-10-03 15:10:58 -07:00
Nikita Shulga	e4ee5ca698	Revert D31326599: [pytorch][PR] Compile without -Wno-unused-variable Test Plan: revert-hammer Differential Revision: D31326599 (`a6280ab653`) Original commit changeset: 924155f1257a fbshipit-source-id: b8ee5bc0298637443232f5ee9ec79e51ed256faf	2021-10-01 20:40:47 -07:00
Nikita Shulga	a6280ab653	Compile without -Wno-unused-variable (#65954 ) Summary: Delete `-Wno-unused-variable` from top level `CMakeLists.txt` Still suppress those warnings for tests and `torch_python` Delete number of unused variables from caffe2 code Use `(void)var;` to suppress unused variable in range loops Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954 Reviewed By: ngimel Differential Revision: D31326599 Pulled By: malfet fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3	2021-10-01 17:40:47 -07:00
Scott Wolchok	ffede499b2	[PyTorch][Static Runtime] Fast path for contiguous to_copy (#65499 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499 When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc. ghstack-source-id: 139549027 Test Plan: Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options). Before: I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252 I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871 I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916 I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093 I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971 mean: 7.144172, stddev: 0.1283 After: I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122 I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579 I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033 I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833 I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. 
Iters per second: 146.85 mean: 6.800632, stddev: 0.013227 Looks like about a 5.3% improvement. Reviewed By: hlu1 Differential Revision: D31125492 fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823	2021-10-01 12:13:33 -07:00
Vasiliy Kuznetsov	e3af4be963	pytorch quantization ao migration phase 2: caffe2/benchmark (#65833 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833 Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks` folder. ``` find caffe2/benchmarks/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g" ``` Test Plan: CI Reviewed By: z-a-f Differential Revision: D31275963 fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116	2021-10-01 06:17:36 -07:00
Mikhail Zolotukhin	3a0165da49	[TensorExpr] Port NNC lowerings to the new registry mechanism. (#65551 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551 Previously we had a big switch on Op kind to decide how to lower a given JIT operator to NNC. This PR changes this switch to a hash table lookup. Why? This helps us with at least two things: 1) With this approach we can easily check if we know how to handle a given node in advance - i.e. we can inspect the entire graph and tell whether it's possible to compile it or not without actually trying to do that and dying in the middle. This would allow us to, say, provide user-friendly error messages in AOT workflow. 2) We can switch to use schema instead of op kind to determine correct lowering. Unlike op schema, op kind might be ambiguous (see e.g. #64963) and using it instead of schema can lead to bugs. Test Plan: Imported from OSS Reviewed By: navahgar Differential Revision: D31148926 Pulled By: ZolotukhinM fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704	2021-09-30 22:56:18 -07:00
Raghavan Raman	8f3983254b	[MicroBench] Added a micro benchmark for prefix sum (#65790 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65790 Here are the results of the benchmark: * ATen - version that calls `at::cumsum` * NNC - a simple prefix-sum loop implemented in NNC (not vectorized) * Local - a C++ implementation of the simple prefix-sum loop * LocalAVX2 - a vectorized C++ implementation of prefix-sum, only using AVX2 * LocalAVX512 - a vectorized C++ implementation of prefix-sum, using AVX512. The vectorized implementations are from the paper "Parallel Prefix Sum with SIMD" in ADMS' 20. ``` $ OMP_NUM_THREADS=1 ./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench Run on (36 X 1601 MHz CPU s) 2021-09-28 23:13:12 ------------------------------------------------------------------------------------------ Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------ PrefixSumBench/ATen/64 1289 ns 1289 ns 543199 GB/s=397.069M/s PrefixSumBench/ATen/256 1867 ns 1867 ns 374232 GB/s=1096.8M/s PrefixSumBench/ATen/1024 4169 ns 4169 ns 167889 GB/s=1.9649G/s PrefixSumBench/ATen/4096 14137 ns 14136 ns 49266 GB/s=2.31806G/s PrefixSumBench/ATen/16384 49887 ns 49883 ns 13988 GB/s=2.6276G/s PrefixSumBench/ATen/65536 193742 ns 193686 ns 3628 GB/s=2.7069G/s PrefixSumBench/ATen/262144 764803 ns 764774 ns 917 GB/s=2.74219G/s PrefixSumBench/ATen/1048576 3040653 ns 3040277 ns 231 GB/s=2.75916G/s PrefixSumBench/Local/64 586 ns 586 ns 1197003 GB/s=873.244M/s PrefixSumBench/Local/256 1077 ns 1077 ns 646265 GB/s=1.90143G/s PrefixSumBench/Local/1024 3050 ns 3050 ns 229458 GB/s=2.68579G/s PrefixSumBench/Local/4096 11910 ns 11910 ns 58953 GB/s=2.75132G/s PrefixSumBench/Local/16384 43204 ns 43202 ns 16081 GB/s=3.03393G/s PrefixSumBench/Local/65536 167966 ns 167966 ns 4154 GB/s=3.12139G/s PrefixSumBench/Local/262144 667631 ns 667613 ns 1048 
GB/s=3.14127G/s PrefixSumBench/Local/1048576 2654785 ns 2654631 ns 264 GB/s=3.15999G/s PrefixSumBench/NNC/64 642 ns 642 ns 1095277 GB/s=797.442M/s PrefixSumBench/NNC/256 1139 ns 1138 ns 617214 GB/s=1.799G/s PrefixSumBench/NNC/1024 3103 ns 3103 ns 225531 GB/s=2.63979G/s PrefixSumBench/NNC/4096 12053 ns 12052 ns 58084 GB/s=2.71883G/s PrefixSumBench/NNC/16384 43227 ns 43225 ns 16192 GB/s=3.03231G/s PrefixSumBench/NNC/65536 168065 ns 168056 ns 4153 GB/s=3.11972G/s PrefixSumBench/NNC/262144 668974 ns 668921 ns 1045 GB/s=3.13513G/s PrefixSumBench/NNC/1048576 2657464 ns 2657341 ns 263 GB/s=3.15677G/s PrefixSumBench/LocalAVX2/64 523 ns 523 ns 1351308 GB/s=979.537M/s PrefixSumBench/LocalAVX2/256 755 ns 755 ns 927762 GB/s=2.71159G/s PrefixSumBench/LocalAVX2/1024 1759 ns 1759 ns 400355 GB/s=4.65609G/s PrefixSumBench/LocalAVX2/4096 6708 ns 6706 ns 103959 GB/s=4.88649G/s PrefixSumBench/LocalAVX2/16384 22143 ns 22142 ns 31229 GB/s=5.91951G/s PrefixSumBench/LocalAVX2/65536 83649 ns 83642 ns 8350 GB/s=6.26828G/s PrefixSumBench/LocalAVX2/262144 330433 ns 330427 ns 2133 GB/s=6.34679G/s PrefixSumBench/LocalAVX2/1048576 1302301 ns 1302179 ns 537 GB/s=6.44198G/s PrefixSumBench/LocalAVX512/64 474 ns 474 ns 1459151 GB/s=1080.8M/s PrefixSumBench/LocalAVX512/256 576 ns 576 ns 1217442 GB/s=3.55524G/s PrefixSumBench/LocalAVX512/1024 994 ns 994 ns 703387 GB/s=8.24434G/s PrefixSumBench/LocalAVX512/4096 3642 ns 3641 ns 190646 GB/s=8.99857G/s PrefixSumBench/LocalAVX512/16384 10140 ns 10140 ns 68947 GB/s=12.9267G/s PrefixSumBench/LocalAVX512/65536 35739 ns 35736 ns 19567 GB/s=14.6711G/s PrefixSumBench/LocalAVX512/262144 156415 ns 156413 ns 4467 GB/s=13.4078G/s PrefixSumBench/LocalAVX512/1048576 613952 ns 613876 ns 1144 GB/s=13.665G/s ``` Test Plan: Imported from OSS Reviewed By: bertmaher Differential Revision: D31253849 Pulled By: navahgar fbshipit-source-id: f33e7be787c86a09e90babddd66b16e2e0777eb4	2021-09-30 14:44:52 -07:00
Mike Iovine	5f7ab7be6f	[Static Runtime] concat_add_mul_replacenan_clip retains axis arg (#65741 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65741 This op previously assumed `axis == 1`, causing graphs that would otherwise be valid to return incorrect results after fusing. Reviewed By: hlu1 Differential Revision: D31234944 fbshipit-source-id: 89885a3b119357698ebd9fd429b009813260a2f4	2021-09-29 08:04:20 -07:00
Philip Meier	aebde1bc2b	deprecate device getter from `torch.testing` namespace (#63844 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844 Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D31141433 Pulled By: mruberry fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732	2021-09-29 02:40:52 -07:00
Kushashwa Ravi Shrimali	4752453d27	[Structured Kernels] Port for `baddbmm` and `bmm` (#64805 ) Summary: This PR attempts to port `baddbmm` and `bmm` to structured kernels. The reason it's in the same PR: because a lot of it is common for both the ops, including the checks and implementation. Issue tracker: https://github.com/pytorch/pytorch/issues/55070 cc: ysiraichi ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805 Reviewed By: gchanan Differential Revision: D31134454 Pulled By: ezyang fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261	2021-09-28 11:07:31 -07:00
Ben Koopman	6a6ee92e36	[quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241 Test Plan: Imported from OSS Reviewed By: jingsh Differential Revision: D31150087 Pulled By: b-koopman fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727	2021-09-27 16:01:49 -07:00
Mike Iovine	ef9e560796	[Static Runtime] Add aten::remainder out variant (#64967 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64967 Out variant implementation for `aten::remainder`. Added both scalar and tensor overloads. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Remainder` Reviewed By: d1jang Differential Revision: D30915469 fbshipit-source-id: 9f27f18c86d66b11eac0aa4659c7062cb785b7e9	2021-09-24 07:51:39 -07:00
Raghavan Raman	31584d065e	[Static Runtime] Added NNC implementation for signed log1p kernel. (#65387 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387 Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op. Also, added a SR microbenchmark for this kernel which shows the performance improvement. Without fusion: ``` -------------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------------- BM_signed_log1p/16 1953 ns 1953 ns 358746 BM_signed_log1p/64 2049 ns 2049 ns 342145 BM_signed_log1p/512 3291 ns 3291 ns 214342 BM_signed_log1p/4096 15559 ns 15559 ns 44420 BM_signed_log1p/32768 101936 ns 101935 ns 6843 BM_signed_log1p/65536 194792 ns 194789 ns 3615 ``` With NNC fusion: ``` -------------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------------- BM_signed_log1p/16 369 ns 369 ns 1896179 BM_signed_log1p/64 497 ns 497 ns 1406995 BM_signed_log1p/512 1618 ns 1618 ns 430209 BM_signed_log1p/4096 11327 ns 11326 ns 61463 BM_signed_log1p/32768 84099 ns 84086 ns 8325 BM_signed_log1p/65536 166531 ns 166510 ns 4186 ``` This clearly shows >15% improvement in performance of this kernel with NNC fusion. 
On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops: without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved) with NNC fusion: `0.55%` Test Plan: `buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p` Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1) ``` get 57220 prediction values get 57220 prediction values max_error: 0 total: 0 ``` Reviewed By: hlu1 Differential Revision: D30609492 fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd	2021-09-22 15:53:33 -07:00
Rodrigo Berriel	a0dea074b2	Remove `.data` from benchmarks and tensorboard (#65389 ) Summary: Related to https://github.com/pytorch/pytorch/issues/30987 and https://github.com/pytorch/pytorch/issues/33628. Fix the following tasks: - Remove the use of `.data` in all our internal code: - [x] `benchmarks/` - [x] `torch/utils/tensorboard/` cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23 albanD gchanan Pull Request resolved: https://github.com/pytorch/pytorch/pull/65389 Reviewed By: soulitzer Differential Revision: D31093464 Pulled By: albanD fbshipit-source-id: 3a9c8834fd544a59a1cc2b930ae538fd1d46b232	2021-09-22 11:16:59 -07:00
jiej	127c9402d0	Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137 ) Summary: This reverts commit `03389dc851`. Attempt again for PR: https://github.com/pytorch/pytorch/issues/63745 Fixes the windows build failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137 Reviewed By: seemethere, dzhulgakov, heitorschueroff Differential Revision: D30994556 Pulled By: malfet fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d	2021-09-22 04:54:51 -07:00
Hao Lu	ce101fed02	[PyPer] copy-free freeze_module (#65118 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65118 Cloning the module can increase memory use. By freezing the module directly without cloning it first, we can avoid this memory usage increase. Reviewed By: eellison, movefast1990 Differential Revision: D30955053 fbshipit-source-id: 2feb738eddcf66aa68c92bf695cc05b57bd990f0	2021-09-20 17:25:10 -07:00
Mike Iovine	99e4ab5d44	[Static Runtime] Implement and enable variadic tuple unpack (#64934 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934 Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from: ``` %0, %1 = prim::TupleUnpack(%a) %2, %3 = prim::TupleUnpack(%b) ``` into: ``` %0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b) ``` The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about. Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack` Reviewed By: d1jang Differential Revision: D30872109 fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9	2021-09-20 10:36:11 -07:00
Don Jang	ae00075ac7	[Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65123 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123 This change re-reverts D30883290 (`0e11454d19`). D30883290 (`0e11454d19`) broke the OSS build since it implicitly removed the default move constructor of `StaticRuntime`. ``` Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime' Sep 15 15:39:57 return torch::jit::StaticRuntime(*smod); Sep 15 15:39:57 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_; Sep 15 15:39:57 ^ Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete; Sep 15 15:39:57 ^ Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime' Sep 15 15:39:57 auto sr = getStaticRuntime(); Sep 15 15:39:57 ^ ~~~~~~~~~~~~~~~~~~ Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_; Sep 15 15:39:57 ^ Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete; Sep 15 15:39:57 ^ Sep 15 15:39:57 2 errors generated. ``` This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine). Original Summary: This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp. `MemoryPlanner` performs the independent sub-tasks of statically analyzing a graph, creating a memory plan, and allocating/deallocating managed Tensors. This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support. Test Plan: - Confirm that OSS build went well (See External Tests section). Reviewed By: mikeiovine Differential Revision: D30983292 fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116	2021-09-17 13:32:01 -07:00
albanD	473e55d5b2	Use classmethods for overrides (#64841 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64841 Test Plan: Imported from OSS Reviewed By: heitorschueroff Differential Revision: D30991424 Pulled By: albanD fbshipit-source-id: 551e2119768f3a4292713f3bfa83930f5506adbd	2021-09-17 08:32:49 -07:00
Don Jang	8241193d76	[Static Runtime] Introduce static_runtime::dict_unpack (#64771 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64771 Test Plan: - Added `StaticRuntime.RemoveImmutableInputDictLookupsWithImmutableInputDict` - Added `StaticRuntime.RemoveImmutableInputDictLookupsWithMutableInputDict` - TBD: Perf impact measurement Reviewed By: mikeiovine Differential Revision: D30685083 fbshipit-source-id: 050a92ef3b3ed0fdc0ab7a13a4b5dbfede9342a9	2021-09-16 23:25:13 -07:00
Eli Uriegas	03389dc851	Revert D30752939: [pytorch][PR] nvfuser update Test Plan: revert-hammer Differential Revision: D30752939 (`cfaecaf40b`) Original commit changeset: ce122e80f01b fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2	2021-09-15 17:38:47 -07:00
jiej	cfaecaf40b	nvfuser update (#63745 ) Summary: Syncing nvfuser code base from devel branch. Listing a few of our developments since the last sync: - Extends support to normalization and reduction kernels. - Multiple kernel launch for single `CudaFusionGroup`. Hierarchical caching system has been updated to cache graph segmentation. - profile_ivalue is enabled to convert dynamic scalars into compile-time constants, which are required by the codegen (e.g. reduction axes). To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle. Internal updates are files located in: 1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda` 2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser` 3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h` Updates affecting integration: 1. profile_ivalue enabled for nvfuser. Related changes are in `torch/csrc/jit/runtime/`, 2. exposed a few more symbols in `aten/src/ATen/core/` used by codegen Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745 Reviewed By: saketh-are Differential Revision: D30752939 Pulled By: malfet fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c	2021-09-15 14:42:55 -07:00
Don Jang	3fb33b38b9	[Static Runtime] Check if outputs of a node do not overlap with each other (#63013 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013 This change enhances the current memory overlapping check to include outputs: the enhancement enforces a constraint that all outputs of a node should NOT overlap with each other since they are supposed to be updated by a node at the same time, holding the node's outputs. This check will detect a problem like T97393697 immediately in debug mode. Test Plan: - Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs` - Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run. Reviewed By: hlu1 Differential Revision: D30211705 fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0	2021-09-15 08:38:05 -07:00
Mikhail Zolotukhin	f23f21dafe	[TensorExpr] Remove 'Placeholder' class. (#64887 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887 BufHandle has exactly the same functionality and should be used instead. Differential Revision: D30889483 D30889483 Test Plan: Imported from OSS Reviewed By: navahgar Pulled By: ZolotukhinM fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3	2021-09-14 00:22:44 -07:00
Eddie Ren	9c73a48ecf	ND Embeddings benchmark - Standardize randomized inputs (#64707 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707 Use torch.randn instead of torch.from_numpy to generate the tensor Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test Reviewed By: jingsh Differential Revision: D30817302 fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672	2021-09-13 06:47:35 -07:00
Raghavan Raman	2cc9778495	[MicroBench] Added a log_vml version of the signed log1p kernel (#64205 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64205 The log_vml version of the micro-bench is over 2x faster than the log1p version. Here are the perf numbers: ``` --------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... --------------------------------------------------------------------------------------------- SignedLog1pBench/ATen/10/1467 45915 ns 45908 ns 14506 GB/s=2.5564G/s SignedLog1pBench/NNC/10/1467 40469 ns 40466 ns 17367 GB/s=2.9002G/s SignedLog1pBench/NNCLogVml/10/1467 19560 ns 19559 ns 35902 GB/s=6.00016G/s ``` Thanks to bertmaher for pointing this out. Test Plan: Imported from OSS Reviewed By: bertmaher Differential Revision: D30644716 Pulled By: navahgar fbshipit-source-id: ba2b32c79d4265cd48a2886b0c62d0e89ff69c19	2021-09-10 16:49:06 -07:00
Eddie Ren	3fbb49e75d	Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647 Add support for benchmarking of 8 bit quantizations of N-D batched embeddings. Currently only works for 3Dim embeddings and still requires thought on ramping up from 3Dim to NDim. Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test``` Reviewed By: jingsh Differential Revision: D30770085 fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e	2021-09-10 16:49:02 -07:00
Mike Iovine	616fd9219d	[Static Runtime] Add sign/abs/lop1p/mul fusion pass (#64209 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209 Add a new fusion pass that transforms the following pattern: ``` graph(%input): %0 : Tensor = aten::sign(%input) %1 : Tensor = aten::abs(%input) %2 : Tensor = aten::log1p(%1) %res : Tensor = aten::mul(%0, %2) return (%res) ``` Into a single op: ``` graph(%input): %res : Tensor = static_runtime::signed_log1p(%input) return (%res) ``` The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff does not enable this pass. Followup: navahgar will add an NNC kernel which is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p` Test passed with new graph pass disabled and enabled. Reviewed By: hlu1 Differential Revision: D30559929 fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a	2021-09-02 08:31:40 -07:00
Ray Peng	09e610e36d	[Static Runtime] Out version for softmax (#64243 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64243 Test Plan: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ... V0830 16:35:22.524479 613839 impl.cpp:1410] Switch to out variant for node: %5 : Tensor = aten::softmax(%a.1, %dim.1, %dtype.1) ... [ OK ] StaticRuntime.IndividualOps_Softmax (803 ms) ``` Reviewed By: hlu1 Differential Revision: D30656149 fbshipit-source-id: 115b7b4a75448fd6a5c526808080ca9a4251302c	2021-08-31 18:33:26 -07:00
Harut Movsisyan	3c15822f5f	[Static Runtime] Implement aten::nonzero out variant (#64126 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64126 Test Plan: Confirm out variant is called: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ``` Reviewed By: mikeiovine Differential Revision: D30617729 fbshipit-source-id: 752749638c8f467815efa57021cb3de5c728ab1b	2021-08-31 00:51:15 -07:00
Harut Movsisyan	1f16c22dc8	[Static Runtime] Implement aten::cumsum out variant (#64159 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159 Test Plan: Confirm out variant is called for both versions: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ``` Reviewed By: mikeiovine Differential Revision: D30622819 fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736	2021-08-30 16:18:22 -07:00
Harut Movsisyan	e24c3644d8	[Static Runtime] aten::cat out version when it is not being replaced by prim::VarConcat (#64157 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157 UseVariadicCat optimization is not applied to aten::cat if list input to the op can not be moved to the position before op (https://fburl.com/diffusion/l6kweimu). For these cases we will need out version for SR. Test Plan: Confirm out variant is called: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ``` Reviewed By: d1jang Differential Revision: D30598574 fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a	2021-08-30 09:42:38 -07:00
Raghavan Raman	dc4fd3bdda	[MicroBench] Added a micro benchmark for a signed log1p kernel. (#64032 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64032 Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D30579198 Pulled By: navahgar fbshipit-source-id: a53d68225fba768b26491d14b535f8f2dcf50c0e	2021-08-30 09:27:51 -07:00
Harut Movsisyan	8af1407eab	[Static Runtime] Out version for torch.linalg.norm (#64070 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070 Test Plan: Confirm out variant is called for both versions: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ``` Reviewed By: d1jang Differential Revision: D30595816 fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563	2021-08-29 21:00:11 -07:00
Don Jang	9f1f22b9bc	[Static Runtime] Add out variant of quantized::embedding_bag_byte_prepack (#64081 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64081 This change adds an out variant of `quantized::embedding_bag_byte_prepack`. Test Plan: - Added `ShapeInferenceTest.QEmbeddingBagByteUnpack`. - Observed ``` V0824 13:38:49.723708 1322143 impl.cpp:1394] Switch to out variant for node: %2 : Tensor = quantized::embedding_bag_byte_prepack(%input) ``` Reviewed By: hlu1 Differential Revision: D30504216 fbshipit-source-id: 1d9d428e77a15bcc7da373d65e7ffabaf9c6caf2	2021-08-27 10:53:23 -07:00
Harut Movsisyan	f2c47cf4db	[Static Runtime] Out version for fmod (#64046 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64046 Test Plan: Confirm out variant is used: ``` > //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 V0826 23:31:30.321382 193428 impl.cpp:1395] Switch to out variant for node: %4 : Tensor = aten::fmod(%a.1, %b.1) ``` Reviewed By: mikeiovine Differential Revision: D30581228 fbshipit-source-id: dfab9a16ff8afd40b29338037769f938f154bf74	2021-08-27 03:05:06 -07:00
Don Jang	c90b3cb1da	[Static Runtime] Manage temporary Tensors for aten::layer_norm (#64078 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078 This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the static runtime manages the `tmp1` and `tmp2` Tensors. Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it: ``` at::Tensor mean = create_empty_from({M}, X); at::Tensor rstd = create_empty_from({M}, X); ``` that the static runtime misses an opportunity to manage. This change puts them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime as of now chooses to manage only output tensors. Test Plan: - Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated. - Confirmed that the new op gets activated during testing: ``` V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3) ``` Reviewed By: hlu1 Differential Revision: D30486475 fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d	2021-08-27 02:44:43 -07:00
Don Jang	cbfec02007	[Static Runtime] Add native op for aten::expand_as (#64024 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64024 `aten::expand_as` creates a view of the input tensor. This change adds its native op implementation for the static runtime. Test Plan: - Added `StaticRuntime.IndividualOps_ExpandAs` Reviewed By: hlu1 Differential Revision: D30546851 fbshipit-source-id: e53483048af890bc41b6192a1ab0c5ba0ee2bdc0	2021-08-26 13:05:53 -07:00
Hao Lu	6fa646ad54	[StaticRuntime] Fix bug in HasInplaceOp (#63842 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63842 Reviewed By: mikeiovine Differential Revision: D30506914 fbshipit-source-id: b2e358cfb991dacdb295b61bbc37beb36b73b852	2021-08-24 17:07:45 -07:00
Harut Movsisyan	956c8fa01e	Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654 Test Plan: ``` > buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B4_M5_N3_K2_cpu # Input: B: 4, M: 5, N: 3, K: 2, device: cpu Forward Execution Time (us) : 27.970 # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B32_M25_N20_K30_cpu # Input: B: 32, M: 25, N: 20, K: 30, device: cpu Forward Execution Time (us) : 41.830 # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B128_M100_N120_K110_cpu # Input: B: 128, M: 100, N: 120, K: 110, device: cpu Forward Execution Time (us) : 499.114 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B4_M5_N3_K2_cpu # Input: B: 4, M: 5, N: 3, K: 2, device: cpu Forward Execution Time (us) : 6.268 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B32_M25_N20_K30_cpu # Input: B: 32, M: 25, N: 20, K: 30, device: cpu Forward Execution Time (us) : 12.676 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B128_M100_N120_K110_cpu # Input: B: 128, M: 100, N: 120, K: 110, device: cpu Forward Execution Time (us) : 438.219 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B4_M5_N3_cpu # Input: B: 4, M: 5, N: 3, device: cpu Forward Execution Time (us) : 7.657 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B32_M25_N20_cpu # Input: B: 32, M: 25, N: 20, device: cpu Forward Execution Time (us) : 18.523 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B100_M90_N110_cpu # Input: B: 100, M: 90, N: 110, device: cpu Forward Execution Time (us) : 55.103 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B4_M5_N3_cpu # Input: B: 4, M: 5, N: 3, device: cpu Forward Execution Time (us) : 2.501 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B32_M25_N20_cpu # Input: B: 32, M: 25, N: 20, device: cpu Forward Execution Time (us) : 10.589 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B100_M90_N110_cpu # Input: B: 100, M: 90, N: 110, device: cpu Forward Execution Time (us) : 50.102 ``` Reviewed By: ajyu Differential Revision: D30455179 fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b	2021-08-24 16:26:26 -07:00
Mike Iovine	7774a4e95b	[Static Runtime] Implement prim::VarStack out variant (#63579 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579 Provide a static runtime out variant implementation for the new op introduced in D30426232 (`1385f9fb12`). Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack` Reviewed By: navahgar Differential Revision: D30410525 fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8	2021-08-24 09:44:29 -07:00
Mikhail Zolotukhin	f0d274294d	[TensorExpr] Nuke KernelArena and KernelScope. (#63587 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587 Now that there is no classes using KernelArena for memory management we can remove it. Differential Revision: D30429115 D30429115 Test Plan: Imported from OSS Reviewed By: navahgar Pulled By: ZolotukhinM fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544	2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin	62d02f2b57	[TensorExpr] Make 'Tensor' a value type. (#63586 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586 This is another commit in transition from KernelArena memory management. Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need to dynamically allocate it at all - it's cheap to pass it by value, and that's what we're switching to in this commit. After this change nothing uses KernelScope/KernelArena and they can be safely removed. Differential Revision: D30429114 D30429114 Test Plan: Imported from OSS Reviewed By: navahgar Pulled By: ZolotukhinM fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819	2021-08-24 00:32:13 -07:00
Mikhail Zolotukhin	dd96c26066	[TensorExpr] More NFC changes like Expr* -> ExprPtr. (#63778 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778 This is a preparation for a switch from raw pointers to shared pointers as a memory model for TE expressions and statements. Test Plan: Imported from OSS Reviewed By: navahgar Differential Revision: D30487425 Pulled By: ZolotukhinM fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c	2021-08-24 00:30:49 -07:00
Don Jang	84890aae35	[Static Runtime] Add an out variant op for aten::abs (#63675 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63675 This change adds an out variant implementation for `aten::abs`. Test Plan: - Observed `V0820 14:14:08.880342 101788 impl.cpp:1394] Switch to out variant for node: %3 : Tensor = aten::abs(%a.1)` - Perf impact: TBD Reviewed By: hlu1 Differential Revision: D30461317 fbshipit-source-id: 0c0230bd40afe463ae1ccb222c2a1207ebcf4191	2021-08-23 16:25:10 -07:00
Hao Lu	b2a601ffe5	[Static Runtime] Implement out variant for fb::quantized_linear (#63635 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63635 Reviewed By: ajyu Differential Revision: D30446234 fbshipit-source-id: 1ef014186ff725930a97d0159626f9233ee74030	2021-08-20 21:42:22 -07:00
Don Jang	913c1f83f4	[Static Runtime] Add native op for aten::detach (#63625 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63625 This change adds a static runtime's native op implementation for `aten::detach` op. See the standard `aten::detach`'s implementation (https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp.html#_ZN2at6native6detachERKNS_6TensorE ) for comparison. Test Plan: - Added `StaticRuntime.IndividualOps_Detach`. - Observed ``` V0819 18:55:33.181188 3092034 impl.cpp:1398] Switch to native impl for node: %a.1 : Tensor = aten::detach(%input.1) ``` Reviewed By: hlu1 Differential Revision: D30443187 fbshipit-source-id: d6e0eadb1b817e0a126c4fc97526abc276ee8a17	2021-08-20 00:46:27 -07:00
Philip Meier	99203580a9	Updates internal `assert_allclose` callsites in favor of `assert_close` (#61841 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61841 Redo of #60863. Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D30408145 Pulled By: mruberry fbshipit-source-id: 0b34ebc7f23ba38ecd89640b61d8aca59b7eab58	2021-08-19 12:50:41 -07:00
Mike Iovine	47a9e8ff32	[Static Runtime] Support __getitem__ for lists (#63398 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63398 This change provides a native `__getitem__` implementation for lists to avoid overhead associated with falling back to the JIT interpreter. Test Plan: Unit tests: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D30368464 fbshipit-source-id: e0e0971508cd5d9bcf6025606993dc24ecbf6764	2021-08-19 06:38:51 -07:00

869 Commits