Summary:
Port addmm to structured kernel (see the sketch below).
Follow-ups:
- migrate `mm` and `addbmm` to structured kernels
- move the TORCH_CHECKs currently in `addmm_cpu_impl_` and `addmm_out_cuda_impl` to the meta function
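For reference, a minimal sketch of what a structured-kernel port looks like. This is illustrative only: the macro names are real, but the exact checks and the impl name `addmm_out_cpu` are assumptions, not the code in this PR.
```
// Inside ATen's meta/native namespaces. The meta function does shape checking
// and declares the output; the impl function only writes into that output.
TORCH_META_FUNC(addmm)(const Tensor& self, const Tensor& mat1, const Tensor& mat2,
                       const Scalar& beta, const Scalar& alpha) {
  TORCH_CHECK(mat1.dim() == 2 && mat2.dim() == 2, "matrices expected");
  TORCH_CHECK(mat1.sizes()[1] == mat2.sizes()[0],
              "mat1 and mat2 shapes cannot be multiplied");
  set_output(0, {mat1.sizes()[0], mat2.sizes()[1]}, self.options());
}

TORCH_IMPL_FUNC(addmm_out_cpu)(const Tensor& self, const Tensor& mat1, const Tensor& mat2,
                               const Scalar& beta, const Scalar& alpha, const Tensor& result) {
  // compute beta * self + alpha * (mat1 @ mat2) into the pre-allocated `result`
}
```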
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57417
Reviewed By: bdhirsh
Differential Revision: D28291001
Pulled By: walterddr
fbshipit-source-id: 4eafaa30a465e225fbb4d2a69a36f1e037df9122
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58100
aten::clone has a second arg, memory_format, which was not previously supported.
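For illustration, a minimal sketch of what honoring the second argument amounts to; `clone_with_format` is a hypothetical helper, and the actual Static Runtime registration differs.
```
#include <ATen/ATen.h>

// aten::clone(Tensor self, *, MemoryFormat? memory_format=None) -> Tensor
at::Tensor clone_with_format(const at::Tensor& src,
                             c10::optional<c10::MemoryFormat> memory_format) {
  // Forward the optional memory_format instead of silently dropping it.
  return src.clone(memory_format);
}
```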
Reviewed By: ajyu
Differential Revision: D28347171
fbshipit-source-id: e083cc24c3228048429bba3497326415bc3d1f5a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57578
The original impl in SR assumes that eps is a constant, which is true most of the time. However, it can also be a graph input. This diff fixes that and adds unit tests.
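A minimal sketch of the idea, assuming eps arrives as an IValue that is read at execution time rather than being baked in at load time; the helper name is hypothetical.
```
#include <ATen/core/ivalue.h>

// Works whether eps comes from a prim::Constant or from a graph input, since
// the value is only inspected when the node actually runs.
double read_eps(const c10::IValue& eps_ival, double default_eps = 1e-5) {
  return eps_ival.isNone() ? default_eps : eps_ival.toDouble();
}
```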
Reviewed By: edvgha
Differential Revision: D28207975
fbshipit-source-id: 9a10dec159f3804e43ef74aaa20c3ec6c79548c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521
When an op is added to Static Runtime, we manually check the schema (not with the JIT schema check, but with IValue::isTensor()/isInt() etc.) to make sure it's one we support. If the schema doesn't match, SR throws an exception via TORCH_CHECK, which makes the entire graph invalid for SR.
This diff makes ops with unsupported schemas take the fallback path so that they go through the dispatcher instead:
```
if (node->kind() != prim::ListConstruct &&
    node->kind() != prim::TupleConstruct &&
    node->kind() != prim::DictConstruct &&
    node->kind() != prim::ListUnpack) {
  const Operator& op = node->getOperator();
  TORCH_CHECK(op.hasOperation());
  op_ = op.getOperation(node);
  VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
}
```
The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg forms are supported), can now run in Static Runtime via this fallback mode.
Reviewed By: ajyu
Differential Revision: D27531447
fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56444
Added an out version for layer_norm (see the sketch below).
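As a rough sketch of the out-variant contract Static Runtime relies on (allocate the output once, keep writing into it on later iterations); `layer_norm_out_sketch` is a hypothetical stand-in, not the kernel added in this diff, and a real out variant computes directly into `out` without the intermediate allocation shown here.
```
#include <ATen/ATen.h>
#include <utility>

void layer_norm_out_sketch(at::Tensor& out, const at::Tensor& input,
                           at::IntArrayRef normalized_shape, double eps) {
  at::Tensor result =
      at::layer_norm(input, normalized_shape, /*weight=*/{}, /*bias=*/{}, eps);
  if (!out.defined()) {
    out = std::move(result);       // first iteration: take over the allocation
  } else {
    out.resize_(result.sizes());   // later iterations: reuse the existing buffer
    out.copy_(result);
  }
}
```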
Test Plan:
buck test caffe2/aten:math_kernel_test -- NativeLayerNorm
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D27873846
fbshipit-source-id: 53ee9fec4ff9a4e78198b031e86b5afd013626dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56841
- Move arg checks outside the lambda so that they run at Static Runtime initialization time
- use `optional` where possible
- support the `to.other` overload, the 5-arg variant of `torch.to` (see the sketch below)
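A hedged sketch of what the `to.other` handling might look like. `SROperator` and `ProcessedNode` are Static Runtime types (torch/csrc/jit/runtime/static), but the actual registration and argument handling in this diff may differ, and the optional memory_format argument is elided here.
```
// Schema: to.other(Tensor self, Tensor other, bool non_blocking=False,
//                  bool copy=False, MemoryFormat? memory_format=None) -> Tensor
SROperator make_to_other_op(torch::jit::Node* n) {
  // Initialization-time check: runs once when the runtime is built,
  // not on every inference call.
  TORCH_CHECK(n->inputs().size() == 5, "expected the 5-arg to.other overload");
  return [](ProcessedNode* p_node) {
    const auto& self = p_node->Input(0).toTensor();
    const auto& other = p_node->Input(1).toTensor();
    const bool non_blocking = p_node->Input(2).toBool();
    const bool copy = p_node->Input(3).toBool();
    // Input(4) is the optional memory_format, omitted in this sketch.
    p_node->Output(0) = self.to(other, non_blocking, copy);
  };
}
```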
Test Plan:
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test -- --run-disabled
```
Reviewed By: edvgha
Differential Revision: D27933176
fbshipit-source-id: 49d6249c8784c44146461e286e7a301596172d7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447
MemoryPlanner shouldn't manage StorageImpls; instead, it should manage the TensorImpls because the StorageImpl in Tensors can change.
Test Plan: CI
Reviewed By: ajyu
Differential Revision: D27840361
fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55811
- Added manage_graph_output_memory flag to opts (default: false)
- Added checks for the flag dependencies among enable_out_variant, optimize_graph_output_memory, and optimize_memory (see the sketch after this list)
- Minor refactoring for readability
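A minimal sketch of the flag-dependency check, assuming option names as described above; the field names and the exact dependency rules are illustrative, not the actual options struct.
```
#include <c10/util/Exception.h>

struct OptionsSketch {
  bool enable_out_variant = true;
  bool optimize_memory = true;
  bool manage_graph_output_memory = false; // new flag, default false
};

void check_options(const OptionsSketch& opts) {
  // Memory optimizations only make sense when out variants are enabled.
  TORCH_CHECK(!opts.optimize_memory || opts.enable_out_variant,
              "optimize_memory requires enable_out_variant");
  TORCH_CHECK(!opts.manage_graph_output_memory || opts.enable_out_variant,
              "manage_graph_output_memory requires enable_out_variant");
}
```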
Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'
Reviewed By: hlu1
Differential Revision: D27573780
fbshipit-source-id: 28698657f686f27b8ad60e1276cdf17402d2cf91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54657
The constraint checked in D27145406 (acf03b13f1) is too tight for the adindexer model; as a result, 5 ops (4 aten::narrow + 1 aten::permute) are not replaced with their copy versions, which caused a perf regression. This diff checks for inplace ops explicitly and only applies the input constraint to graphs that contain inplace ops.
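A sketch of the kind of check this describes, assuming in-place aten ops are identified by the trailing-underscore naming convention; the actual implementation may rely on alias analysis instead.
```
#include <torch/csrc/jit/ir/ir.h>
#include <memory>
#include <string>

bool graph_has_inplace_ops(const std::shared_ptr<torch::jit::Graph>& graph) {
  for (const torch::jit::Node* node : graph->nodes()) {
    const std::string name = node->kind().toQualString(); // e.g. "aten::add_"
    if (!name.empty() && name.back() == '_') {
      return true;
    }
  }
  return false;
}
```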
Test Plan: Contbuild
Reviewed By: ajyu
Differential Revision: D27253145
fbshipit-source-id: 23e2b1a018c84dd0fc2880fddd9c41bc0422b8eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54467
`at::native::copy_` requires src and dest to have the same sizes, which isn't guaranteed in reshape.
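A minimal sketch of the shape handling the fix needs; the helper is hypothetical and assumes `out` is an already-allocated tensor, as in the SR out-variant path.
```
#include <ATen/ATen.h>

void reshape_copy_out(at::Tensor& out, const at::Tensor& self, at::IntArrayRef shape) {
  at::Tensor reshaped = self.reshape(shape); // may or may not be a view of `self`
  out.resize_(reshaped.sizes());             // make dst sizes match before copy_
  out.copy_(reshaped);                       // now legal: src/dst sizes agree
}
```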
Test Plan: Added new test cases to cover this case.
Reviewed By: ajyu
Differential Revision: D27249617
fbshipit-source-id: 2c95175fa8564b3c648979445ad4314f97818852
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54353
The current implementation of reshape/flatten is problematic because the output is sometimes a tensor view and sometimes not; it depends entirely on the graph IR and the input shapes. Replacing them with the copy versions makes the behavior deterministic: the output is always a copy rather than a view.
Reviewed By: ajyu, edvgha
Differential Revision: D26358525
fbshipit-source-id: ee7571317b061221a8d50083676cded388ce6f87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52692
Porting `at::mul` to structured.
One other issue I hit with the port was that a bunch of other places around the code base used to call variants of `at::native::mul`, which no longer exists. *Technically*, `at::cpu::mul` does the equivalent thing now, so I patched most call sites to use that. There were two other places where I did something slightly different (calling `at::cuda::mul` and `at::mul`, respectively), which I called out in the comments.
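An illustrative before/after of the call-site change; the assumption that `ATen/CPUFunctions.h` is the header providing the `at::cpu::` entry points is mine, not stated in the PR.
```
#include <ATen/ATen.h>
#include <ATen/CPUFunctions.h> // assumed to declare at::cpu::mul after the port

at::Tensor mul_example(const at::Tensor& a, const at::Tensor& b) {
  // before the port: return at::native::mul(a, b);  // no longer generated
  return at::cpu::mul(a, b); // equivalent structured CPU entry point
}
```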
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27029822
Pulled By: bdhirsh
fbshipit-source-id: 6cc80de0dfccec304bf8e16a1823e733bed27bf4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54230
The comments in the code explain why this change is needed.
Reviewed By: bwasti
Differential Revision: D27145406
fbshipit-source-id: 2a61a42f22dfadfad59ee6c3be3e9e9d19e90ac3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52429
Implemented an out version of embedding_bag for Static Runtime.
Before: Milliseconds per iter: 1.15443. Iters per second: 866.226
After: Milliseconds per iter: 1.14791. Iters per second: 871.149
Test Plan:
buck test caffe2/test:nn
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D26089498
fbshipit-source-id: c9ba7068d5aa696c8f37a4846d8e80c6379538d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50217
Fusing small groups slows things down.
Test Plan: buck test //caffe2/test:static_runtime
Reviewed By: bertmaher
Differential Revision: D25643460
fbshipit-source-id: d2f39a4d612df3e1e29362abb23c2d997202f6ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564
Constructor logic was spread throughout InferenceModule and StaticRuntime. This diff unifies the two. After a lot of discussion on D25961626, it became apparent that `clone` is uglier than a cheap StaticRuntime.
This means the old StaticRuntime effectively becomes StaticModule, and the only code in the new StaticRuntime is the `run` functions.
```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```
Changelist:
- Rename InferenceModule StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D25592967
fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52343
aten::to returns `self` when the TensorOptions match and `copy` is set to false. For Static Runtime, we always copy. There isn't a separate op for a copying aten::to; it's the same function called with different arguments.
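A minimal sketch of the behavioral difference, using the dtype overload of `Tensor::to`; the real SR code handles the full argument set.
```
#include <ATen/ATen.h>

at::Tensor to_for_static_runtime(const at::Tensor& self, at::ScalarType dtype) {
  // Passing copy=true means the result never aliases `self`, even when the
  // TensorOptions already match; that keeps the output safe to manage/reuse.
  return self.to(dtype, /*non_blocking=*/false, /*copy=*/true);
}
```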
Test Plan:
On AdFinder local_ro:
Before:
0.896742
0.00824827 ms. 0.92773%. aten::to (5 nodes)
After:
0.88233
0.0056607 ms. 0.644675%. aten::to (5 nodes)
buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D26477980
fbshipit-source-id: 8e8448092adff38c141af1ce27a10acd39c07dd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684
With alias analysis we get much more powerful registration and we can start removing "native" and fallback interpreted implementations. `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations. Ideally every node will run out of place every time. Afaik, there's never a reason to disable it and we may want to remove that functionality.
This diff does introduce a "leak" in the memory management: containers are not cleaned up. This only happens when out variants are enabled.
Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled
Reviewed By: maratsubkhankulov, hlu1
Differential Revision: D26515801
fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060
Aliasing is currently mishandled in SR.
This diff fixes that issue entirely and allows us to avoid hard coded "view" registration. I'll remove the macro in a follow up diff.
However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias." Some care will need to be taken to actually make sure this is enforced going forward.
Benchmark results with this diff:
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04
$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: eellison
Differential Revision: D25581156
fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52225
Added an out version of sum for SR.
Test Plan:
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
sum node runtime before out version (1000 runs): 3558 us
sum node runtime after out version (1000 runs): 2173 us
Reviewed By: ajyu
Differential Revision: D26259744
fbshipit-source-id: bc6a1231353d79a96d45f1cdc676e78a92469d85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52237
Redo of D26331506 (4c58be4573), getting rid of `nodiscard`, which broke OSS CI.
- Clean up references of outputs, including Tuples/Lists, by using move semantics
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner (see the sketch below); check for the corner case of a Tuple/List element also being a graph input
- Modify unit tests to check the use_counts of outputs
- Clean up dead code. There is a bit of overlap with D25592967, but that shouldn't be a problem.
This diff does not try to fix the alias problem with the MemoryPlanner.
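A hedged sketch of the Tuple/List handling described above; the real MemoryPlanner bookkeeping (including the check that an element is not also a graph input) is more involved.
```
#include <ATen/core/ivalue.h>
#include <vector>

// Collect the elements of an output Tuple/List so their references can be
// released after inference instead of being kept alive by the runtime.
void collect_unmanaged(const c10::IValue& output,
                       std::vector<c10::IValue>& unmanaged_values) {
  if (output.isTuple()) {
    for (const c10::IValue& elem : output.toTuple()->elements()) {
      unmanaged_values.push_back(elem);
    }
  } else if (output.isList()) {
    const auto list = output.toList();
    for (size_t i = 0; i < list.size(); ++i) {
      unmanaged_values.push_back(list.get(i));
    }
  }
}
```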
Reviewed By: swolchok
Differential Revision: D26432539
fbshipit-source-id: e08990e4066c1ce69ad5274860851d012b7be411
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51991
- Clean up references of outputs, including Tuples/Lists, by using move semantics
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner; check for the corner case of a Tuple/List element also being a graph input
- Modify unit tests to check the use_counts of outputs
- Clean up dead code. There is a bit of overlap with D25592967, but that shouldn't be a problem.
This diff does not try to fix the alias problem with the MemoryPlanner.
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```
Reviewed By: bwasti
Differential Revision: D26333953
fbshipit-source-id: cadc0595ad6ab754c4f1f7a5a3733b2c16b3102f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51952
StaticRuntime should not hold owning refs of inputs after inference is finished. This diff adds a pass to clean them up and unit tests to enforce the check.
Will clean up output tensors in separate diffs.
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```
Reviewed By: bwasti
Differential Revision: D26331506
fbshipit-source-id: d395a295ada9de3033d0ea05d1dbab62d879a03b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug in the MemoryPlanner with regard to view ops that have out variants.
```
def forward(self, a: Tensor, shape: List[int]):
    b = a.reshape(shape)
    return b + b
```
In this case, if we replace reshape with the out variant, b is managed by the MemoryPlanner, and its storage is set to nullptr right after inference when opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.
To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249
- Add out variants for reshape and flatten. reshape and flatten only create tensor views when they can; when they can't, they do a copy. The out variant reuses the TensorImpl in both cases. The difference is that the TensorImpl is a view in the first case, but a normal TensorImpl in the second case.
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage, however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.
Reviewed By: ajyu
Differential Revision: D25992202
fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48718
This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check https://github.com/pytorch/rfcs/pull/9 for a mostly up-to-date high level description of what's going on here.
High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is that someone calls the subclassed set_output, which allocates the output, and then we call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip the parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs are no longer generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D25278031
Pulled By: ezyang
fbshipit-source-id: 57c43a6e5df21929b68964d485995fbbae4d1f7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896
The idea of the memory model is quite similar to that of BlackBoxPredictor; however, it's more complicated in PyTorch due to:
1) tensor views that share storage (with storage refcount bumps) but have different TensorImpls;
2) tensors that share the same TensorImpl and the same storage, with no refcount bump of the StorageImpl;
3) data types such as TensorList and Tuple that contain Tensors;
4) the need to support a mix of non-out/out variants while we move the aten ops to out variants.
As a result, I have to make the following adjustments:
1) remove tensors in output Tuples from internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variants, extract the StorageImpls from those Tensors, dedup them, remove the StorageImpls of output tensors, and get the final list of blobs for memory planning (see the sketch after this list);
3) during the clean_up_memory pass, clean up memory held by the StorageImpls as well as Tensors/Lists/Tuples in IValues that don't participate in memory planning to reduce overall memory usage
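A rough sketch of step 2 above (dedup candidate StorageImpls and exclude the ones backing graph outputs); the helper name and the surrounding plumbing are illustrative, not the actual MemoryPlanner code.
```
#include <ATen/ATen.h>
#include <unordered_set>
#include <vector>

std::vector<c10::StorageImpl*> collect_managed_storages(
    const std::vector<at::Tensor>& out_variant_outputs,
    const std::unordered_set<c10::StorageImpl*>& graph_output_storages) {
  std::unordered_set<c10::StorageImpl*> seen;
  std::vector<c10::StorageImpl*> managed;
  for (const at::Tensor& t : out_variant_outputs) {
    c10::StorageImpl* s = t.storage().unsafeGetStorageImpl();
    // dedup (views share a StorageImpl) and skip storages owned by graph outputs
    if (seen.insert(s).second && graph_output_storages.count(s) == 0) {
      managed.push_back(s);
    }
  }
  return managed;
}
```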
Risk:
PyTorch team is planning to deprecate the current resize_output api, which we do rely on. This is a pretty big risk.
https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23
Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```
|pt_cleanup_activations |pt_enable_out_variant |old ms/iter |new ms/iter |
|--- |--- |--- |--- |
|0 |0 |0.31873 |0.30228 |
|0 |1 |0.30018 |0.29184 |
|1 |0 |0.35246 |0.31895 |
|1 |1 |0.35742 |0.30417 |
Reviewed By: bwasti, raziel
Differential Revision: D24471854
fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219
- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
explicit InferenceModule(const torch::jit::Module& m);
explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
torch::jit::Module module;
std::shared_ptr<torch::jit::Graph> graph;
std::unique_ptr<c10::FunctionSchema> schema;
std::unordered_map<Value*, size_t> value_to_reg;
std::vector<size_t> input_regs; // inputs to the graph
std::vector<size_t> output_regs; // outputs of the graph
std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor, as well as the static runtime, and shared across threads. Then this is what's left inside the Static Runtime:
```
mutable std::vector<IValue> reg_;
// The nodes we need to run
std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations, which differ across threads during a run. `nodes_` holds the op nodes and input/output registers, and is the same across threads for now. We could potentially put other stateful data structures in it, so I kept it inside the static runtime. It could easily be moved into the `InferenceModule` if we decide not to put anything else into `ProcessedNode`.
- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.
- Integration with PyTorchPredictor. Added a lockfree stack in the PyTorchPredictor to hold all the static runtime instances. Benchmark shows that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.
This diff focuses on threading model only. Benchmarks will be separate.
Reviewed By: bwasti
Differential Revision: D24237078
fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308
This PR adds a hand-optimized version of the DeepAndWide model with the goal
of estimating the overheads of static runtime. While static runtime is
currently much faster than the existing JIT interpreter, it would be
useful to understand how close we are to an absolutely 0-overhead
system. Currently, this "ideal" implementation is 2x faster than the
static runtime on batchsize=1.
Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 4096K (x24)
L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1 59518 ns 59500 ns 10909
BM_deep_wide_base/8 74635 ns 74632 ns 9317
BM_deep_wide_base/20 82186 ns 82147 ns 9119
BM_deep_wide_fast/1 13851 ns 13851 ns 49825 << new
BM_deep_wide_fast/8 22497 ns 22497 ns 32089 << new
BM_deep_wide_fast/20 23868 ns 23841 ns 31184 << new
BM_deep_wide_jit_graph_executor/1 62786 ns 62786 ns 10835
BM_deep_wide_jit_graph_executor/8 76730 ns 76718 ns 7529
BM_deep_wide_jit_graph_executor/20 78886 ns 78883 ns 8769
BM_deep_wide_jit_profiling_executor/1 69504 ns 69490 ns 10309
BM_deep_wide_jit_profiling_executor/8 75718 ns 75715 ns 9199
BM_deep_wide_jit_profiling_executor/20 75364 ns 75364 ns 9010
BM_deep_wide_static/1 40324 ns 40318 ns 17232
BM_deep_wide_static/8 50327 ns 50319 ns 13335
BM_deep_wide_static/20 53075 ns 53071 ns 12855
BM_deep_wide_static_threaded/threads:8 6258 ns 49873 ns 14008
```
PS: The implementation could probably be optimized even more.
Differential Revision: D24300702
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Pulled By: ZolotukhinM
fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647
Nothing fancy, just a basic implementation of the graph executor without using a stack machine.
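For context, a minimal sketch of the register-based execution idea (the types here are illustrative, not the actual SR classes): every graph value gets a fixed register slot, and each node is a pre-resolved function that reads and writes those slots directly, so no operand stack is pushed or popped per op.
```
#include <ATen/core/ivalue.h>
#include <functional>
#include <vector>

struct ProcessedNodeSketch {
  // Reads its inputs from and writes its outputs to fixed register indices.
  std::function<void(std::vector<c10::IValue>&)> fn;
};

void run_graph(std::vector<ProcessedNodeSketch>& nodes,
               std::vector<c10::IValue>& registers) {
  for (auto& node : nodes) {
    node.fn(registers);
  }
}
```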
Reviewed By: bwasti
Differential Revision: D23208413
fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356