Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58829
- Delete copying and moving of MemoryPlanner.
- Remove `inline` in some of the member functions because member functions implemented in classes are inline by default.
- Clean up and update comments.
- Reorganize some code.
Reviewed By: edvgha
Differential Revision: D28555476
fbshipit-source-id: 7ea8efc0e2ed93a6788a742470b9e753a85df677
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57972
Allow static runtime to be on when glow is on. This should be fine as long as glow AOT has already been run.
Test Plan: Test on the replayer with the remote_other net. D28291326 fixes the remaining issue of removing loops from the remote_other model. Need to test on the regenerated model.
Reviewed By: hlu1
Differential Revision: D28275514
fbshipit-source-id: ee78972660dfdc3fcfb9af2cf7ebb19ee745a4f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521
When an op is added to static runtime, we manually check the schema (not with the JIT schema check, but with IValue::isTensor()/isInt(), etc.) to make sure it's one we support. If the schema doesn't match, SR throws an exception via TORCH_CHECK, which makes the entire graph invalid for SR.
This diff makes an op with an unsupported schema use the fallback path and go through the dispatcher instead:
```
if (node->kind() != prim::ListConstruct &&
    node->kind() != prim::TupleConstruct &&
    node->kind() != prim::DictConstruct &&
    node->kind() != prim::ListUnpack) {
  const Operator& op = node->getOperator();
  TORCH_CHECK(op.hasOperation());
  op_ = op.getOperation(node);
  VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
}
```
The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg variants are supported), can now run in static runtime via the fallback mode.
(Note: this ignores all push blocking failures!)
Reviewed By: ajyu
Differential Revision: D27531447
fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812
fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split will have as many outputs as ListUnpack.
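A minimal sketch of the fusion precondition, assuming standard JIT IR APIs; the helper name and the exact check are illustrative, not the actual fusion pass:
```
#include <torch/csrc/jit/ir/ir.h>

// Illustrative only: fb::equally_split can absorb ListUnpack when its single
// list output feeds exactly one ListUnpack node; after fusion the split node
// produces the unpacked tensors directly.
bool canFuseWithListUnpack(torch::jit::Node* split) {
  torch::jit::Value* list_out = split->output();
  return list_out->uses().size() == 1 &&
      list_out->uses()[0].user->kind() == c10::prim::ListUnpack;
}
```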
Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27974999
fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56839
Enable check_for_memory_leak at the end of StaticRuntime::benchmark so this code is exercised more often.
Test Plan: Checked with adindexer merge net model
Reviewed By: edvgha
Differential Revision: D27417911
fbshipit-source-id: 5248942dc439fcc7301ffb0005da76374939fa96
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56565
fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split will have as many outputs as ListUnpack.
Test Plan:
buck test caffe2/torch/fb/sparsenn:fb_operators_test
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27902824
fbshipit-source-id: 7855047c3bd46bbb74b7346ac384c70b6a3e1f46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447
MemoryPlanner shouldn't manage StorageImpls; instead, it should manage TensorImpls, because the StorageImpl inside a Tensor can change.
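For illustration, a minimal libtorch sketch (not the MemoryPlanner code) of why caching StorageImpl pointers is fragile: the TensorImpl is stable, but the StorageImpl it points to can be swapped out, e.g. by `set_()`:
```
#include <torch/torch.h>

int main() {
  at::Tensor t = torch::empty({4});
  auto* impl_before = t.unsafeGetTensorImpl();
  auto* storage_before = t.storage().unsafeGetStorageImpl();

  t.set_(torch::empty({8}));  // rebinds t to a different StorageImpl

  TORCH_CHECK(impl_before == t.unsafeGetTensorImpl());                // same TensorImpl
  TORCH_CHECK(storage_before != t.storage().unsafeGetStorageImpl());  // new StorageImpl
}
```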
Test Plan: CI
Reviewed By: ajyu
Differential Revision: D27840361
fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55811
- Added manage_graph_output_memory flag to opts (default false)
- Added checks for the flag dependencies among enable_out_variant, optimize_graph_output_memory, and optimize_memory (see the sketch after this list)
- Minor refactoring for readability
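A hedged sketch of the dependency check; the struct and the exact rules below are assumptions for illustration, not the actual StaticRuntime options code:
```
#include <c10/util/Exception.h>

// Illustrative only: memory optimizations only make sense when out variants
// are enabled, because only out-variant outputs are owned by the MemoryPlanner.
struct StaticRuntimeOptsSketch {
  bool enable_out_variant = true;
  bool optimize_memory = true;
  bool manage_graph_output_memory = false;  // new flag, default false
};

void validateOpts(const StaticRuntimeOptsSketch& opts) {
  TORCH_CHECK(
      !opts.optimize_memory || opts.enable_out_variant,
      "optimize_memory requires enable_out_variant");
  TORCH_CHECK(
      !opts.manage_graph_output_memory ||
          (opts.enable_out_variant && opts.optimize_memory),
      "manage_graph_output_memory requires enable_out_variant and optimize_memory");
}
```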
Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'
Reviewed By: hlu1
Differential Revision: D27573780
fbshipit-source-id: 28698657f686f27b8ad60e1276cdf17402d2cf91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438
The August 1x model has DictConstruct ops in the graph (P331168321).
These can easily be removed with a JIT pass, but to measure the improvement
and run the replayer with the model in the meantime, enable DictConstruct in static runtime.
Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```
```
TEST trace
Total num requests 100
Num exceptions 0
Latency us avg 180965
Latency us p25 89785
Latency us p50 131240
Latency us p75 146621
Latency us p90 158378
Latency us p95 166628
Latency us p99 1886680
Latency us p100 3803252
Server latency us avg 91554
Server latency us p25 51447
Server latency us p50 86371
Server latency us p75 95229
Server latency us p90 102706
Server latency us p95 116023
Server latency us p99 557017
Server latency us p100 716319
Num rankUnits avg 28
```
Reviewed By: hlu1
Differential Revision: D27236682
fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920
Fusing SigridTransforms + ListUnpack allows enabling the out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.
The speedup comes from three parts: 1) getting rid of memory allocation inside SigridTransforms itself, 2) the memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) getting rid of ListUnpack. However, for 3) we still need to pay the cost of constructing a `vector<Tensor>` for the outputs and a round of refcount bumps for all the output TensorImpls.
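For illustration, a small libtorch example (assumed, not the SigridTransforms code) of that leftover cost: packing outputs into a `vector<Tensor>` still costs one refcount bump per output, even though no tensor data is copied:
```
#include <torch/torch.h>
#include <vector>

int main() {
  at::Tensor out0 = torch::empty({1});
  TORCH_CHECK(out0.use_count() == 1);

  // Building the output vector copies the Tensor handle: one atomic refcount
  // bump per output, while the underlying storage is shared.
  std::vector<at::Tensor> outputs = {out0};
  TORCH_CHECK(out0.use_count() == 2);
}
```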
Reviewed By: ajyu
Differential Revision: D26220546
fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111
If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.
Reviewed By: edvgha
Differential Revision: D27036077
fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799
Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two-step graph rewrite process. If step 2 doesn't happen after step 1, we're stuck with a graph containing a `fb::placeholder` op that can't run. Step 3 is added to revert step 1, so the original graph is restored if any `fb::placeholder` op is left (see the sketch after this list).
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph and fusion can happen right afterwards.
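A hedged sketch of the step-3 revert, assuming the standard `torch::jit::SubgraphRewriter` utility; the pattern strings and argument counts are illustrative, not the real op signatures:
```
#include <torch/csrc/jit/passes/subgraph_rewrite.h>

void revertPlaceholders(std::shared_ptr<torch::jit::Graph>& graph) {
  // If step 2 never fired, rewrite any leftover fb::placeholder node back to
  // the original op so the graph remains runnable.
  torch::jit::SubgraphRewriter revert;
  revert.RegisterRewritePattern(
      R"IR(
        graph(%a, %b):
          %out = fb::placeholder(%a, %b)
          return (%out))IR",
      R"IR(
        graph(%a, %b):
          %out = fb::ClipRangesGatherRangesX2SigridHash(%a, %b)
          return (%out))IR");
  revert.runOnGraph(graph);
}
```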
Reviewed By: ajyu
Differential Revision: D26920008
fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564
Constructor logic was spread across InferenceModule and StaticRuntime. This diff unifies the two. After a lot of discussion on D25961626, it became apparent that `clone` is uglier than a cheap StaticRuntime.
This means StaticRuntime is effectively StaticModule and the only code in the new StaticRuntime is the `run` functions.
```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```
Changelist:
- Rename InferenceModule to StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D25592967
fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684
With alias analysis we get much more powerful registration, and we can start removing "native" and fallback interpreted implementations. `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations. Ideally, every node will run out of place every time; as far as I know, there's never a reason to disable it, and we may want to remove that functionality.
This diff does introduce a "leak" in the memory management: containers are not cleaned up. This only happens when out variants are enabled.
Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled
Reviewed By: maratsubkhankulov, hlu1
Differential Revision: D26515801
fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060
Aliasing is currently mishandled in SR.
This diff fixes that issue entirely and allows us to avoid hard-coded "view" registration. I'll remove the macro in a follow-up diff.
However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias." Some care will need to be taken to actually make sure this is enforced going forward.
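For illustration, a small libtorch example of the "sometimes alias" behavior that the new assumption rules out: `contiguous()` returns an alias when its input is already contiguous and a fresh copy otherwise:
```
#include <torch/torch.h>

int main() {
  at::Tensor x = torch::rand({2, 3});

  // Already contiguous: contiguous() is a no-op and aliases x.
  TORCH_CHECK(x.contiguous().is_alias_of(x));

  // Transposed (non-contiguous) input: contiguous() materializes a copy.
  TORCH_CHECK(!x.t().contiguous().is_alias_of(x));
}
```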
Benchmarks with this diff:
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04
$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: eellison
Differential Revision: D25581156
fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52237
Redo D26331506 (4c58be4573). Get rid of `nodiscard` which broke OSS CI.
- Clean up references of outputs, including Tuples/Lists, by using move semantics (see the sketch below)
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for corner case of Tuple/List element being inputs.
- Modify unit tests to check for use_counts of outputs
- Clean up dead code. A bit of overlap with D25592967, but it shouldn't be a problem.
This diff does not try to fix the alias problem with the MemoryPlanner.
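A minimal sketch (illustrative, not the actual StaticRuntime code) of what the move-semantics cleanup buys: moving an output IValue out of the runtime's slot hands the reference to the caller instead of leaving an extra owner behind:
```
#include <torch/torch.h>

int main() {
  at::Tensor t = torch::rand({2, 2});
  c10::IValue slot = t;               // the runtime's internal slot: use_count == 2
  TORCH_CHECK(t.use_count() == 2);

  c10::IValue copied = slot;          // copying out would leave three owners
  TORCH_CHECK(t.use_count() == 3);
  copied = c10::IValue();             // drop the copy again

  c10::IValue output = std::move(slot);  // moving transfers the slot's reference
  TORCH_CHECK(t.use_count() == 2);       // no stale owner kept inside the runtime
}
```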
Reviewed By: swolchok
Differential Revision: D26432539
fbshipit-source-id: e08990e4066c1ce69ad5274860851d012b7be411
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51991
- Clean up references of outputs, including Tuples/Lists, by using move semantics
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for corner case of Tuple/List element being inputs.
- Modify unit tests to check for use_counts of outputs
- Clean up dead code. A bit of overlap with D25592967, but it shouldn't be a problem.
This diff does not try to fix the alias problem with the MemoryPlanner.
(Note: this ignores all push blocking failures!)
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```
Reviewed By: bwasti
Differential Revision: D26333953
fbshipit-source-id: cadc0595ad6ab754c4f1f7a5a3733b2c16b3102f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51952
StaticRuntime should not hold owning refs of inputs after inference is finished. This diff adds a pass to clean them up and unit tests to enforce the check.
Will clean up output tensors in separate diffs.
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```
Reviewed By: bwasti
Differential Revision: D26331506
fbshipit-source-id: d395a295ada9de3033d0ea05d1dbab62d879a03b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug in the MemoryPlanner with regard to view ops with out variants.
```
def forward(self, a: Tensor, shape: List[int]):
    b = a.reshape(shape)
    return b + b
```
In this case, if we replace reshape with its out variant, b would be managed by the MemoryPlanner, and its storage would be set to nullptr right after inference if opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.
To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.
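A minimal libtorch illustration of the aliasing hazard (not the MemoryPlanner code itself): because `b` shares `a`'s storage, freeing `b`'s storage after inference would also invalidate `a`:
```
#include <torch/torch.h>

int main() {
  at::Tensor a = torch::arange(6);
  at::Tensor b = a.reshape({2, 3});  // a true view here: no copy is made

  // b aliases a, so the MemoryPlanner must leave b unmanaged; resetting b's
  // storage would silently invalidate the caller-owned input a.
  TORCH_CHECK(b.is_alias_of(a));
}
```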
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249
- Add out variants for reshape and flatten. reshape and flatten only create tensor views when they can; in cases where they can't, they do a copy. The out variant reuses the TensorImpl in both cases; the difference is that the TensorImpl is a view in the first case but a normal TensorImpl in the second (see the sketch after this list).
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage; however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.
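A short libtorch illustration (assumed, not the registry code) of the view-vs-copy distinction the registry has to account for:
```
#include <torch/torch.h>

int main() {
  at::Tensor x = torch::arange(6).reshape({2, 3});

  at::Tensor view = x.reshape({3, 2});   // contiguous input: a view, shares storage
  TORCH_CHECK(view.is_alias_of(x));

  at::Tensor copy = x.t().reshape({6});  // transposed input: cannot be a view
  TORCH_CHECK(!copy.is_alias_of(x));     // so the planner manages this output
}
```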
Reviewed By: ajyu
Differential Revision: D25992202
fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50050
Every node will now own its outputs.
I don't expect any big perf improvements from this diff; the only eliminated code is from deallocate_registers.
Largely, this is to enable more optimizations going forward.
Test Plan:
buck test mode/dev //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/test:static_runtime
Reviewed By: hlu1
Differential Revision: D25571181
fbshipit-source-id: 91fcfbd5cd968af963ba89c45656997650ca6d18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49340
This refines the fusion group to include only certain types of operations. We cannot safely handle "canRunNatively" types, and the memonger pass causes regressions on some internal models, so it was disabled (to be revisited with proper memory optimization once Tensor pools are implemented).
Test Plan:
```
buck test mode/no-gpu caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ZolotukhinM
Differential Revision: D25520105
fbshipit-source-id: add61d103e4f8b4615f5402e760893ef759a60a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49199
This should save us an extra round of dispatch for resize_,
resize_as_, detach_, and copy_, at the cost of disabling profiling and
tracing. I'm told that static runtime has its own per-op profiling and
we don't need tracing.
ghstack-source-id: 118348314
Test Plan:
Code review to confirm lack of need for profiling &
tracing, and that there isn't a different switch we should be using
instead.
Internal benchmarks -- seeing 11-12% improvement in overall runtime
Reviewed By: hlu1
Differential Revision: D25476819
fbshipit-source-id: 71e2c919b386b25c41084e2e4a54fe765a4f8f22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48616
This adds a couple of _out variants and registers them in the registry.
I also added the concept of "canReuse{Input,Output}" so that we can annotate tensors that are not optimizable (specifically, non-float tensors).
In the future we can change this (with D25062301).
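For context, a generic ATen illustration (not one of the ops added here) of what an _out variant changes: the functional form allocates a fresh output on every call, while the out form writes into a caller-provided tensor that a memory planner can reuse:
```
#include <torch/torch.h>

int main() {
  at::Tensor a = torch::rand({4});
  at::Tensor b = torch::rand({4});

  at::Tensor fresh = at::add(a, b);    // functional: allocates a new output each call

  at::Tensor out = torch::empty({4});  // preallocated once, e.g. by a memory planner
  at::add_out(out, a, b);              // out variant: writes in place, no new allocation
}
```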
After removing `RecordFunction`, we see these results:
```
BS=20
---
caffe2: 0.651617 ~ 0.666354
static runtime: 0.753481
pytorch: 0.866658
BS=1
---
caffe2: 0.0858684 ~ 0.08633
static runtime: 0.209897
pytorch: 0.232694
```
Test Plan: standard internal test of ads model against caffe2 reference (see the scripts in this quip: https://fb.quip.com/ztERAYjuzdlr)
Reviewed By: hlu1
Differential Revision: D25066823
fbshipit-source-id: 25ca181c62209a4c4304f7fe73832b13e314df80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48861
`std::function` already has an empty state; no need to wrap
it in `c10::Optional`.
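A tiny standalone illustration of the point (not the static runtime code): `std::function`'s own empty state already covers what the optional was used for.
```
#include <functional>
#include <iostream>

int main() {
  std::function<void()> fn;   // default-constructed: empty, no c10::optional needed
  if (!fn) {
    std::cout << "no op registered\n";
  }

  fn = [] { std::cout << "running op\n"; };
  if (fn) {
    fn();                     // the bool conversion is the "has value" check
  }
}
```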
ghstack-source-id: 117891382
Reviewed By: hlu1
Differential Revision: D25296912
fbshipit-source-id: 8291bcf11735d49db17415b5de915591ee65f781