Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505
The handling of `self` in static runtime was previously incorrect. This diff fixes that issue, since `self` is essential to prim::GetAttr/SetAttr: most of the time we are getting and setting attributes on `self`, the TorchScript module.
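To make the role of `self` concrete, here is a minimal illustrative sketch (the module below is hypothetical, not from the diff): attribute accesses on a scripted module are lowered to `prim::GetAttr` nodes whose input is `%self`, the TorchScript module.
```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(4))

    def forward(self, x):
        # self.weight is lowered to a prim::GetAttr node that takes %self as input
        return x + self.weight

m = torch.jit.script(M())
print(m.graph)  # contains: %weight = prim::GetAttr[name="weight"](%self)
```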
Reviewed By: ajyu
Differential Revision: D29350173
fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60503
Fixed a few issues in the static_runtime::to_copy impl:
- fix a bug with memory_format
- copy strides when appropriate; this is necessary to make sure the fbgemm path in the copy kernel gets hit
- fix the schema in the `ReplaceWithCopy` pass
- add registration of `static_runtime::to_copy.other`
Add more unit tests (an illustrative sketch of these cases follows the list):
- test dynamic shapes
- test strided input tensor to `aten::to`
- test alias case (same input/output)
- test `to.other`
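A minimal Python sketch of the kinds of `aten::to` inputs these tests exercise (illustrative only; the real tests live in the static runtime test suites):
```
import torch

x = torch.randn(2, 3, 4, 5)

# memory_format: dtype conversion plus channels_last output layout
y = x.to(torch.float64, memory_format=torch.channels_last)

# strided (non-contiguous) input produced by slicing
strided = x[:, :, ::2, :]
z = strided.to(torch.float64)

# alias case: no dtype/device change and copy=False returns the input itself
same = x.to(torch.float32)
assert same is x
```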
Reviewed By: ajyu
Differential Revision: D26838933
fbshipit-source-id: ec0d1a2deebe998fcfe8858e772e1ef429cb4522
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812
fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split will have as many outputs as ListUnpack.
Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27974999
fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56565
fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split will have as many outputs as ListUnpack.
Test Plan:
buck test caffe2/torch/fb/sparsenn:fb_operators_test
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27902824
fbshipit-source-id: 7855047c3bd46bbb74b7346ac384c70b6a3e1f46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56441
Since aten::to is overloaded, match the node against the schema before replacing it with static_runtime::to_copy.
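As an illustrative sketch of why schema matching is needed (not the pass itself): the two scripted functions below both produce `aten::to` nodes, but with different overloads (`to.dtype` vs. `to.other`), so a replacement pass has to look at the node's schema rather than just its kind.
```
import torch

@torch.jit.script
def to_dtype(x: torch.Tensor):
    return x.to(torch.float64)   # aten::to.dtype overload

@torch.jit.script
def to_other(x: torch.Tensor, y: torch.Tensor):
    return x.to(y)               # aten::to.other overload

# Both graphs contain aten::to nodes; only the schemas distinguish them.
print(to_dtype.graph)
print(to_other.graph)
```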
Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_model=/data/users/ansha/tmp/adfinder/210494966_0.predictor.disagg.remote_request_only --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_net2.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.remote_request_only.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/remote_ro_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=1 --benchmark_c2_predictor=0 --do_benchmark=0
```
```
Time per node type:
0.623426 ms. 55.337%. quantized::embedding_bag_4bit_rowwise_offsets (82 nodes)
0.331633 ms. 29.4367%. quantized::embedding_bag_byte_rowwise_offsets (71 nodes)
0.123163 ms. 10.9323%. aten::to (155 nodes)
0.038479 ms. 3.4155%. fb::lengths_to_offsets (155 nodes)
0.004169 ms. 0.370052%. aten::embedding_bag (2 nodes)
0.002549 ms. 0.226256%. static_runtime::to_copy (2 nodes)
0.002512 ms. 0.222972%. prim::TupleConstruct (1 nodes)
0.000667 ms. 0.0592048%. prim::dtype (2 nodes)
1.1266 ms. in Total
StaticRuntime setup time: 0.009605 ms
Memory allocation time: 0.001907 ms
Memory deallocation time: 0.032401 ms
Outputs deallocation time: 0.020876 ms
Total memory managed: 256 bytes
Total number of reused tensors: 159
```
I verified that all of the aten::to calls match, for the local, local_ro, and remote_ro nets in both opt and dev mode.
Only 2 of the calls are replaced because the other 155 have either the input or the output of the op returned as an external output. The situation is similar for the other instances of aten::to in the local and local_ro nets.
Reviewed By: hlu1
Differential Revision: D27872350
fbshipit-source-id: b72785ea2768be415faae2afcf9915aef07daec2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55337
`static_runtime::permute_copy` is in an fb-only folder. Because `caffe2/test/test_static_runtime.py` is in OSS, we can't load the fb-only operator library there. The workaround is to check at runtime whether the op is registered.
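A hedged sketch of what such a runtime check could look like from the Python test side (the helper below is illustrative, not the code in this diff):
```
import torch

def op_is_registered(qualified_name: str) -> bool:
    # e.g. "static_runtime::permute_copy"; resolving an unregistered op
    # through torch.ops raises, so treat that as "not registered".
    namespace, name = qualified_name.split("::")
    try:
        getattr(getattr(torch.ops, namespace), name)
        return True
    except (AttributeError, RuntimeError):
        return False

if not op_is_registered("static_runtime::permute_copy"):
    print("fb-only op not loaded; skip the permute_copy path")
```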
Test Plan:
This fixed two of the broken tests:
```
✓ Pass: caffe2/test:static_runtime - test_multihead_attention_layer (test_static_runtime.TestStaticModule) (10.316)
✓ Pass: caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule) (16.134)
```
Reviewed By: ajyu
Differential Revision: D27577066
fbshipit-source-id: ac87dcde71f0d5140ccde448bb49aaebbbb5908a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54657
The constraint checked in D27145406 (acf03b13f1) is too tight for the adindexer model; as a result, 5 ops (4 aten::narrow + 1 aten::permute) are not replaced with their copy versions, which caused a perf regression. This diff checks for inplace ops explicitly and only applies the input constraint to graphs that contain inplace ops.
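A hedged sketch of one way to detect inplace ops in a TorchScript graph (an illustrative heuristic, not necessarily the check used in this diff): TorchScript in-place variants conventionally end with an underscore.
```
import torch

def graph_has_inplace_ops(graph) -> bool:
    # Heuristic: in-place variants are named with a trailing '_', e.g. aten::add_.
    return any(node.kind().endswith("_") for node in graph.nodes())

@torch.jit.script
def f(x: torch.Tensor):
    x.add_(1)
    return x.narrow(0, 0, 1)

print(graph_has_inplace_ops(f.graph))  # True, because of aten::add_
```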
Test Plan: Contbuild
Reviewed By: ajyu
Differential Revision: D27253145
fbshipit-source-id: 23e2b1a018c84dd0fc2880fddd9c41bc0422b8eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54353
The current implementation of reshape/flatten is problematic because the output is sometimes a tensor view and sometimes not; it depends entirely on the graph IR and the input shapes. Replacing them with the copy versions makes the behavior deterministic: the output is always a new tensor.
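A small example of the nondeterminism being removed; whether `reshape` aliases its input depends on the input's strides:
```
import torch

x = torch.randn(4, 6)

# Contiguous input: reshape returns a view that shares storage with x.
v = x.reshape(6, 4)
print(v.data_ptr() == x.data_ptr())  # True

# Non-contiguous input (transpose): reshape has to copy.
c = x.t().reshape(24)
print(c.data_ptr() == x.data_ptr())  # False
```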
Reviewed By: ajyu, edvgha
Differential Revision: D26358525
fbshipit-source-id: ee7571317b061221a8d50083676cded388ce6f87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54230
The comments in the code explain why this change is needed.
Reviewed By: bwasti
Differential Revision: D27145406
fbshipit-source-id: 2a61a42f22dfadfad59ee6c3be3e9e9d19e90ac3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920
Fusing SigridTransforms + ListUnpack enables the out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.
The speedup comes from three parts: 1) getting rid of memory allocation inside SigridTransforms itself, 2) reducing memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) getting rid of ListUnpack. However, with 3) we still pay the cost of constructing a `vector<Tensor>` for the outputs and a round of refcount bumps for all the output TensorImpls.
Reviewed By: ajyu
Differential Revision: D26220546
fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111
If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.
Reviewed By: edvgha
Differential Revision: D27036077
fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799
Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two-step graph rewrite process. If step 2 doesn't happen after step 1, we're stuck with a graph containing an `fb::placeholder` op that can't run. Step 3 is added to revert step 1 so that the original graph is restored if any `fb::placeholder` op is left.
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph, and fusion can happen right afterwards (a sketch of running these two passes follows).
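A minimal sketch of running these two passes on a scripted graph, assuming the internal `torch._C._jit_pass_*` bindings are available (the function below is illustrative):
```
import torch

@torch.jit.script
def f(x: torch.Tensor):
    return (x + 1.0) * (x + 1.0)

g = f.graph
# Fold constant expressions, then deduplicate Constant nodes and move them
# to the front of the graph so a later fusion pass can match its pattern.
torch._C._jit_pass_constant_propagation(g)
torch._C._jit_pass_constant_pooling(g)
print(g)
```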
Reviewed By: ajyu
Differential Revision: D26920008
fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53634
Make the op signature of `static_runtime::to_copy` consistent with that of native_functions.yaml so it works with 2-5 args:
```
- func: to.dtype(Tensor self, ScalarType dtype, bool non_blocking=False, bool copy=False, MemoryFormat? memory_format=None) -> Tensor
variants: method
device_guard: False
```
(Note: this ignores all push blocking failures!)
Reviewed By: ajyu
Differential Revision: D26906726
fbshipit-source-id: b9203eb23619aba42b1bfed1a077401f9fe2ddf0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53323
While optimizing inline cvr local ro, we found a pattern where gather_ranges is used redundantly. Fuse this pattern to remove the unnecessary gather_ranges.
Reviewed By: hlu1
Differential Revision: D26659824
fbshipit-source-id: 6420afa3a2c3272c57706b70c2e9834014d6c32d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060
Aliasing is currently mishandled in SR.
This diff fixes that issue entirely and allows us to avoid hard-coded "view" registration. I'll remove the macro in a follow-up diff.
However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias." Some care will be needed to make sure this is enforced going forward.
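For a concrete picture of a "sometimes alias" op (illustrative, not one of the SR-registered ops): `contiguous()` returns its input unchanged when the tensor is already contiguous, and a fresh copy otherwise.
```
import torch

x = torch.randn(2, 3)

# Already contiguous: contiguous() returns the same tensor object (an alias).
assert x.contiguous() is x

# Non-contiguous: contiguous() materializes a copy with its own storage.
y = x.t()
z = y.contiguous()
assert z is not y and z.data_ptr() != y.data_ptr()
```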
Benchmark results with this diff:
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04
$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: eellison
Differential Revision: D25581156
fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50249
Add a few new patterns for `ConcatAddMulReplaceNanClip`
Reviewed By: houseroad
Differential Revision: D25843126
fbshipit-source-id: d4987c716cf085f2198234651a2214591d8aacc0