Commit Graph

31 Commits

Hao Lu
ccd0977060 [Static Runtime] Support prim::GetAttr/SetAttr (#61505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505

The handling of `self` in Static Runtime was previously incorrect. This diff fixes that issue; `self` is essential to prim::GetAttr/SetAttr, since most of the time we are getting and setting attributes on `self`, the TorchScript module.
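For intuition, here is a hypothetical sketch (plain Python, illustrative names only) of the kind of module this matters for: every attribute read or write on `self` lowers to a prim::GetAttr or prim::SetAttr node when such a module is scripted.

```python
# Toy module illustrating why `self` matters: every read/write of an
# attribute on `self` becomes a prim::GetAttr / prim::SetAttr node
# when a module like this is scripted. Names are illustrative only.
class Counter:
    def __init__(self):
        self.count = 0  # attribute stored on self

    def forward(self, x):
        self.count = self.count + 1  # GetAttr then SetAttr on self
        return x + self.count
```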

Reviewed By: ajyu

Differential Revision: D29350173

fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
2021-07-10 14:06:06 -07:00
Hao Lu
bfe03120ee [PyPer] Fix schema of fb::equally_split (#60852)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60852

Reviewed By: ajyu

Differential Revision: D29423425

fbshipit-source-id: 4525db1f268ca65d6851a5ec846a6ae2f710ec6b
2021-06-30 03:18:15 -07:00
Hao Lu
1e31d26b1d [Static Runtime] Fix bugs in static_runtime::to_copy (#60503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60503

Fixed a few issues in the `static_runtime::to_copy` implementation:
- fix a bug with memory_format
- copy strides when appropriate; this is necessary to make sure the fbgemm path in the copy kernel gets hit
- fix the schema in the `ReplaceWithCopy` pass
- register `static_runtime::to_copy.other`

Add more unit tests:
- test dynamic shapes
- test strided input tensor to `aten::to`
- test alias case (same input/output)
- test `to.other`
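To see why the alias case needs its own test, here is a toy, non-PyTorch model of `aten::to.dtype` semantics (dicts stand in for tensors; the semantics are an assumption based on the schema):

```python
def to_dtype(tensor, dtype, non_blocking=False, copy=False, memory_format=None):
    # Toy model of aten::to.dtype: when the dtype already matches and
    # copy=False, the input is returned unchanged, i.e. the output
    # aliases the input. static_runtime::to_copy must handle exactly
    # this aliasing case (same input/output).
    if tensor["dtype"] == dtype and not copy:
        return tensor
    return dict(tensor, dtype=dtype)
```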

Reviewed By: ajyu

Differential Revision: D26838933

fbshipit-source-id: ec0d1a2deebe998fcfe8858e772e1ef429cb4522
2021-06-23 19:57:17 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Edvard Ghazaryan
a09bbe73fd static runtime support for fb::equally_split (#56812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812

fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split has as many outputs as ListUnpack did.
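A rough sketch of this fusion in plain Python (op semantics are assumed from the names):

```python
def equally_split(xs, num_splits):
    # Toy stand-in for fb::equally_split: a single output holding
    # num_splits equally sized chunks, normally followed by ListUnpack.
    n = len(xs) // num_splits
    return [xs[i * n:(i + 1) * n] for i in range(num_splits)]

def equally_split_fused(xs, num_splits):
    # Fused form: the ListUnpack disappears and the op itself exposes
    # as many outputs as ListUnpack had.
    return tuple(equally_split(xs, num_splits))
```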

Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op

Reviewed By: hlu1

Differential Revision: D27974999

fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
2021-04-26 20:18:09 -07:00
Xiaodong Wang
ed0a0c3578 Revert D27902824: static runtime support for fb::equally_split
Test Plan: revert-hammer

Differential Revision:
D27902824 (a4e47ea152)

Original commit changeset: 7855047c3bd4

fbshipit-source-id: a46834418ce98826871cd604d1a01f0ff8f23d7f
2021-04-23 10:03:12 -07:00
Edvard Ghazaryan
a4e47ea152 static runtime support for fb::equally_split (#56565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56565

fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split has as many outputs as ListUnpack did.

Test Plan:
buck test caffe2/torch/fb/sparsenn:fb_operators_test

buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op

Reviewed By: hlu1

Differential Revision: D27902824

fbshipit-source-id: 7855047c3bd46bbb74b7346ac384c70b6a3e1f46
2021-04-23 00:12:54 -07:00
Ansha Yu
e0be76fb9b [static_runtime] fix num args for to_copy (#56441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56441

Since aten::to is overloaded, match schema to replace it with static_runtime::to_copy

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_model=/data/users/ansha/tmp/adfinder/210494966_0.predictor.disagg.remote_request_only --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_net2.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.remote_request_only.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/remote_ro_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=1 --benchmark_c2_predictor=0 --do_benchmark=0
```

```
Time per node type:
       0.623426 ms.     55.337%. quantized::embedding_bag_4bit_rowwise_offsets (82 nodes)
       0.331633 ms.    29.4367%. quantized::embedding_bag_byte_rowwise_offsets (71 nodes)
       0.123163 ms.    10.9323%. aten::to (155 nodes)
       0.038479 ms.     3.4155%. fb::lengths_to_offsets (155 nodes)
       0.004169 ms.   0.370052%. aten::embedding_bag (2 nodes)
       0.002549 ms.   0.226256%. static_runtime::to_copy (2 nodes)
       0.002512 ms.   0.222972%. prim::TupleConstruct (1 nodes)
       0.000667 ms.  0.0592048%. prim::dtype (2 nodes)
         1.1266 ms. in Total
StaticRuntime setup time: 0.009605 ms
Memory allocation time: 0.001907 ms
Memory deallocation time: 0.032401 ms
Outputs deallocation time: 0.020876 ms
Total memory managed: 256 bytes
Total number of reused tensors: 159
```

I verified all of the aten::to matches for the local, local_ro, and remote_ro nets in both opt and dev mode.

Only 2 of the calls are replaced because the other 155 have either the input or the output of the op returned as an external output. The situation is similar for the other instances of aten::to in the local and local_ro nets.
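The replacement condition described here can be sketched as follows (a simplification; the real pass consults the graph and the AliasDb):

```python
def can_replace_with_copy(node, graph_outputs):
    # Skip the copy variant when the op's input or output escapes the
    # graph as an external output: the copy changes aliasing behavior
    # that callers could observe.
    return (node["input"] not in graph_outputs
            and node["output"] not in graph_outputs)
```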

Reviewed By: hlu1

Differential Revision: D27872350

fbshipit-source-id: b72785ea2768be415faae2afcf9915aef07daec2
2021-04-21 16:31:36 -07:00
Hao Lu
c3d0607ffa [Static Runtime] Make sure the copy version of the op exist in ReplaceWithCopy (#55337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55337

`static_runtime::permute_copy` is in fb-only folder. Because `caffe2/test/test_static_runtime.py` is in OSS, we can't load the fb-only operator library. The workaround is to check at runtime whether the op is registered or not.
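The workaround amounts to a runtime registry lookup, roughly like this (names hypothetical, not the real registration API):

```python
OP_REGISTRY = {}

def register_op(name, fn):
    OP_REGISTRY[name] = fn

def maybe_replace_with_copy(op_name):
    # Only rewrite to the copy variant when it is actually registered in
    # this build (fb-only ops are absent from OSS builds).
    copy_name = f"static_runtime::{op_name}_copy"
    return copy_name if copy_name in OP_REGISTRY else f"aten::{op_name}"
```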

Test Plan:
This fixed two of the broken tests:
```
    ✓ Pass: caffe2/test:static_runtime - test_multihead_attention_layer (test_static_runtime.TestStaticModule) (10.316)
    ✓ Pass: caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule) (16.134)
```

Reviewed By: ajyu

Differential Revision: D27577066

fbshipit-source-id: ac87dcde71f0d5140ccde448bb49aaebbbb5908a
2021-04-06 04:25:04 -07:00
Ansha Yu
d49beba071 [pyper] out variant of sigrid_transforms_torch_bind + ListUnpack (#54761)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54761

Test Plan:
Regen adindexer model that uses sigrid_transforms_torch_bind: /mnt/public/ansha/adindexer/merge20210323/adindexer_pt_traced_merge.pt

```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=adindexer_pt_traced_merge.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge2/container_precomputation_bs1.pt --iters=30000 --warmup_iters=300000 --num_threads=1 --pred_net=c2_net_merge.pb --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1
```

Before ms/iter: 0.0647056
After ms/iter: 0.0581197

Reviewed By: hlu1

Differential Revision: D27239617

fbshipit-source-id: dffe6cbaf3a783c41605c97c5947a36e3b1b1f3b
2021-03-30 10:54:44 -07:00
Hao Lu
46e7f6773f [Static Runtime] Check for inplace ops explicitly in ReplaceWithCopy (#54657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54657

The constraint checked in D27145406 (acf03b13f1) is too tight for the adindexer model; as a result, 5 ops (4 aten::narrow + 1 aten::permute) are not replaced with their copy versions, which caused a perf regression. This diff checks for inplace ops explicitly and only applies the input constraint to graphs with inplace ops.
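A sketch of an explicit inplace-op check (simplified; real detection inspects the graph nodes, and the convention is an assumption here):

```python
def graph_has_inplace_ops(node_kinds):
    # TorchScript convention: in-place variants end in "_", e.g.
    # aten::add_ vs aten::add. Apply the stricter input constraint
    # only when such ops are present.
    return any(kind.split("::")[-1].endswith("_") for kind in node_kinds)
```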

Test Plan: Contbuild

Reviewed By: ajyu

Differential Revision: D27253145

fbshipit-source-id: 23e2b1a018c84dd0fc2880fddd9c41bc0422b8eb
2021-03-30 07:08:00 -07:00
Hao Lu
8294bff20d [StaticRuntime] Copy version of reshape/flatten (#54353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54353

The current implementation of reshape/flatten is problematic because the output is sometimes a tensor view and sometimes not, depending entirely on the graph IR and the input shapes. Replacing them with the copy versions makes the behavior deterministic: the output is always a new tensor.
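The same view-or-copy ambiguity is easy to demonstrate with NumPy's reshape (analogous behavior, not the Static Runtime code path):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
v = a.reshape(3, 2)          # contiguous input: reshape returns a view
assert np.shares_memory(a, v)

t = a.T                      # transpose makes the array non-contiguous
c = t.reshape(6)             # here reshape has no choice but to copy
assert not np.shares_memory(t, c)
```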

Reviewed By: ajyu, edvgha

Differential Revision: D26358525

fbshipit-source-id: ee7571317b061221a8d50083676cded388ce6f87
2021-03-20 16:55:30 -07:00
Hao Lu
acf03b13f1 [Static Runtime] Check for number of uses of op inputs > 1 in ReplaceWithCopy (#54230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54230

The comments in the code explain why this change is needed.

Reviewed By: bwasti

Differential Revision: D27145406

fbshipit-source-id: 2a61a42f22dfadfad59ee6c3be3e9e9d19e90ac3
2021-03-18 20:02:20 -07:00
Hao Lu
ca429fedd3 [StaticRuntime] Fuse SigridTransforms + ListUnpack (#53920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920

Fusing SigridTransforms + ListUnpack allows for enabling out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.

The speedup comes from three parts: 1) getting rid of memory allocation inside SigridTransforms itself, 2) reducing memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) getting rid of ListUnpack. However, for 3) we still need to pay the cost of constructing a `vector<Tensor>` for the outputs and a round of refcount bumps for all the output TensorImpls.

Reviewed By: ajyu

Differential Revision: D26220546

fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
2021-03-17 19:58:02 -07:00
Hao Lu
04d5278cb6 [Static Runtime] Only run ReplaceWithCopy pass when enable_out_variant is true (#54111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111

If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.

Reviewed By: edvgha

Differential Revision: D27036077

fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
2021-03-16 22:06:33 -07:00
Hao Lu
4932342363 [Static Runtime] Fix bug in ClipRangesGatherRangesX2SigridHash (#53799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799

Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two step graph rewrite process. If step 2 doesn't happen after step 1, then we're stuck with a graph with a `fb::placeholder` op that can't run. Step 3 is added to revert step 1 so we restore the original graph if there's any `fb::placeholder` op left.
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph and fusion can happen right afterwards.
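The three-step rewrite described in the first bullet can be sketched like this (toy op names; the real passes operate on graph IR, not lists):

```python
def fuse_with_revert(nodes):
    # Step 1: mark candidate ops with a placeholder.
    step1 = ["fb::placeholder" if n == "fb::clip_ranges" else n for n in nodes]
    # Step 2: fuse placeholder + follower when the full pattern is present.
    step2, i = [], 0
    while i < len(step1):
        if (step1[i] == "fb::placeholder" and i + 1 < len(step1)
                and step1[i + 1] == "fb::sigrid_hash"):
            step2.append("fb::clip_ranges_sigrid_hash")
            i += 2
        else:
            step2.append(step1[i])
            i += 1
    # Step 3: revert any leftover placeholders so the graph stays runnable.
    return ["fb::clip_ranges" if n == "fb::placeholder" else n for n in step2]
```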

Reviewed By: ajyu

Differential Revision: D26920008

fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
2021-03-12 13:15:44 -08:00
Hao Lu
409a76f72c [Static Runtime] Fix bug in static_runtime::to_copy (#53634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53634

Make the op signature of `static_runtime::to_copy` consistent with that of native_functions.yaml so it works with 2-5 args:
```
- func: to.dtype(Tensor self, ScalarType dtype, bool non_blocking=False, bool copy=False, MemoryFormat? memory_format=None) -> Tensor
  variants: method
  device_guard: False
```
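In Python terms, the defaults in the schema are what make call sites with 2 through 5 arguments all valid (a toy signature mirroring the yaml above, not the real binding):

```python
def to_dtype(self_, dtype, non_blocking=False, copy=False, memory_format=None):
    # Mirrors to.dtype's schema: only self and dtype are required; the
    # remaining three arguments have defaults, so 2-5 args are all legal.
    return (self_, dtype, non_blocking, copy, memory_format)
```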

(Note: this ignores all push blocking failures!)

Reviewed By: ajyu

Differential Revision: D26906726

fbshipit-source-id: b9203eb23619aba42b1bfed1a077401f9fe2ddf0
2021-03-09 16:26:34 -08:00
Hao Lu
2dffb4e38e [Static Runtime] Back out D26659824 (#53570)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53570

Reviewed By: allwu

Differential Revision: D26899099

fbshipit-source-id: 87c6d74a91c102e6b0487f9e6f49394755792a94
2021-03-08 22:14:15 -08:00
Ansha Yu
7c0a4e78ca [static runtime] convert to->to_copy (#53524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53524

Add to->to_copy to the ReplaceWithCopy pass so that it plays well with the AliasDb.

Test Plan:
Run bench with CastedBatchOneHot fusion off
(https://www.internalfb.com/intern/diff/view-version/123230476/),
on adindexer and adfinder models

Reviewed By: hlu1

Differential Revision: D26887050

fbshipit-source-id: 3f2fb9e27783bcdeb91c8b4181575f059317aff1
2021-03-08 16:19:03 -08:00
Hao Lu
35364c3641 [static runtime] Enable ClipRangesGatherRangesX2SigridHash fusion for SigridHashPrecompute (#53324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53324

Reviewed By: maratsubkhankulov

Differential Revision: D26833478

fbshipit-source-id: 55ab63faf5b535f2acd2ec5dc5721f5b692832d7
2021-03-04 22:01:08 -08:00
Marat Subkhankulov
47dbdfcfe9 [Static Runtime] remove redundant gather_ranges when fusing (#53323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53323

While optimizing the inline_cvr local_ro model, we found a pattern where gather_ranges is used redundantly. Fuse this pattern to remove the unnecessary gather_ranges.

Reviewed By: hlu1

Differential Revision: D26659824

fbshipit-source-id: 6420afa3a2c3272c57706b70c2e9834014d6c32d
2021-03-04 18:14:29 -08:00
Ansha Yu
9b7396e7e2 [pyper] casted_batch_one_hot_lengths with 4-arg to (#53215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53215

The current 5-arg version doesn't fuse for the inline_cvr model instances.

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_weights=/data/users/ansha/tmp/adfinder/models/c2_local_weight_data.pb --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_local_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_local_net.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/local_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --compare_results=1 --iters=2000 --warmup_iters=2000 --num_threads=1 --do_profile=1 --do_benchmark --benchmark_c2_predictor=1
```

```
Time per node type:
        3.82029 ms.    71.8523%. aten::addmm (9 nodes)
       0.926298 ms.    17.4219%. fb::sigrid_transforms (1 nodes)
       0.122496 ms.    2.30391%. fb::clip_ranges_gather (210 nodes)
        0.11985 ms.    2.25416%. fb::clip_ranges_gather_sigrid_hash_precompute_v3 (54 nodes)
      0.0973721 ms.    1.83138%. aten::sigmoid (3 nodes)
      0.0352937 ms.   0.663807%. fb::batch_box_cox (1 nodes)
       0.034759 ms.    0.65375%. prim::TupleConstruct (1 nodes)
      0.0222235 ms.   0.417981%. aten::index (4 nodes)
      0.0215314 ms.   0.404964%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0199659 ms.   0.375521%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0192885 ms.   0.362779%. aten::cat (2 nodes)
      0.0181285 ms.   0.340963%. aten::mul (2 nodes)
      0.0109381 ms.   0.205725%. aten::pow (1 nodes)
      0.0091476 ms.   0.172049%. prim::ListConstruct (8 nodes)
     0.00794012 ms.   0.149338%. aten::relu (2 nodes)
     0.00668873 ms.   0.125802%. prim::ListUnpack (1 nodes)
     0.00569745 ms.   0.107158%. aten::to (4 nodes)
     0.00527507 ms.   0.099214%. aten::narrow_copy (4 nodes)
     0.00483189 ms.  0.0908785%. fb::lengths_range (4 nodes)
     0.00399056 ms.  0.0750548%. aten::logit (1 nodes)
     0.00324574 ms.  0.0610462%. fb::gather_ranges (4 nodes)
     0.00161166 ms.  0.0303122%. fb::clip_ranges (2 nodes)
        5.31686 ms. in Total
StaticRuntime setup time: 0.016461 ms
Memory allocation time: 0.00220284 ms
Memory deallocation time: 0.118134 ms
Outputs deallocation time: 0.0674883 ms
Total memory managed: 716352 bytes
Total number of reused tensors: 22
```

Reviewed By: hlu1

Differential Revision: D26789260

fbshipit-source-id: 52adadddaae29a946de8a58bd592c06e6d4ce8c8
2021-03-03 16:41:39 -08:00
Hao Lu
d90d7245f4 [PyPer] Optimize sigrid_hash (#53065)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53065

Reviewed By: ajyu

Differential Revision: D26563512

fbshipit-source-id: a1a76f92ba500605ab2e3370737bd3965d81deb1
2021-03-03 01:31:53 -08:00
Bram Wasti
2d67b76fa6 [static runtime] Add Alias analysis to Memory Management/Planning (#50060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060

Aliasing is currently mishandled in SR.

This diff fixes that issue entirely and allows us to avoid hard-coded "view" registration. I'll remove the macro in a follow-up diff.

However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias."  Some care will need to be taken to actually make sure this is enforced going forward.
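A minimal sketch of why the no-sometimes-alias assumption matters to the planner: greedy buffer reuse by tensor liveness is only safe if tensors that share a buffer never alias (toy planner, not the SR implementation):

```python
def plan_buffers(lifetimes):
    # lifetimes: {tensor: (first_use, last_use)} in execution order.
    # Greedy reuse: a buffer freed before a tensor's first use may be
    # reassigned to it. This is only sound if tensors sharing a buffer
    # never alias, hence the "no sometimes-alias" assumption above.
    free, buf_of, active, next_id = [], {}, [], 0
    for t, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for rec in list(active):
            last_use, freed_buf = rec
            if last_use < start:
                free.append(freed_buf)
                active.remove(rec)
        if free:
            buf = free.pop()
        else:
            buf = next_id
            next_id += 1
        buf_of[t] = buf
        active.append((end, buf))
    return buf_of
```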

This diff
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04

$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36

$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5

$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: eellison

Differential Revision: D25581156

fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
2021-03-02 09:53:32 -08:00
Ansha Yu
ec42c2d89c [pyper] fuse clip_ranges+gather_ranges (#52461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52461

TODO: add tests

Test Plan:
Before:
7.10623 ms/iter
0.0849279 ms.    1.21267%. fb::clip_ranges (212 nodes)
0.254071 ms.    3.62783%. fb::gather_ranges (214 nodes)

After:
7.0654 ms/iter
0.300174 ms.     4.2739%. fb::clip_ranges_gather (264 nodes)

Reviewed By: hlu1

Differential Revision: D26523903

fbshipit-source-id: 9b2420c522232659b198cbe250d4454bbcd9297b
2021-03-01 14:50:39 -08:00
Shijun Kong
158c98ae49 Add new patterns for ConcatAddMulReplaceNaNClip (#50249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50249

Add a few new patterns for `ConcatAddMulReplaceNaNClip`
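The fused op's semantics, as suggested by its name, can be sketched in scalar form (the real op operates on tensors; the op order and the NaN replacement value of 0.0 are assumptions from the name):

```python
import math

def concat_add_mul_replacenan_clip(tensor_lists, add, mul, low, high):
    # Assumed pipeline from the op name: concat -> add -> mul ->
    # replace NaN (with 0.0 here, an assumption) -> clip to [low, high].
    out = []
    for xs in tensor_lists:                      # concat
        for x in xs:
            y = (x + add) * mul                  # add, mul
            if math.isnan(y):                    # replace NaN
                y = 0.0
            out.append(min(max(y, low), high))   # clip
    return out
```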

Reviewed By: houseroad

Differential Revision: D25843126

fbshipit-source-id: d4987c716cf085f2198234651a2214591d8aacc0
2021-01-12 10:20:01 -08:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Edvard Ghazaryan
a111a9291c added fuse_op and list_construct - list_unpack pass
Summary: Added fuse_op and a list_construct/list_unpack pass

Test Plan:
jit_graph_opt_test.py
jit_graph_optimizer_test.cc
sparsenn_fused_operator_test.py

Reviewed By: qizzzh

Differential Revision: D25715079

fbshipit-source-id: fa976be53135a83f262b8f2e2eaedadd177f46c4
2020-12-29 12:29:53 -08:00
Ansha Yu
c18af03a41 [pt] fuse ClipRangesGatherSigridHash (#49181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49181

Fuse ClipRangesGatherSigridHash

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adindexer/merge/traced_merge_dper_fixes.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge/container_precomputation_bs1.pt --iters=30000 --warmup_iters=10000  --num_threads=1 --pred_net=/data/users/ansha/tmp/adindexer/precomputation_merge_net.pb --c2_inputs=/data/users/ansha/tmp/adindexer/merge/c2_inputs_precomputation_bs1.pb --c2_sigrid_transforms_opt=1 --c2_use_memonger=1 --c2_weights=/data/users/ansha/tmp/adindexer/merge/c2_weights_precomputation.pb --pt_enable_static_runtime --pt_cleanup_activations=true --pt_enable_out_variant=true --do_profile --compare_results
```

Verify op fused:
Node #3: 0.00104917 ms/iter, %173 : Tensor, %174 : Tensor = fb::clip_ranges_gather_sigrid_hash_offsets(%75, %76, %39, %40, %41, %38, %26)

Before: 0.0919786
After: 0.0911792

Reviewed By: hlu1

Differential Revision: D25468225

fbshipit-source-id: 36bd91c140eaa57cb42cdaad46d878b94f162a9d
2020-12-17 00:42:46 -08:00
Hao Lu
8954eb3f72 [StaticRuntime] Fusion pass for ClipRanges/GatherRanges/LengthsToOffsets (#49113)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49113

Reviewed By: ajyu

Differential Revision: D25388512

fbshipit-source-id: 3daa5b9387a3a10b6c220688df06540c4d844aea
2020-12-16 00:34:49 -08:00
Ansha Yu
07978bd62e [static runtime] fuse inference ops (1) (#48948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48948

Fuse inference ops for the following inside static runtime:
ConcatAddMulReplaceNaNClip
CastedBatchOneHotLengths
ConcatBatchMatMulBatchGather

TODO:
1. add unit tests
2. add more restrictions on the graph transform (e.g. check inputs, check outputs not used elsewhere)

Test Plan:
Run adindexer model with static runtime and fusion; check ops
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adindexer/traced_precomputation2.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge/container_precomputation_bs1.pt --iters=3000 --warmup_iters=10000  --num_threads=1 --pred_net=/data/users/ansha/tmp/adindexer/precomputation_merge_net.pb --c2_inputs=/data/users/ansha/tmp/adindexer/merge/c2_inputs_precomputation_bs1.pb --c2_sigrid_transforms_opt=1 --c2_use_memonger=1 --c2_weights=/data/users/ansha/tmp/adindexer/merge/c2_weights_precomputation.pb --pt_enable_static_runtime
```
transformed model graph contains the fused ops: P151559641

Results before fusion: P151567611
Results after fusion: P151566783 (8% speedup for bs=20, 14% speedup for bs=1)

Reviewed By: hlu1

Differential Revision: D25224107

fbshipit-source-id: c8442e8ceb018879c61ce564367b1c1b9412601b
2020-12-08 05:54:49 -08:00