Commit Graph

88 Commits

Author SHA1 Message Date
Hao Lu
1f83d8eec2 [Static Runtime] Return nullptr if the number of input args doesn't match (#58018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58018

- Add checks for the number of input args and return nullptr if it doesn't match. This is intended to make Static Runtime more robust, so that an op schema change is less likely to break things. Imagine a new arg is added to an op, or a new overload is added that has this extra arg: SR would simply ignore the added arg. If the arg has a default value, SR would run the model with the default value and give you wrong results, which can be hard to track down.
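
The check described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual Static Runtime API: a lookup refuses to return a native implementation when the node's input count doesn't match the arity the implementation was written for, so the caller falls back instead of silently running with a stale default.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a registered native op implementation (illustrative only).
using OpImpl = void (*)();

inline void dummyImpl() {}

// Hypothetical lookup: if a newly added schema arg makes node_num_inputs
// differ from the arity the impl expects, return nullptr so the caller can
// fall back rather than ignore the extra argument.
OpImpl lookupOutVariant(std::size_t impl_arity, std::size_t node_num_inputs) {
  if (node_num_inputs != impl_arity) {
    return nullptr;
  }
  return &dummyImpl;
}
```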

Reviewed By: ajyu

Differential Revision: D28047955

fbshipit-source-id: 01067059edd5cfea80c4ee121829f7733b11f601
2021-05-11 16:30:45 -07:00
Edvard Ghazaryan
dd876120f9 Out version for aten::repeat (#57683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57683

Support aten::repeat for static runtime
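
The "out variant" pattern this commit adds for aten::repeat can be sketched roughly as below. This is a hedged illustration with hypothetical helpers (not the actual ATen signatures): the out variant writes into a caller-provided output that the memory planner can reuse, instead of allocating a fresh tensor on every call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Out variant: the caller owns (and may reuse) the output buffer.
void repeat_out(const std::vector<int>& in, int times, std::vector<int>& out) {
  out.resize(in.size() * static_cast<std::size_t>(times));
  for (int t = 0; t < times; ++t)
    for (std::size_t i = 0; i < in.size(); ++i)
      out[t * in.size() + i] = in[i];
}

// Functional form, expressed on top of the out variant.
std::vector<int> repeat(const std::vector<int>& in, int times) {
  std::vector<int> out;
  repeat_out(in, times, out);
  return out;
}
```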

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D27639482

fbshipit-source-id: e6e706cb1d52750eea74f19536245f0484e945e6
2021-05-11 13:21:58 -07:00
Hao Lu
8bbe383877 [Static Runtime] Fix bugs in logit (#57578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57578

The original impl in SR assumes that eps is a constant, which is true most of the time. However, it could be a graph input as well. This diff fixes the issue. Unit tests are added as well.
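
The fix amounts to reading eps from the node's inputs on every invocation rather than caching it at graph construction. A toy scalar sketch of the logit-with-eps semantics (illustrative only, not the SR kernel):

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// logit(x) = log(x / (1 - x)); when eps is given, x is first clamped to
// [eps, 1 - eps]. eps may be a baked-in constant or a runtime graph input,
// so it is passed per call here instead of being captured once.
double logit(double x, std::optional<double> eps) {
  if (eps.has_value()) {
    const double lo = *eps, hi = 1.0 - *eps;
    x = std::fmin(std::fmax(x, lo), hi);
  }
  return std::log(x / (1.0 - x));
}
```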

Reviewed By: edvgha

Differential Revision: D28207975

fbshipit-source-id: 9a10dec159f3804e43ef74aaa20c3ec6c79548c9
2021-05-05 23:38:15 -07:00
Mikhail Zolotukhin
9e7814d539 Reland: [StaticRuntime] Use NNC's call_raw API to reduce call overheads. (#57553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57553

Relanding #57329 (the entire stack), which was reverted because I forgot
to guard a new test with `ifdef LLVM`.
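
The overhead reduction named in the title can be illustrated as follows. This is a contrast sketch with hypothetical types (not NNC's real signatures): a generic call packs every argument into boxed values on each invocation, while a call_raw-style entry point takes preresolved raw pointers, so the hot loop skips the per-call packing.

```cpp
#include <cassert>
#include <vector>

// Kernel calling convention: an array of raw argument pointers.
static double sum2(void** args) {
  return *static_cast<double*>(args[0]) + *static_cast<double*>(args[1]);
}

struct Boxed { double d; };  // stand-in for a generic boxed value

// Generic path: unboxes arguments into raw pointers on every call.
double callBoxed(double (*fn)(void**), std::vector<Boxed> args) {
  std::vector<void*> raw;
  raw.reserve(args.size());
  for (auto& v : args) raw.push_back(&v.d);
  return fn(raw.data());
}

// call_raw-style path: pointers were resolved once, outside the loop.
double callRaw(double (*fn)(void**), void** preresolved) {
  return fn(preresolved);
}

// Tiny driver exercising both paths.
double demo() {
  double x = 1.0, y = 2.0;
  void* p[2] = {&x, &y};
  return callBoxed(sum2, {{x}, {y}}) + callRaw(sum2, p);
}
```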

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28195048

Pulled By: ZolotukhinM

fbshipit-source-id: 50052a2f20f84940b83d1dd1241c8659ff06e014
2021-05-05 09:11:38 -07:00
Hao Lu
5439977352 [Static Runtime] Revamp op schema check (#57521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521

When an op is added to Static Runtime, we manually check the schema (not with the JIT schema check, but with IValue::isTensor()/isInt() etc.) to make sure it's the one we support. If the schema doesn't match, SR throws an exception via TORCH_CHECK, which makes the entire graph invalid for SR.

This diff makes ops with unsupported schemas take the fallback path and go through the dispatcher instead:

```
  if (node->kind() != prim::ListConstruct &&
      node->kind() != prim::TupleConstruct &&
      node->kind() != prim::DictConstruct && node->kind() != prim::ListUnpack) {
    const Operator& op = node->getOperator();
    TORCH_CHECK(op.hasOperation());
    op_ = op.getOperation(node);
    VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
  }
```

The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg forms are supported), can now run in Static Runtime in fallback mode.

(Note: this ignores all push blocking failures!)

Reviewed By: ajyu

Differential Revision: D27531447

fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
2021-05-04 02:48:04 -07:00
Mike Ruberry
3315f14280 Revert D28110358: [StaticRuntime] Use NNC's call_raw API to reduce call overheads.
Test Plan: revert-hammer

Differential Revision:
D28110358 (400ca7677c)

Original commit changeset: 94b87130a1ff

fbshipit-source-id: 246c0e54b02443c039105f48c4c419fe281150fc
2021-05-01 15:35:34 -07:00
Mikhail Zolotukhin
400ca7677c [StaticRuntime] Use NNC's call_raw API to reduce call overheads. (#57329)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57329

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28110358

Pulled By: ZolotukhinM

fbshipit-source-id: 94b87130a1ffdb4acf171ddcea3895e8a75c34ac
2021-04-30 15:26:20 -07:00
Edvard Ghazaryan
e62cdae469 Static Runtime support for aten::matmul (#57291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57291

aten::matmul support for static runtime

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_Binary_MatMul

Reviewed By: hlu1

Differential Revision: D28099671

fbshipit-source-id: 784035060c8c24953df47ca4227d2bca5094da22
2021-04-30 10:49:55 -07:00
Edvard Ghazaryan
b3e1802439 Static runtime support for fb::expand_dims (#57282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57282

Added support for fb::expand_dims for SR.

Test Plan:
buck test caffe2/torch/fb/sparsenn:gpu_test -- test_expand_dims

buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators

Reviewed By: hlu1

Differential Revision: D28043049

fbshipit-source-id: 01f59db7b507f027b220f044d6ff23602adbdb06
2021-04-29 22:40:56 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Ansha Yu
46321cb937 [static runtime] binding for aten::norm_out (#56636)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56636

Test Plan:
Test that it runs on the aug_1x model, which has aten::norm, and verify that the JIT and SR results match
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=1 --adsfinder_compatibility=1
```

```
Time per node type:
        1.53159 ms.    35.8619%. fb::sigrid_transforms_torch_bind (1 nodes)
         0.9481 ms.    22.1996%. aten::linear (6 nodes)
       0.704806 ms.    16.5029%. aten::argmin (1 nodes)
       0.252252 ms.    5.90643%. aten::matmul (1 nodes)
       0.140869 ms.    3.29842%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
       0.100014 ms.    2.34181%. fb::clip_ranges_gather (263 nodes)
      0.0880838 ms.    2.06247%. aten::sub (1 nodes)
      0.0553556 ms.    1.29614%. aten::repeat (1 nodes)
      0.0438464 ms.    1.02665%. aten::norm (1 nodes)
      0.0395956 ms.   0.927124%. fb::batch_box_cox (1 nodes)
       0.035834 ms.   0.839045%. aten::__getitem__ (506 nodes)
      0.0345233 ms.   0.808357%. prim::TupleUnpack (254 nodes)
      0.0316876 ms.   0.741959%. aten::sigmoid (2 nodes)
      0.0293246 ms.   0.686629%. aten::mul (3 nodes)
      0.0287696 ms.   0.673635%. fb::offsets_to_ranges (253 nodes)
      0.0242373 ms.   0.567511%. aten::pow (1 nodes)
      0.0224204 ms.    0.52497%. fb::simple_embedding_bag_sum (3 nodes)
      0.0200074 ms.   0.468469%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0190264 ms.   0.445499%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0167253 ms.    0.39162%. prim::TupleConstruct (1 nodes)
      0.0164962 ms.   0.386255%. aten::sum (3 nodes)
      0.0158986 ms.   0.372262%. prim::DictConstruct (2 nodes)
      0.0109372 ms.   0.256093%. aten::div (1 nodes)
     0.00910563 ms.   0.213207%. prim::ListConstruct (4 nodes)
     0.00876917 ms.   0.205328%. static_runtime::to_copy (8 nodes)
     0.00822567 ms.   0.192603%. fb::sigrid_hash_precompute (1 nodes)
     0.00622559 ms.   0.145771%. aten::contiguous (1 nodes)
     0.00460064 ms.   0.107723%. aten::narrow (4 nodes)
     0.00297164 ms.  0.0695804%. static_runtime::reshape_copy (2 nodes)
     0.00287099 ms.  0.0672237%. aten::logit (1 nodes)
     0.00277557 ms.  0.0649894%. aten::add (1 nodes)
     0.00264978 ms.  0.0620441%. aten::clamp_min (1 nodes)
     0.00215832 ms.  0.0505366%. aten::relu (1 nodes)
     0.00213779 ms.   0.050056%. fb::gather_ranges (4 nodes)
     0.00195846 ms.  0.0458571%. aten::full (1 nodes)
     0.00177333 ms.  0.0415222%. aten::stack (1 nodes)
     0.00147449 ms.   0.034525%. aten::size (3 nodes)
    0.000762524 ms.  0.0178544%. aten::expand_as (1 nodes)
    0.000757406 ms.  0.0177345%. fb::clip_ranges (2 nodes)
    0.000614798 ms.  0.0143954%. fb::lengths_to_offsets (3 nodes)
    0.000407952 ms. 0.00955212%. static_runtime::flatten_copy (1 nodes)
    0.000159918 ms. 0.00374445%. prim::device (1 nodes)
         4.2708 ms. in Total
StaticRuntime setup time: 0.000407 ms
Memory allocation time: 0.0089714 ms
Memory deallocation time: 0.0592135 ms
Outputs deallocation time: 0.0458097 ms
Total memory managed: 947328 bytes
Total number of reused tensors: 28
```

Reviewed By: hlu1

Differential Revision: D27922070

fbshipit-source-id: 538b39b7fff0638fc994b7983bf32d9e9f15d016
2021-04-28 08:44:10 -07:00
Edvard Ghazaryan
cea265b8d8 Support layer_norm for static runtime (#56444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56444

Added out version for layer_norm

Test Plan:
buck test caffe2/aten:math_kernel_test -- NativeLayerNorm

buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D27873846

fbshipit-source-id: 53ee9fec4ff9a4e78198b031e86b5afd013626dd
2021-04-27 12:28:37 -07:00
Ansha Yu
e909ad2dc4 [static runtime] binding for aten::argmin_out (#56638)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56638

Test Plan:
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=1 --adsfinder_compatibility=1
```

```
Time per node type:
        1.55901 ms.    35.3486%. fb::sigrid_transforms_torch_bind (1 nodes)
       0.986321 ms.    22.3636%. aten::linear (6 nodes)
       0.722277 ms.    16.3767%. aten::argmin (1 nodes)
       0.256231 ms.    5.80971%. aten::matmul (1 nodes)
       0.149653 ms.    3.39319%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
       0.105381 ms.    2.38938%. fb::clip_ranges_gather (263 nodes)
      0.0911405 ms.    2.06649%. aten::sub (1 nodes)
      0.0605429 ms.    1.37273%. aten::repeat (1 nodes)
      0.0456569 ms.    1.03521%. aten::norm (1 nodes)
      0.0421855 ms.   0.956501%. fb::batch_box_cox (1 nodes)
      0.0370142 ms.   0.839249%. aten::__getitem__ (506 nodes)
      0.0359091 ms.   0.814193%. prim::TupleUnpack (254 nodes)
      0.0338332 ms.   0.767123%. aten::sigmoid (2 nodes)
      0.0315159 ms.   0.714582%. aten::mul (3 nodes)
      0.0297553 ms.   0.674662%. fb::offsets_to_ranges (253 nodes)
      0.0279913 ms.   0.634666%. fb::simple_embedding_bag_sum (3 nodes)
      0.0233521 ms.   0.529478%. aten::pow (1 nodes)
       0.021296 ms.    0.48286%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0208991 ms.   0.473861%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0183163 ms.   0.415298%. aten::sum (3 nodes)
      0.0164318 ms.   0.372571%. prim::DictConstruct (2 nodes)
      0.0160191 ms.   0.363211%. prim::TupleConstruct (1 nodes)
      0.0126953 ms.   0.287849%. aten::div (1 nodes)
      0.0106084 ms.   0.240532%. static_runtime::to_copy (8 nodes)
      0.0092846 ms.   0.210516%. prim::ListConstruct (4 nodes)
     0.00916175 ms.   0.207731%. fb::sigrid_hash_precompute (1 nodes)
     0.00707015 ms.   0.160307%. aten::contiguous (1 nodes)
     0.00621954 ms.    0.14102%. aten::narrow (4 nodes)
     0.00302307 ms.  0.0685441%. aten::add (1 nodes)
     0.00290759 ms.  0.0659259%. aten::full (1 nodes)
     0.00283369 ms.  0.0642503%. aten::logit (1 nodes)
     0.00239244 ms.  0.0542455%. fb::gather_ranges (4 nodes)
     0.00220181 ms.  0.0499232%. aten::relu (1 nodes)
     0.00211563 ms.  0.0479691%. static_runtime::reshape_copy (2 nodes)
      0.0020059 ms.  0.0454812%. aten::stack (1 nodes)
     0.00186682 ms.  0.0423276%. aten::clamp_min (1 nodes)
     0.00172548 ms.   0.039123%. aten::size (3 nodes)
      0.0011853 ms.  0.0268751%. aten::expand_as (1 nodes)
    0.000881784 ms.  0.0199933%. fb::clip_ranges (2 nodes)
    0.000835602 ms.  0.0189462%. fb::lengths_to_offsets (3 nodes)
    0.000444376 ms.  0.0100757%. static_runtime::flatten_copy (1 nodes)
    0.000197078 ms. 0.00446848%. prim::device (1 nodes)
         4.4104 ms. in Total
StaticRuntime setup time: 0.000702 ms
Memory allocation time: 0.00943333 ms
Memory deallocation time: 0.062704 ms
Outputs deallocation time: 0.0477171 ms
Total memory managed: 831744 bytes
Total number of reused tensors: 31
W0421 14:53:04.841202 929500 PyTorchPredictorContainer.cpp:200] Failed to load metadata file
W0421 14:53:04.841315 929500 PyTorchPredictorContainer.cpp:457] Couldn't find model param config file xl_model_weights/model_param_config
I0421 14:53:04.841341 929500 PyTorchPredictorBenchLib.cpp:137] PyTorch predictor: number of prediction threads 1
I0421 14:53:04.971776 929500 PyTorchPredictorBenchLib.cpp:230] PyTorch run finished. Milliseconds per iter: 130.423. Iters per second: 7.66736
I0421 14:53:05.122830 929500 PtVsBlackBoxPredictorBenchLib.cpp:132] Finished comparing PT static runtime and jit interpreter results
```

Reviewed By: hlu1

Differential Revision: D27923172

fbshipit-source-id: 05cf5497fb6ac39dd3ff24f583607a3dff8cae95
2021-04-26 17:28:42 -07:00
Ansha Yu
0888b8726a [static runtime] binding for aten::clamp_min_out (#56635)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56635

Test Plan:
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=0 --adsfinder_compatibility=1
```

```
Time per node type:
        1.50885 ms.    36.0064%. fb::sigrid_transforms_torch_bind (1 nodes)
        0.92296 ms.    22.0251%. aten::linear (6 nodes)
       0.695455 ms.     16.596%. aten::argmin (1 nodes)
       0.237931 ms.    5.67787%. aten::matmul (1 nodes)
       0.141634 ms.    3.37989%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
      0.0925469 ms.     2.2085%. fb::clip_ranges_gather (263 nodes)
      0.0886556 ms.    2.11563%. aten::sub (1 nodes)
      0.0549624 ms.     1.3116%. aten::repeat (1 nodes)
       0.043996 ms.     1.0499%. aten::norm (1 nodes)
      0.0403472 ms.   0.962826%. fb::batch_box_cox (1 nodes)
      0.0371137 ms.   0.885664%. aten::sigmoid (2 nodes)
       0.035054 ms.   0.836512%. aten::__getitem__ (506 nodes)
      0.0338771 ms.   0.808427%. prim::TupleUnpack (254 nodes)
      0.0288516 ms.   0.688502%. aten::mul (3 nodes)
       0.026195 ms.   0.625106%. fb::offsets_to_ranges (253 nodes)
      0.0243627 ms.   0.581381%. aten::pow (1 nodes)
      0.0210347 ms.   0.501962%. fb::simple_embedding_bag_sum (3 nodes)
      0.0195358 ms.   0.466192%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0193484 ms.   0.461722%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0164265 ms.   0.391995%. aten::sum (3 nodes)
      0.0157266 ms.   0.375291%. prim::TupleConstruct (1 nodes)
      0.0156512 ms.   0.373493%. prim::DictConstruct (2 nodes)
      0.0114427 ms.   0.273062%. aten::div (1 nodes)
     0.00884876 ms.   0.211163%. static_runtime::to_copy (8 nodes)
     0.00864496 ms.   0.206299%. prim::ListConstruct (4 nodes)
     0.00803458 ms.   0.191734%. fb::sigrid_hash_precompute (1 nodes)
     0.00619933 ms.   0.147938%. aten::contiguous (1 nodes)
     0.00462827 ms.   0.110447%. aten::narrow (4 nodes)
     0.00293105 ms.  0.0699452%. aten::logit (1 nodes)
     0.00287083 ms.  0.0685082%. static_runtime::reshape_copy (2 nodes)
     0.00250605 ms.  0.0598032%. aten::add (1 nodes)
     0.00217015 ms.  0.0517875%. fb::gather_ranges (4 nodes)
     0.00202655 ms.  0.0483607%. aten::full (1 nodes)
     0.00200812 ms.  0.0479208%. aten::relu (1 nodes)
     0.00175433 ms.  0.0418644%. aten::stack (1 nodes)
     0.00174899 ms.   0.041737%. aten::clamp_min (1 nodes)
     0.00134367 ms.  0.0320646%. aten::size (3 nodes)
    0.000811416 ms.  0.0193633%. fb::clip_ranges (2 nodes)
    0.000801096 ms.   0.019117%. aten::expand_as (1 nodes)
    0.000541452 ms.   0.012921%. fb::lengths_to_offsets (3 nodes)
    0.000477838 ms.  0.0114029%. static_runtime::flatten_copy (1 nodes)
    0.000192906 ms. 0.00460342%. prim::device (1 nodes)
        4.19049 ms. in Total
StaticRuntime setup time: 0.000408 ms
Memory allocation time: 0.00895982 ms
Memory deallocation time: 0.0587527 ms
Outputs deallocation time: 0.0430985 ms
Total memory managed: 947328 bytes
Total number of reused tensors: 28
W0421 14:33:55.610956 836281 PyTorchPredictorContainer.cpp:200] Failed to load metadata file
W0421 14:33:55.611043 836281 PyTorchPredictorContainer.cpp:457] Couldn't find model param config file xl_model_weights/model_param_config
I0421 14:33:55.611063 836281 PyTorchPredictorBenchLib.cpp:137] PyTorch predictor: number of prediction threads 1
I0421 14:33:55.736069 836281 PyTorchPredictorBenchLib.cpp:230] PyTorch run finished. Milliseconds per iter: 124.995. Iters per second: 8.0003
I0421 14:33:55.874794 836281 PtVsBlackBoxPredictorBenchLib.cpp:132] Finished comparing PT static runtime and jit interpreter results
```

Reviewed By: hlu1

Differential Revision: D27922570

fbshipit-source-id: 095aa9bd0c425bc73eb48841653441d5c9e45744
2021-04-26 16:39:12 -07:00
Hao Lu
e810bed63f [Static Runtime] Clean up op implementations (#56841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56841

- Move arg checks outside the lambda so we can perform these checks at Static Runtime initialization time
- Use `optional` where possible
- Support the `to.other` overload, the 5-arg form of `torch.to`
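
The first bullet can be sketched like this (hypothetical names, not the real registration API): the node's schema is validated once, at Static Runtime initialization, and the returned lambda has no checks left on the per-inference hot path. An empty function signals "unsupported schema" so the caller can fall back.

```cpp
#include <cassert>
#include <functional>

struct Node { int num_inputs; };          // illustrative stand-in for a JIT node
using Kernel = std::function<int(int, int)>;

Kernel makeAddKernel(const Node& n) {
  if (n.num_inputs != 2) {                // initialization-time check, once per graph
    return nullptr;                       // caller falls back to the interpreter
  }
  return [](int a, int b) { return a + b; };  // hot path: no checks inside
}
```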

Test Plan:
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test -- --run-disabled
```

Reviewed By: edvgha

Differential Revision: D27933176

fbshipit-source-id: 49d6249c8784c44146461e286e7a301596172d7c
2021-04-26 15:37:39 -07:00
Ansha Yu
690c8b434f [static runtime] binding for aten::sub_out (#56656)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56656

Test Plan:
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=1 --adsfinder_compatibility=1
```
```
Time per node type:
        1.85766 ms.    35.7817%. fb::sigrid_transforms_torch_bind (1 nodes)
         1.1238 ms.    21.6464%. aten::linear (6 nodes)
       0.858116 ms.    16.5288%. aten::argmin (1 nodes)
       0.334183 ms.    6.43694%. aten::matmul (1 nodes)
       0.173697 ms.     3.3457%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
       0.118827 ms.    2.28881%. fb::clip_ranges_gather (263 nodes)
       0.101348 ms.    1.95215%. aten::sub (1 nodes)
      0.0748209 ms.    1.44118%. aten::repeat (1 nodes)
      0.0582576 ms.    1.12214%. aten::norm (1 nodes)
      0.0474353 ms.   0.913686%. fb::batch_box_cox (1 nodes)
      0.0457588 ms.   0.881393%. aten::__getitem__ (506 nodes)
      0.0435175 ms.   0.838222%. prim::TupleUnpack (254 nodes)
      0.0425416 ms.   0.819425%. aten::sigmoid (2 nodes)
      0.0383822 ms.   0.739308%. fb::offsets_to_ranges (253 nodes)
      0.0330187 ms.   0.635996%. aten::mul (3 nodes)
       0.027534 ms.   0.530352%. fb::simple_embedding_bag_sum (3 nodes)
      0.0274914 ms.   0.529532%. aten::pow (1 nodes)
      0.0236733 ms.   0.455989%. fb::casted_batch_one_hot_lengths (1 nodes)
       0.023348 ms.   0.449723%. fb::concat_add_mul_replacenan_clip (1 nodes)
      0.0193511 ms.   0.372735%. aten::sum (3 nodes)
      0.0188839 ms.   0.363737%. prim::DictConstruct (2 nodes)
      0.0183191 ms.   0.352858%. prim::TupleConstruct (1 nodes)
      0.0119029 ms.    0.22927%. aten::div (1 nodes)
      0.0103263 ms.   0.198902%. static_runtime::to_copy (8 nodes)
     0.00977658 ms.   0.188314%. prim::ListConstruct (4 nodes)
     0.00924042 ms.   0.177986%. fb::sigrid_hash_precompute (1 nodes)
     0.00692162 ms.   0.133322%. aten::contiguous (1 nodes)
     0.00567485 ms.   0.109307%. aten::narrow (4 nodes)
     0.00362285 ms.  0.0697823%. aten::logit (1 nodes)
     0.00329995 ms.  0.0635627%. aten::add (1 nodes)
     0.00285633 ms.  0.0550178%. aten::full (1 nodes)
     0.00268469 ms.  0.0517118%. fb::gather_ranges (4 nodes)
     0.00248577 ms.  0.0478803%. aten::stack (1 nodes)
     0.00241782 ms.  0.0465715%. aten::relu (1 nodes)
     0.00233674 ms.  0.0450096%. aten::clamp_min (1 nodes)
     0.00222238 ms.  0.0428068%. static_runtime::reshape_copy (2 nodes)
     0.00171177 ms.  0.0329716%. aten::size (3 nodes)
     0.00120008 ms.  0.0231155%. aten::expand_as (1 nodes)
     0.00112628 ms.  0.0216942%. fb::clip_ranges (2 nodes)
     0.00103193 ms.  0.0198768%. fb::lengths_to_offsets (3 nodes)
    0.000598624 ms.  0.0115305%. static_runtime::flatten_copy (1 nodes)
    0.000236196 ms. 0.00454954%. prim::device (1 nodes)
        5.19164 ms. in Total
StaticRuntime setup time: 0.000868 ms
Memory allocation time: 0.0109619 ms
Memory deallocation time: 0.071791 ms
Outputs deallocation time: 0.0560187 ms
Total memory managed: 1232320 bytes
Total number of reused tensors: 32
W0421 17:40:52.053653 1746499 PyTorchPredictorContainer.cpp:200] Failed to load metadata file
W0421 17:40:52.053757 1746499 PyTorchPredictorContainer.cpp:457] Couldn't find model param config file xl_model_weights/model_param_config
I0421 17:40:52.053779 1746499 PyTorchPredictorBenchLib.cpp:137] PyTorch predictor: number of prediction threads 1
I0421 17:40:52.185776 1746499 PyTorchPredictorBenchLib.cpp:230] PyTorch run finished. Milliseconds per iter: 131.985. Iters per second: 7.57661
I0421 17:40:52.337853 1746499 PtVsBlackBoxPredictorBenchLib.cpp:132] Finished comparing PT static runtime and jit interpreter results
```

Reviewed By: hlu1

Differential Revision: D27929253

fbshipit-source-id: 5a7984ba3ce2d6d4bce0a0ab6c5e09e8c037b44e
2021-04-22 08:40:35 -07:00
Ansha Yu
81b59211d4 [static runtime] binding for aten::div_out (#56653)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56653

Test Plan:
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=500 --warmup_iters=500 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=1 --do_profile=1 --adsfinder_compatibility=1
```

```
Time per node type:
        1.48563 ms.    35.9861%. fb::sigrid_transforms_torch_bind (1 nodes)
        0.92385 ms.    22.3783%. aten::linear (6 nodes)
       0.681066 ms.    16.4974%. aten::argmin (1 nodes)
       0.239311 ms.    5.79679%. aten::matmul (1 nodes)
       0.140157 ms.    3.39501%. fb::clip_ranges_gather_sigrid_hash_v3 (77 nodes)
      0.0951568 ms.    2.30497%. fb::clip_ranges_gather (263 nodes)
      0.0835801 ms.    2.02455%. aten::sub (1 nodes)
       0.054081 ms.       1.31%. aten::repeat (1 nodes)
      0.0424465 ms.    1.02818%. aten::norm (1 nodes)
      0.0389049 ms.   0.942389%. fb::batch_box_cox (1 nodes)
      0.0346992 ms.   0.840514%. aten::__getitem__ (506 nodes)
      0.0341335 ms.    0.82681%. prim::TupleUnpack (254 nodes)
      0.0306839 ms.   0.743252%. aten::sigmoid (2 nodes)
      0.0280489 ms.   0.679426%. aten::mul (3 nodes)
      0.0265321 ms.   0.642684%. fb::offsets_to_ranges (253 nodes)
      0.0207622 ms.    0.50292%. aten::pow (1 nodes)
      0.0202067 ms.   0.489465%. fb::simple_embedding_bag_sum (3 nodes)
      0.0195497 ms.    0.47355%. fb::casted_batch_one_hot_lengths (1 nodes)
      0.0184351 ms.   0.446551%. fb::concat_add_mul_replacenan_clip (1 nodes)
       0.016382 ms.    0.39682%. aten::sum (3 nodes)
      0.0158651 ms.   0.384299%. prim::TupleConstruct (1 nodes)
      0.0150918 ms.   0.365567%. prim::DictConstruct (2 nodes)
     0.00858005 ms.   0.207833%. aten::div (1 nodes)
     0.00810684 ms.   0.196371%. fb::sigrid_hash_precompute (1 nodes)
     0.00796325 ms.   0.192893%. static_runtime::to_copy (8 nodes)
     0.00782038 ms.   0.189432%. prim::ListConstruct (4 nodes)
      0.0057504 ms.   0.139291%. aten::contiguous (1 nodes)
      0.0044688 ms.   0.108247%. aten::narrow (4 nodes)
     0.00284054 ms.   0.068806%. aten::logit (1 nodes)
     0.00265049 ms.  0.0642024%. aten::add (1 nodes)
     0.00216242 ms.    0.05238%. aten::full (1 nodes)
     0.00207732 ms.  0.0503187%. aten::relu (1 nodes)
     0.00198412 ms.   0.048061%. fb::gather_ranges (4 nodes)
     0.00176954 ms.  0.0428632%. aten::stack (1 nodes)
     0.00175913 ms.  0.0426112%. static_runtime::reshape_copy (2 nodes)
      0.0016996 ms.  0.0411692%. aten::clamp_min (1 nodes)
     0.00128528 ms.  0.0311331%. aten::size (3 nodes)
    0.000849156 ms.   0.020569%. aten::expand_as (1 nodes)
    0.000757672 ms.   0.018353%. fb::clip_ranges (2 nodes)
    0.000596224 ms.  0.0144423%. fb::lengths_to_offsets (3 nodes)
    0.000442632 ms.  0.0107218%. static_runtime::flatten_copy (1 nodes)
    0.000196158 ms. 0.00475151%. prim::device (1 nodes)
        4.12833 ms. in Total
StaticRuntime setup time: 0.000451 ms
Memory allocation time: 0.0089336 ms
Memory deallocation time: 0.0578358 ms
Outputs deallocation time: 0.0431742 ms
Total memory managed: 947328 bytes
Total number of reused tensors: 31
W0421 16:56:34.220682 1522800 PyTorchPredictorContainer.cpp:200] Failed to load metadata file
W0421 16:56:34.220772 1522800 PyTorchPredictorContainer.cpp:457] Couldn't find model param config file xl_model_weights/model_param_config
I0421 16:56:34.220791 1522800 PyTorchPredictorBenchLib.cpp:137] PyTorch predictor: number of prediction threads 1
I0421 16:56:34.366667 1522800 PyTorchPredictorBenchLib.cpp:230] PyTorch run finished. Milliseconds per iter: 145.863. Iters per second: 6.85573
I0421 16:56:34.514202 1522800 PtVsBlackBoxPredictorBenchLib.cpp:132] Finished comparing PT static runtime and jit interpreter results
```

Reviewed By: hlu1

Differential Revision: D27927731

fbshipit-source-id: 595883a31ba0cadf6449799d47bf2294a1d05b41
2021-04-22 01:38:24 -07:00
Ansha Yu
7ae45403a1 [static runtime] support aten::__getitem__ natively (#55310)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55310

Test Plan:
Run on the dper generated local/local_ro model
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.local.local_ro.pt --pt_inputs=/data/users/ansha/tmp/adfinder/aug_1x/210616848_0.predictor.disagg.input_data.container.pt --iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=1 --pt_cleanup_activations=true --pt_enable_out_variant=1 --pt_optimize_memory=1 --compare_results=0 --do_profile=0 --adsfinder_compatibility=1
```

Reviewed By: hlu1

Differential Revision: D27569662

fbshipit-source-id: df68c2fdd95e39a30aec35ddbaf1f5df0bc3a3da
2021-04-19 23:08:19 -07:00
Edward Yang
f17c9ea2ed Port all unary float functions to structured (#56082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56082

The native_functions.yaml changes were done by codemod using the
following script:

```
import ruamel.yaml
from ruamel.yaml.tokens import CommentToken
from ruamel.yaml.error import CommentMark
from tools.codegen.model import *  # noqa: F403

with open("aten/src/ATen/native/native_functions.yaml", "r") as f:
    contents = f.read()

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.width = 1000
yaml.boolean_representation = ['False', 'True']
r = yaml.load(contents)

convert = '''\
acos
acosh
asin
asinh
atan
atanh
cos
cosh
digamma
erf
erfc
erfinv
exp
expm1
exp2
lgamma
log
log10
log1p
log2
reciprocal
sigmoid
sin
sinc
sinh
special_entr
sqrt
tan
tanh'''.split()

for e in r:
    f = NativeFunction.from_yaml(e, Location("", 0))
    if f.structured or f.structured_delegate is not None:
        continue
    n = f.func.name.name.base
    if n not in convert:
        continue
    # mutate e to make changes
    if f.func.kind() == SchemaKind.out:
        e.insert(1, 'structured', True)
        e.insert(2, 'structured_inherits', 'TensorIteratorBase')
    else:
        # TODO: The .out overload assumption is not sound in general
        e.insert(1, 'structured_delegate', f'{n}.out')

        e['dispatch'].pop('CPU', None)
        e['dispatch'].pop('CUDA', None)
        e['dispatch'].pop('CPU, CUDA', None)
        e['dispatch'].pop('CompositeExplicitAutograd', None)

        *_, last_k = e.keys()
        needs_fixup = False

        if not e['dispatch']:
            if last_k == 'dispatch':
                needs_fixup = True
            del e['dispatch']

        # Manually fix up newlines at the end, because ruamel
        # made some bad life choices about where to associate trailing
        # whitespace for nested dicts; see
        # https://stackoverflow.com/questions/42172399/modifying-yaml-using-ruamel-yaml-adds-extra-new-lines
        if needs_fixup:
            *_, last_k = e.keys()
            # post_key, pre_key, post_value, pre_value
            e.ca.items[last_k] = [None, None, CommentToken('\n\n', CommentMark(0), None), None]

with open("aten/src/ATen/native/native_functions.yaml.new", "w") as f:
    yaml.dump(r, f)
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D27777769

Pulled By: ezyang

fbshipit-source-id: 1ecbac7cb3e0093167bb61c7d2b1ecb95b8ae17c
2021-04-15 16:06:42 -07:00
CodemodService FBSourceClangFormatLinterBot
2f895f790a [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D27789747

fbshipit-source-id: ef4882e92d7755669083573c43ae6c5088bf01ab
2021-04-15 04:27:27 -07:00
Kurt Mohler
3fe4718d16 Add padding_idx argument to EmbeddingBag (#49237)
Summary:
This PR adds a `padding_idx` parameter to `nn.EmbeddingBag` and `nn.functional.embedding_bag`. As with `nn.Embedding`'s `padding_idx` argument, if an embedding's index is equal to `padding_idx` it is ignored, so it is not included in the reduction.
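Purely as an illustration (plain Python, not the actual ATen kernel), the semantics described above amount to a bag reduction that skips the padded index; the function name and data layout here are hypothetical:

```python
def embedding_bag_sum(weight, indices, padding_idx=None):
    """Sum the embedding rows selected by `indices`, skipping `padding_idx`.

    Sketch of the EmbeddingBag "sum" mode semantics this commit describes:
    an index equal to `padding_idx` contributes nothing to the reduction.
    """
    dim = len(weight[0])
    out = [0.0] * dim
    for i in indices:
        if i == padding_idx:
            continue  # padded entry is excluded from the reduction
        out = [a + b for a, b in zip(out, weight[i])]
    return out

weight = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# With index 0 as padding, only rows 1 and 2 are summed:
print(embedding_bag_sum(weight, [0, 1, 2], padding_idx=0))  # [5.0, 5.0]
```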

This PR does not add support for `padding_idx` for quantized or ONNX `EmbeddingBag` for opset10/11 (opset9 is supported). In these cases, an error is thrown if `padding_idx` is provided.

Fixes https://github.com/pytorch/pytorch/issues/3194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49237

Reviewed By: walterddr, VitalyFedyunin

Differential Revision: D26948258

Pulled By: jbschlosser

fbshipit-source-id: 3ca672f7e768941f3261ab405fc7597c97ce3dfc
2021-04-14 09:38:01 -07:00
Peng Wu
18662d4321 [Static runtime] refactor MemoryPlanner codes to prepare for output tensor memory planning (#55809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55809

[Static runtime] refactor MemoryPlanner codes to prepare for output tensor memory planning

Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'

Reviewed By: bwasti

Differential Revision: D27411416

fbshipit-source-id: 7dae7c2586ce3b4ebacf6169017140166c30e99c
2021-04-13 11:04:47 -07:00
Hao Lu
c3d0607ffa [Static Runtime] Make sure the copy version of the op exist in ReplaceWithCopy (#55337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55337

`static_runtime::permute_copy` is in fb-only folder. Because `caffe2/test/test_static_runtime.py` is in OSS, we can't load the fb-only operator library. The workaround is to check at runtime whether the op is registered or not.

Test Plan:
This fixed two of the broken tests:
```
    ✓ Pass: caffe2/test:static_runtime - test_multihead_attention_layer (test_static_runtime.TestStaticModule) (10.316)
    ✓ Pass: caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule) (16.134)
```

Reviewed By: ajyu

Differential Revision: D27577066

fbshipit-source-id: ac87dcde71f0d5140ccde448bb49aaebbbb5908a
2021-04-06 04:25:04 -07:00
Richard Barnes
d690973295 irange on int64_t (#55148)
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```
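To make the conversion concrete, here is a minimal self-contained sketch of the idea behind such an integer range — this is an illustration, not c10::irange's real implementation:

```cpp
#include <cstdint>

// A minimal integer-range type usable with range-based for.
// Illustration only; c10::irange is more general (start/end, bounds checks).
template <typename T>
struct IntRange {
  struct Iter {
    T v;
    T operator*() const { return v; }
    Iter& operator++() { ++v; return *this; }
    bool operator!=(const Iter& o) const { return v != o.v; }
  };
  T limit;
  Iter begin() const { return Iter{0}; }
  Iter end() const { return Iter{limit}; }
};

template <typename T>
IntRange<T> irange(T limit) { return IntRange<T>{limit}; }

// Before: for (int64_t i = 0; i < n; i++) sum += i;
// After:  the range-based form below.
int64_t sum_upto(int64_t n) {
  int64_t sum = 0;
  for (const auto i : irange<int64_t>(n)) {
    sum += i;
  }
  return sum;
}
```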

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27447811

fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
2021-04-05 16:14:00 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Hao Lu
b74795c460 [Pyper] resize_as_ -> resize_ (#55098)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55098

resize_as_ still goes through the dispatcher because it calls tensor.resize_. We can easily call resize_ directly while bypassing the dispatcher.

Reviewed By: swolchok

Differential Revision: D27457894

fbshipit-source-id: 8a5da185d1a6addafbf4915e29613013451b5e43
2021-04-01 11:17:40 -07:00
Ansha Yu
0cfd9e881f [static runtime] fix out variant for 4bit embedding bag (#55096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55096

There were issues with D26138322 (5b0a6482c1) that we didn't catch the first time around.
This (rebased on top of the to_copy fixes)  fixes the converted remote_ro c2/pt output comparison

Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_model=/data/users/ansha/tmp/adfinder/210494966_0.predictor.disagg.remote_request_only --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_net2.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.remote_request_only.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/remote_ro_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=0 --benchmark_c2_predictor=1 --do_benchmark=1
```

Reviewed By: hlu1

Differential Revision: D27477104

fbshipit-source-id: 5a95dfa7eae23566fadc3fec323ad03a34e6734d
2021-04-01 07:33:02 -07:00
Wenlei Xie
67d44377e3 Remove hacky wrapper for about 100 kernels (#54751)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54751

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54098
ghstack-source-id: 125141211

Test Plan:
buck build //caffe2/aten/...
BUILD_TENSOREXPR_BENCHMARK=ON BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install

Reviewed By: smessmer

Differential Revision: D27353530

fbshipit-source-id: 66f83edfb1016ca0040fb603e43604cd2db02c4c
2021-03-29 12:06:34 -07:00
Wenlei Xie
593295daac Migrate kernels with TensorOptions to C10 full dispatcher (#54539)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54539

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54468

ghstack-source-id: 125018630

# Facebook:
The following 2 files are changed on fb side:
```
// Should be hidden
```

Test Plan: buck build //caffe2/aten/...

Reviewed By: smessmer

Differential Revision: D27273744

fbshipit-source-id: 35c1bff63189477645008caaf0dc794096e3fcc4
2021-03-26 13:55:22 -07:00
Wenlei Xie
53596cdb73 Remove hacky wrapper for about 100 kernels (#54367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54367

Codemod commands generated by https://github.com/pytorch/pytorch/pull/54098
ghstack-source-id: 124804544

Test Plan: buck build //caffe2/aten/...

Reviewed By: smessmer

Differential Revision: D27210057

fbshipit-source-id: 368dc77843468cfc44535488a040dbc2cb67208d
2021-03-25 10:00:16 -07:00
Ansha Yu
afe339d7dd [static runtime] support DictConstruct (#54438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438

August 1x model has DictConstruct in the graph (P331168321)
These can be removed with a JIT pass, but to measure the improvement
and run the replayer with the model in the meantime, this diff enables DictConstruct in static runtime.

Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
    cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```

```
TEST trace
Total num requests                                   100
Num exceptions                                         0
Latency us avg                                    180965
Latency us p25                                     89785
Latency us p50                                    131240
Latency us p75                                    146621
Latency us p90                                    158378
Latency us p95                                    166628
Latency us p99                                   1886680
Latency us p100                                  3803252
Server latency us avg                              91554
Server latency us p25                              51447
Server latency us p50                              86371
Server latency us p75                              95229
Server latency us p90                             102706
Server latency us p95                             116023
Server latency us p99                             557017
Server latency us p100                            716319
Num rankUnits avg                                     28
```

Reviewed By: hlu1

Differential Revision: D27236682

fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
2021-03-23 21:20:03 -07:00
Hao Lu
52abd3bd7b [Static Runtime] Fix bug in reshape_copy (#54467)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54467

`at::native::copy_` requires src/dest to have the same sizes, which isn't true in reshape.

Test Plan: Added new test cases to cover this case.

Reviewed By: ajyu

Differential Revision: D27249617

fbshipit-source-id: 2c95175fa8564b3c648979445ad4314f97818852
2021-03-22 22:20:55 -07:00
Hao Lu
8294bff20d [StaticRuntime] Copy version of reshape/flatten (#54353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54353

The current implementation of reshape/flatten is problematic because the output is sometimes a tensor view and sometimes not; it depends entirely on the graph IR and the input shapes. Replacing them with the copy versions makes the behavior deterministic: the output is always a new tensor.

Reviewed By: ajyu, edvgha

Differential Revision: D26358525

fbshipit-source-id: ee7571317b061221a8d50083676cded388ce6f87
2021-03-20 16:55:30 -07:00
Brian Hirsh
779cae9e42 port at::pow to structured (#53669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53669

This PR does two things:
* Ports `pow` to be structured
* Fixes a bug with how pow handles mixed cpu and cuda tensors

**bug fix**
Pow is a binary op, and all binary ops that use TensorIterator are currently written to handle the case when one of the inputs is a CUDA tensor, and the other is a zero-dimensional cpu tensor.

`pow` incidentally only handles one of the two cases: it fails when the CUDA tensor is passed as the exponent, e.g. `at::pow(torch.tensor(2.0, device='cpu'), torch.tensor([2, 2], device='cuda'))`. Porting `pow` to structured happened to change the error that was outputted from a `TORCH_CHECK` in TensorIterator to an `INTERNAL_ASSERT` in loop.cuh, so I ended up trying to fix the error and update the tests. I added more details in a comment on the PR.

**notes on the structured port**
Pow is a little weird, so I wrote down a couple of issues I noticed during the port:
* Multiple independent overloads. `pow` has two overloads that have their own cpu/cuda kernels, meaning one doesn't call the other. I had to update the names of the kernel overloads to make the compiler happy, since the codegen would otherwise try to generate two classes with the same name. `pow` actually has 3 overloads that all have `out` variants, so I ported all 3 to structured; one of them just happens to redispatch one of the others in most cases.
* Name propagation. Is name propagation implemented per operator? Or is it expected to work for most/all ops by default? Right now it looks like it happens for TensorIterator ops by default. For ops that don't use TensorIterator, we need to explicitly pass the names through to the `set_output()` call in the meta function. This happened to matter for `pow` because it has 3 overloads, but only two of them directly use TensorIterator. I had to pass names directly to `set_output` in the 3rd overload to make tests happy.
* Lack of `const Tensor &` in the C++ API. It's a goal to slowly make all `Tensor &` arguments const as part of the structured port, but in this case I needed to explicitly cast constness away because one structured kernel called back into the C++ API, which still has ordinary `Tensor &` arguments. This probably isn't something we'll fix soon, since we have boxing logic that actually relies on the `Tensor &` / `const Tensor &` distinction in some places.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27029821

Pulled By: bdhirsh

fbshipit-source-id: c1786e770de6e6c2474b9a48210b88057ab1018e
2021-03-19 14:30:48 -07:00
Hao Lu
f1cbd10276 [PyPer] Port c2 add to pt (#54229)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54229

Because caffe2's add uses Eigen for add with broadcasting, which is not well supported in OSS PyTorch, it's easier to keep `c2_add_out` internal for now. Caffe2 does use the MKL add when the input dims of A and B are the same and no broadcasting is needed.

Reviewed By: bertmaher

Differential Revision: D27036279

fbshipit-source-id: 49f0ec5407ea1f641896f054cad2283faed81687
2021-03-19 12:45:11 -07:00
Brian Hirsh
bc4f521178 port at::mul to structured (#52692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52692

Porting `at::mul` to structured.

One other issue I hit with the port was the fact that there are a bunch of other places around the code base that used to call out to variants of `at::native::mul`, which no longer exists. *Technically*, `at::cpu::mul` does the equivalent thing now, so I patched most call-sites to use that. There were two other places where I did something slightly different (calling `at::cuda::mul` and `at::mul`, respectively), which I called out in the comments.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D27029822

Pulled By: bdhirsh

fbshipit-source-id: 6cc80de0dfccec304bf8e16a1823e733bed27bf4
2021-03-19 11:34:33 -07:00
Wenlei Xie
79534867ac Migrate about 100 kernel to C10 full dispatcher (#54109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54109

Codemod command generated by https://github.com/pytorch/pytorch/pull/54098

ghstack-source-id: 124114894

Test Plan: CI

Reviewed By: smessmer

Differential Revision: D27100359

fbshipit-source-id: 8338405274a2a020856af6e4a35a2fb21438f2a8
2021-03-17 13:35:39 -07:00
Edvard Ghazaryan
ce0fd095a8 Implemented embedding_bag for SR (#52429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52429

Implemented an out variant of embedding_bag for Static Runtime.

Before: Milliseconds per iter: 1.15443. Iters per second: 866.226

After: Milliseconds per iter: 1.14791. Iters per second: 871.149

Test Plan:
buck test caffe2/test:nn
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D26089498

fbshipit-source-id: c9ba7068d5aa696c8f37a4846d8e80c6379538d2
2021-03-12 17:52:27 -08:00
Hao Lu
a8ecf306da [Static Runtime] Remove dead code (#53588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53588

Remove `SRViewOperatorRegistry` and related code now that it's no longer needed.

Reviewed By: swolchok

Differential Revision: D26901367

fbshipit-source-id: fa73501cd785d4b89466cda81481aea892f8241f
2021-03-09 13:36:41 -08:00
Ansha Yu
7c0a4e78ca [static runtime] convert to->to_copy (#53524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53524

Add to->to_copy to the ReplaceWithCopy pass so that it plays well with
AliasDb.

Test Plan:
Run bench with CastedBatchOneHot fusion off
(https://www.internalfb.com/intern/diff/view-version/123230476/),
on adindexer and adfinder models

Reviewed By: hlu1

Differential Revision: D26887050

fbshipit-source-id: 3f2fb9e27783bcdeb91c8b4181575f059317aff1
2021-03-08 16:19:03 -08:00
Hao Lu
3236efa4de [Static Runtime] Call native resize_/resize_as_ as much as possible (#53425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53425

t.resize_ goes through the dispatcher. Replace with direct native calls
- t.resize_/resize_as_ -> at::native::resize_/resize_as_
- t.resize_({0}) -> fastResizeToZero(t)

Reviewed By: ajyu, edvgha

Differential Revision: D26836278

fbshipit-source-id: d1a95240099a35f5ece0de2eea50620ba8054ee5
2021-03-06 21:12:23 -08:00
Bram Wasti
56f8379802 [static runtime] Move all heavy constructor logic into InferenceModule (renamed to StaticModule) (#51564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564

Constructor logic was spread throughout InferenceModule and StaticRuntime. This diff unifies the two. After a lot of discussion on D25961626, it became apparent that `clone` is uglier than a cheap StaticRuntime.

This means StaticRuntime is effectively StaticModule and the only code in the new StaticRuntime is the `run` functions.

```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```

Changelist:
- Rename InferenceModule StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D25592967

fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
2021-03-05 10:15:26 -08:00
Hao Lu
63e0e88ccc [PyPer] More at::empty -> at::detail::empty_cpu (#53333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53333

- Add more variants to `create_empty_from` to take more args, like dtype/layout/device.
- Clean up stray at::empty uses, mostly in the out variants.

Reviewed By: ajyu

Differential Revision: D26799900

fbshipit-source-id: 6676d8043fead63208913ef3a28cabbae76e46bb
2021-03-05 00:16:51 -08:00
Ansha Yu
36180c1322 [static runtime] aten::to copy out variant (#52343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52343

aten::to returns self when the TensorOptions match and copy is set to false. For Static Runtime, we always copy. There isn't a separate op for a copying aten::to; instead, the same function is called
with different arguments.
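The aliasing hazard this works around can be sketched in plain Python (an illustration of the semantics, not torch's actual implementation — the class and method here are hypothetical):

```python
class Tensor:
    """Toy stand-in showing aten::to's return-self behavior."""

    def __init__(self, data, dtype):
        self.data, self.dtype = list(data), dtype

    def to(self, dtype, copy=False):
        if dtype == self.dtype and not copy:
            return self  # no-op: the "output" aliases the input
        return Tensor(self.data, dtype)  # out-of-place conversion

t = Tensor([1, 2], "float")
assert t.to("float") is t                  # aliases: unsafe to reuse its memory
assert t.to("float", copy=True) is not t   # forcing copy=True always gives a fresh tensor
```

This is why Static Runtime passes copy=True: a memory planner cannot safely reuse an output buffer that may alias the input.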

Test Plan:
On AdFinder local_ro:

Before:
0.896742
0.00824827 ms.    0.92773%. aten::to (5 nodes)

After:
0.88233
0.0056607 ms.   0.644675%. aten::to (5 nodes)

buck test mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D26477980

fbshipit-source-id: 8e8448092adff38c141af1ce27a10acd39c07dd1
2021-03-04 17:30:15 -08:00
Ansha Yu
d98839e53e [static runtime] register pow out variant (#52454)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52454

Test Plan:
adfinder local net
Before:
7.13307 ms/iter
0.0222672 ms.   0.311136%. aten::pow (1 nodes)
After:
7.10623 ms/iter
0.0174462 ms.   0.242774%. aten::pow (1 nodes)

Reviewed By: malfet, hlu1

Differential Revision: D26521717

fbshipit-source-id: 8d9279b59d37c8786a9eeccd0f54bd84c400c128
2021-03-03 21:33:11 -08:00
Scott Wolchok
38a34887ac [PyTorch] Fix missing move in {List,Tuple}Construct (#53206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53206

Copying the List in ListConstruct is 1 extra refcount bump. Copying the vector in TupleConstruct is 1 extra bump per tuple element.
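The cost being removed can be shown with a self-contained sketch (hypothetical helpers, not the actual ListConstruct code): copying an owning handle bumps the reference count, moving it does not.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Stand-in for an owning, refcounted list value.
using List = std::shared_ptr<std::vector<int>>;

long count_if_copied(List list) {
  auto out = list;             // copy: one extra refcount bump
  return out.use_count();
}

long count_if_moved(List list) {
  auto out = std::move(list);  // move: ownership transferred, no bump
  return out.use_count();
}
```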
ghstack-source-id: 123001815

Test Plan: Don't have a precise measurement but it's very roughly 0.5% off total time for AdIndexer inline_cvr based on wall time, and more like 1.2% based on change in perf profile.

Reviewed By: hlu1

Differential Revision: D26790670

fbshipit-source-id: 697ef82fe72a85719bf8ce28f2bb87fe56bbd8ad
2021-03-03 19:28:44 -08:00
Hao Lu
c0b31a5ba7 [StaticRuntime] Clean up (#53096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53096

- auto[&] -> const auto[&]
- clean up size() calls

Test Plan:
```
buck test //caffe2/torch/fb/sparsenn:test
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: ajyu

Differential Revision: D26747001

fbshipit-source-id: 6ec81310747d86f7c5d2d17202eef7e299ef610c
2021-03-02 18:51:09 -08:00
Bram Wasti
d4e64dad15 [static runtime] Register both TupleConstruct and ListConstruct as out variants (#52684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684

With alias analysis we get much more powerful registration and we can start removing "native" and fallback interpreted implementations.  `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations.  Ideally every node will run out of place every time.  Afaik, there's never a reason to disable it and we may want to remove that functionality.

This diff does introduce a "leak" in the memory management: containers are not cleaned up. This only happens when out variants are enabled.

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled

Reviewed By: maratsubkhankulov, hlu1

Differential Revision: D26515801

fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
2021-03-02 09:55:25 -08:00
Bram Wasti
2d67b76fa6 [static runtime] Add Alias analysis to Memory Management/Planning (#50060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060

Aliasing is currently mishandled in SR.

This diff fixes that issue entirely and allows us to avoid hard coded "view" registration.  I'll remove the macro in a follow up diff.

However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias."  Some care will need to be taken to actually make sure this is enforced going forward.

This diff
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04

$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36

$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5

$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: eellison

Differential Revision: D25581156

fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
2021-03-02 09:53:32 -08:00