Commit Graph

59 Commits

Author SHA1 Message Date
Hao Lu
33f206b865 [StaticRuntime] Replace StorageImpl with TensorImpl in MemoryPlanner (#56447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447

MemoryPlanner shouldn't manage StorageImpls; instead, it should manage the TensorImpls because the StorageImpl in Tensors can change.
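
A minimal C++ sketch, not from the patch, of how a Tensor's StorageImpl can change underneath a cached pointer while the TensorImpl stays valid:

```
#include <ATen/ATen.h>

void storage_can_change() {
  at::Tensor t = at::empty({4});
  c10::StorageImpl* cached = t.storage().unsafeGetStorageImpl();
  // set_() rebinds t to a different storage; the TensorImpl is unchanged,
  // but any plan keyed on the old StorageImpl is now stale
  t.set_(at::empty({8}).storage());
  TORCH_INTERNAL_ASSERT(cached != t.storage().unsafeGetStorageImpl());
}
```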

Test Plan: CI

Reviewed By: ajyu

Differential Revision: D27840361

fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
2021-04-20 23:04:01 -07:00
Peng Wu
1a116a9332 [Static runtime] Add optimize_graph_output_memory flag (#55811)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55811

- Added a manage_graph_output_memory flag to opts (default false)
- Added a dependency check among the enable_out_variant, optimize_graph_output_memory, and optimize_memory flags (see the sketch below)
- Minor refactoring for readability
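
A minimal sketch of the dependency check, with struct and function names assumed for illustration:

```
#include <c10/util/Exception.h>

struct StaticRuntimeOptionsSketch {
  bool enable_out_variant = true;
  bool optimize_memory = true;
  bool optimize_graph_output_memory = false; // default false per this diff
};

void CheckFlagDependencies(const StaticRuntimeOptionsSketch& opts) {
  if (opts.optimize_graph_output_memory) {
    // output-memory planning only makes sense when out variants produce
    // tensors the MemoryPlanner can manage
    TORCH_CHECK(
        opts.enable_out_variant && opts.optimize_memory,
        "optimize_graph_output_memory requires enable_out_variant "
        "and optimize_memory to be enabled");
  }
}
```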

Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'

Reviewed By: hlu1

Differential Revision: D27573780

fbshipit-source-id: 28698657f686f27b8ad60e1276cdf17402d2cf91
2021-04-14 15:41:18 -07:00
Peng Wu
18662d4321 [Static runtime] refactor MemoryPlanner code to prepare for output tensor memory planning (#55809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55809

[Static runtime] refactor MemoryPlanner code to prepare for output tensor memory planning

Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'

Reviewed By: bwasti

Differential Revision: D27411416

fbshipit-source-id: 7dae7c2586ce3b4ebacf6169017140166c30e99c
2021-04-13 11:04:47 -07:00
Ailing Zhang
c6d9ca0c2b [reland] Replace AutoNonVariableTypeMode with InferenceMode in static runtime. (#55731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55731

Forgot to export the diff in my last one. Retry...

Test Plan:
https://www.internalfb.com/intern/aibench/details/3752129704
https://www.internalfb.com/intern/aibench/details/1306815519

Reviewed By: hlu1

Differential Revision: D27694660

fbshipit-source-id: b351338fa789b9e9c7337df9b1bc1bc0fc387f5d
2021-04-12 09:48:20 -07:00
Ailing Zhang
5a8cdc2fdb Revert D27691509: Replace AutoNonVariableTypeMode with InferenceMode in static runtime.
Test Plan: revert-hammer

Differential Revision:
D27691509 (d695ba94f6)

Original commit changeset: d43db028a399

fbshipit-source-id: 8cfa2f821ef3251b323483691672ed70858d9d68
2021-04-09 20:36:20 -07:00
Ailing Zhang
d695ba94f6 Replace AutoNonVariableTypeMode with InferenceMode in static runtime.
Test Plan:
https://www.internalfb.com/intern/aibench/details/3752129704
https://www.internalfb.com/intern/aibench/details/1306815519

Reviewed By: hlu1

Differential Revision: D27691509

fbshipit-source-id: d43db028a399bb02166a539577f6922237145f83
2021-04-09 20:04:00 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision:
D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Peng Wu
fe2c1268b7 More name refactoring of memory planning code to make it more readable (#54272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54272

Test Plan: Imported from OSS

Reviewed By: bwasti

Differential Revision: D27233881

fbshipit-source-id: f257f16ac0684df055961e539f17d002cb8f1bfe
2021-03-24 19:52:35 -07:00
Ansha Yu
afe339d7dd [static runtime] support DictConstruct (#54438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438

The August 1x model has DictConstruct in the graph (P331168321).
These could easily be removed with a JIT pass, but to measure the improvement
and run the replayer with the model in the meantime, enable DictConstruct in static runtime.
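
A hedged sketch of what supporting DictConstruct amounts to: building a c10::Dict from the node's interleaved key/value inputs. The function name and calling convention here are assumptions, not the actual implementation:

```
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>
#include <vector>

c10::IValue dictConstruct(
    const std::vector<c10::IValue>& inputs,
    const c10::TypePtr& key_type,
    const c10::TypePtr& value_type) {
  auto dict = c10::impl::GenericDict(key_type, value_type);
  // inputs arrive in pairs: k0, v0, k1, v1, ...
  for (size_t i = 0; i + 1 < inputs.size(); i += 2) {
    dict.insert_or_assign(inputs[i], inputs[i + 1]);
  }
  return dict;
}
```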

Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
    cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```

```
TEST trace
Total num requests                                   100
Num exceptions                                         0
Latency us avg                                    180965
Latency us p25                                     89785
Latency us p50                                    131240
Latency us p75                                    146621
Latency us p90                                    158378
Latency us p95                                    166628
Latency us p99                                   1886680
Latency us p100                                  3803252
Server latency us avg                              91554
Server latency us p25                              51447
Server latency us p50                              86371
Server latency us p75                              95229
Server latency us p90                             102706
Server latency us p95                             116023
Server latency us p99                             557017
Server latency us p100                            716319
Num rankUnits avg                                     28
```

Reviewed By: hlu1

Differential Revision: D27236682

fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
2021-03-23 21:20:03 -07:00
Peng Wu
c06d979731 [Static Runtime] Name refactoring to make MemoryPlanning more readable (#54045)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54045

Test Plan: Imported from OSS

Reviewed By: bwasti

Differential Revision: D27233880

fbshipit-source-id: 43b38901d8cfea0941a1a2934997a08027b57b6d
2021-03-23 14:28:43 -07:00
Hao Lu
ca429fedd3 [StaticRuntime] Fuse SigridTransforms + ListUnpack (#53920)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920

Fusing SigridTransforms + ListUnpack allows for enabling out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.

The speedup comes from three parts: 1) eliminating memory allocation inside SigridTransforms itself, 2) reducing memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) eliminating ListUnpack. However, for 3) we still pay the cost of constructing a `vector<Tensor>` for the outputs and a round of refcount bumps for all the output TensorImpls.
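
A hedged sketch of the fusion transform; the fused op name and the exact matching conditions are assumptions, and only the JIT graph-surgery pattern is the point:

```
#include <torch/csrc/jit/ir/ir.h>
#include <utility>
#include <vector>

void FuseSigridTransformsListUnpack(std::shared_ptr<torch::jit::Graph>& graph) {
  using namespace torch::jit;
  const auto sigrid = Symbol::fromQualString("fb::sigrid_transforms");
  const auto fused =
      Symbol::fromQualString("fb::sigrid_transforms_list_unpack");
  // collect matches first so we don't mutate the graph while iterating
  std::vector<std::pair<Node*, Node*>> matches;
  for (Node* n : graph->nodes()) {
    if (n->kind() == sigrid && n->output()->uses().size() == 1) {
      Node* u = n->output()->uses()[0].user;
      if (u->kind() == c10::prim::ListUnpack) {
        matches.emplace_back(n, u);
      }
    }
  }
  for (auto& m : matches) {
    Node* n = m.first;
    Node* unpack = m.second;
    // the fused node produces the unpacked tensors directly, so each
    // output can get an out variant and be managed by the MemoryPlanner
    Node* f = graph->create(fused, n->inputs(), unpack->outputs().size());
    f->insertBefore(n);
    for (size_t i = 0; i < unpack->outputs().size(); ++i) {
      unpack->output(i)->replaceAllUsesWith(f->output(i));
    }
    unpack->destroy();
    n->destroy();
  }
}
```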

Reviewed By: ajyu

Differential Revision: D26220546

fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
2021-03-17 19:58:02 -07:00
Hao Lu
04d5278cb6 [Static Runtime] Only run ReplaceWithCopy pass when enable_out_variant is true (#54111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111

If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.
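
A minimal sketch of the gating, with the surrounding function names assumed:

```
#include <torch/csrc/jit/ir/ir.h>

void ReplaceWithCopy(std::shared_ptr<torch::jit::Graph>& graph); // the pass

void PrepareGraphSketch(
    std::shared_ptr<torch::jit::Graph>& graph,
    bool enable_out_variant) {
  // only introduce the copy ops when their out variants will actually be
  // used; otherwise no default (non-out) implementation needs registering
  if (enable_out_variant) {
    ReplaceWithCopy(graph);
  }
}
```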

Reviewed By: edvgha

Differential Revision: D27036077

fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
2021-03-16 22:06:33 -07:00
Hao Lu
4932342363 [Static Runtime] Fix bug in ClipRangesGatherRangesX2SigridHash (#53799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799

Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two step graph rewrite process. If step 2 doesn't happen after step 1, then we're stuck with a graph with a `fb::placeholder` op that can't run. Step 3 is added to revert step 1 so we restore the original graph if there's any `fb::placeholder` op left.
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph and fusion can happen right afterwards.

Reviewed By: ajyu

Differential Revision: D26920008

fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
2021-03-12 13:15:44 -08:00
Bram Wasti
56f8379802 [static runtime] Move all heavy constructor logic into InferenceModule (renamed to StaticModule) (#51564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564

Constructor logic was spread throughout InferenceModule and StaticRuntime.  This diff unifies the two.  After a lot of discussion on D25961626, it became apparent that `clone` is uglier than a cheap StaticRuntime.

This means StaticRuntime is effectively StaticModule and the only code in the new StaticRuntime is the `run` functions.

```
// prepare the graph and schema from a TorchScript module
auto [graph, schema] = PrepareForStaticModule(torchscript_module);
StaticModule sm(graph, schema, options);
sm(inputs);
// or create many cheap runtimes that share the module
StaticRuntime sr(sm);
sr(inputs);
```

Changelist:
- Rename InferenceModule to StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D25592967

fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
2021-03-05 10:15:26 -08:00
Hao Lu
35364c3641 [static runtime] Enable ClipRangesGatherRangesX2SigridHash fusion for SigridHashPrecompute (#53324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53324

Reviewed By: maratsubkhankulov

Differential Revision: D26833478

fbshipit-source-id: 55ab63faf5b535f2acd2ec5dc5721f5b692832d7
2021-03-04 22:01:08 -08:00
Hao Lu
ac668c55e5 [Static Runtime] Remove dead code in MemoryPlanner and rename unmanaged_value_set to unmanaged_ivalue_set
Test Plan:
```
buck test mode/opt //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test -- --run-disabled
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: bwasti

Differential Revision: D26827700

fbshipit-source-id: a8696af3e1d2b504fa5754f823b389d45b48af38
2021-03-04 17:37:43 -08:00
Hao Lu
d90d7245f4 [PyPer] Optimize sigrid_hash (#53065)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53065

Reviewed By: ajyu

Differential Revision: D26563512

fbshipit-source-id: a1a76f92ba500605ab2e3370737bd3965d81deb1
2021-03-03 01:31:53 -08:00
Bram Wasti
d4e64dad15 [static runtime] Register both TupleConstruct and ListConstruct as out variants (#52684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684

With alias analysis we get much more powerful registration and we can start removing "native" and fallback interpreted implementations.  `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations.  Ideally every node will run out of place every time.  Afaik, there's never a reason to disable it and we may want to remove that functionality.

This diff does introduce a "leak" in the memory management: containers are not cleaned up. This only happens when out variants are enabled.

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled

Reviewed By: maratsubkhankulov, hlu1

Differential Revision: D26515801

fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
2021-03-02 09:55:25 -08:00
Bram Wasti
2d67b76fa6 [static runtime] Add Alias analysis to Memory Management/Planning (#50060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060

Aliasing is currently mishandled in SR.

This diff fixes that issue entirely and allows us to avoid hard coded "view" registration.  I'll remove the macro in a follow up diff.

However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias."  Some care will need to be taken to actually make sure this is enforced going forward.
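
A sketch of the kind of query this enables, using torch::jit::AliasDb; the surrounding planner code is assumed:

```
#include <torch/csrc/jit/ir/alias_analysis.h>

// an output that may alias an input cannot be freely reused by the memory
// planner; ops that "sometimes alias" would make this answer unstable,
// hence the assumption called out above
bool outputsMayAliasInputs(torch::jit::AliasDb& db, torch::jit::Node* n) {
  for (torch::jit::Value* out : n->outputs()) {
    for (torch::jit::Value* in : n->inputs()) {
      if (db.mayAlias(in, out)) {
        return true;
      }
    }
  }
  return false;
}
```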

Benchmarks with this diff:
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04

$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36

$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5

$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: eellison

Differential Revision: D25581156

fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
2021-03-02 09:53:32 -08:00
Hao Lu
7a178a8a52 [Static Runtime] Add memory alloc/dealloc time to benchmark (#52902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52902

Add more metrics to track memory_alloc_time, memory_dealloc_time, and output_dealloc_time.

Reviewed By: maratsubkhankulov

Differential Revision: D26660715

fbshipit-source-id: 96c6cfac2d2ec66d4c31c84129721a846c3914f0
2021-02-25 22:55:14 -08:00
Hao Lu
72f9b3c8d5 [StaticRuntime] Add function to check for memory leak (#52342)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52342

Reviewed By: yinghai

Differential Revision: D26420826

fbshipit-source-id: 4023f80fadd21e192afa485d96acd37c845146be
2021-02-19 19:45:09 -08:00
Scott Wolchok
edf8130e9e [PyTorch] Add set_data_ptr_noswap & use where possible (#52244)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52244

`StorageImpl::set_data_ptr` returns the old pointer and thus has to do extra
work. Found because `std::swap<at::DataPtr>` was showing up in
profiling, although at < 1%.
ghstack-source-id: 121795131
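
A simplified sketch of the difference; the bodies are inferred from the description above, not copied from the source:

```
#include <c10/core/Allocator.h>
#include <utility>

struct StorageImplSketch {
  c10::DataPtr data_ptr_;
  // old API: must swap so the previous pointer can be returned, even
  // when the caller immediately discards it
  c10::DataPtr set_data_ptr(c10::DataPtr&& data_ptr) {
    std::swap(data_ptr_, data_ptr);
    return std::move(data_ptr);
  }
  // new API: the caller promises it doesn't need the old pointer,
  // so a plain overwrite suffices
  void set_data_ptr_noswap(c10::DataPtr&& data_ptr) {
    data_ptr_ = std::move(data_ptr);
  }
};
```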

Test Plan:
Run AdIndexer benchmark under `perf stat`.

Before:
```
         17,990.01 msec task-clock                #    0.998 CPUs utilized            ( +-  0.43% )
             6,550      context-switches          #    0.364 K/sec                    ( +- 31.42% )
                 3      cpu-migrations            #    0.000 K/sec                    ( +-  7.14% )
           103,820      page-faults               #    0.006 M/sec                    ( +-  2.47% )
    35,610,511,494      cycles                    #    1.979 GHz                      ( +-  0.40% )  (50.03%)
    71,651,045,779      instructions              #    2.01  insn per cycle           ( +-  0.07% )  (50.02%)
    11,679,947,910      branches                  #  649.246 M/sec                    ( +-  0.10% )  (50.03%)
        69,088,927      branch-misses             #    0.59% of all branches          ( +-  0.24% )  (50.06%)
```

After:
```
         17,896.20 msec task-clock                #    0.999 CPUs utilized            ( +-  0.24% )
             4,011      context-switches          #    0.224 K/sec                    ( +- 27.77% )
                 3      cpu-migrations            #    0.000 K/sec
           100,350      page-faults               #    0.006 M/sec                    ( +-  1.58% )
    35,418,702,208      cycles                    #    1.979 GHz                      ( +-  0.23% )  (50.05%)
    71,449,334,935      instructions              #    2.02  insn per cycle           ( +-  0.09% )  (50.03%)
    11,652,819,899      branches                  #  651.134 M/sec                    ( +-  0.12% )  (50.04%)
        69,744,411      branch-misses             #    0.60% of all branches          ( +-  0.53% )  (50.06%)
```

The cycles difference is within the noise, but it looks like we have a
0.28% instruction count win, which is outside the noise (and fits with
the intuition that this should be better).

Reviewed By: hlu1

Differential Revision: D26437297

fbshipit-source-id: bf0fceccf6ad78f1497b03ccb4cdfd1a21c6846c
2021-02-17 12:42:21 -08:00
Hao Lu
4949eea0ff [StaticRuntime] Clean up output references and remove dead code (#52237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52237

Redo D26331506 (4c58be4573). Get rid of `nodiscard` which broke OSS CI.

- Clean up references of outputs, including Tuples/Lists, by using move semantics
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for the corner case of Tuple/List elements being inputs.
- Modify unit tests to check the use_counts of outputs
- Clean up dead code. A bit of overlap with D25592967, but it shouldn't be a problem.

This diff does not try to fix the alias problem with the MemoryPlanner.

Reviewed By: swolchok

Differential Revision: D26432539

fbshipit-source-id: e08990e4066c1ce69ad5274860851d012b7be411
2021-02-13 20:05:28 -08:00
Mike Ruberry
992d251c39 Revert D26333953: [StaticRuntime] Clean up output references and remove dead code
Test Plan: revert-hammer

Differential Revision:
D26333953 (0c9d72b5e1)

Original commit changeset: cadc0595ad6a

fbshipit-source-id: 75d0b33099342653cd8867b129139325789aee6c
2021-02-12 02:12:31 -08:00
Hao Lu
0c9d72b5e1 [StaticRuntime] Clean up output references and remove dead code (#51991)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51991

- Clean up references of outputs, including Tuples/Lists, by using move semantics
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for the corner case of Tuple/List elements being inputs.
- Modify unit tests to check the use_counts of outputs
- Clean up dead code. A bit of overlap with D25592967, but it shouldn't be a problem.

This diff does not try to fix the alias problem with the MemoryPlanner.

(Note: this ignores all push blocking failures!)

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```

Reviewed By: bwasti

Differential Revision: D26333953

fbshipit-source-id: cadc0595ad6ab754c4f1f7a5a3733b2c16b3102f
2021-02-12 01:11:08 -08:00
Hao Lu
4c58be4573 [StaticRuntime] Clean up input references (#51952)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51952

StaticRuntime should not hold owning refs of inputs after inference is finished. This diff adds a pass to clean them up and unit tests to enforce the check.

Will clean up output tensors in separate diffs.
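
A hedged sketch of the cleanup pass, with names assumed; the unit tests can then assert that input use counts drop back to whatever the caller holds:

```
#include <ATen/core/ivalue.h>
#include <vector>

// reg holds all graph IValues; input_regs are the indices of graph inputs
void clean_up_input_ivalues(
    std::vector<c10::IValue>& reg,
    const std::vector<size_t>& input_regs) {
  for (size_t idx : input_regs) {
    reg[idx] = c10::IValue(); // drop the runtime's owning reference
  }
}
```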

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```

Reviewed By: bwasti

Differential Revision: D26331506

fbshipit-source-id: d395a295ada9de3033d0ea05d1dbab62d879a03b
2021-02-11 13:46:19 -08:00
Hao Lu
11cda929fb [StaticRuntime] Fix bug in MemoryPlanner (#51342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342

There is a subtle bug with the MemoryPlanner with regard to view ops with out variant.

```
  def forward(self, a: Tensor, shape: List[int]):
      b = a.reshape(shape)
      return b + b
```
In this case, if we replace reshape with the out variant, b is managed by the MemoryPlanner, and its storage is set to nullptr right after inference when opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.

To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.

Test Plan:
Add unit test to enforce the constness of inputs

```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: ajyu

Differential Revision: D26144203

fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
2021-01-29 21:16:02 -08:00
Hao Lu
d035d56bfb [StaticRuntime] Add out variant for reshape and flatten (#51249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249

- Add out variants for reshape and flatten. reshape and flatten only create tensor views when they can; in cases where they can't, they do a copy. The out variant reuses the TensorImpl in both cases; the difference is that the TensorImpl is a view in the first case but a normal TensorImpl in the second (sketched below).
- Create a separate registry for the view ops with out variants. Because tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage; however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.
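
A simplified C++ sketch of the view-vs-copy split described in the first bullet; the function and its use of infer_size/computeStride are illustrative assumptions rather than the actual implementation:

```
#include <ATen/ATen.h>
#include <ATen/InferSize.h>
#include <ATen/TensorUtils.h>

void reshape_out_sketch(at::Tensor& out, const at::Tensor& self,
                        at::IntArrayRef proposed_shape) {
  auto shape = at::infer_size(proposed_shape, self.numel());
  auto stride =
      at::detail::computeStride(self.sizes(), self.strides(), shape);
  if (stride.has_value()) {
    // view case: out's TensorImpl is reused as a view of self's storage
    out.set_(self.storage(), self.storage_offset(), shape, *stride);
  } else {
    // copy case: out's TensorImpl is reused as a normal tensor that
    // owns its own storage
    out.resize_(shape);
    out.copy_(self.reshape(proposed_shape));
  }
}
```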

Reviewed By: ajyu

Differential Revision: D25992202

fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
2021-01-27 22:44:11 -08:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Bram Wasti
ace1680b68 [static runtime] Remove register concept by giving ownership to the nodes (#50050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50050

Every node will now own its outputs.
I don't expect any big improvements perf-wise from this diff; the only eliminated code is from deallocate_registers.
Largely, this is to enable more optimizations going forward.
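
A minimal sketch of the ownership change, with member names assumed:

```
#include <ATen/core/ivalue.h>
#include <torch/csrc/jit/ir/ir.h>
#include <vector>

struct ProcessedNodeSketch {
  torch::jit::Node* node;
  std::vector<c10::IValue> outputs; // owned here, not in a register bank
  c10::IValue& Output(size_t i) {
    return outputs[i];
  }
};
```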

Test Plan:
buck test mode/dev //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/test:static_runtime

Reviewed By: hlu1

Differential Revision: D25571181

fbshipit-source-id: 91fcfbd5cd968af963ba89c45656997650ca6d18
2021-01-07 10:19:58 -08:00
Bram Wasti
3ffe9e0f43 [static runtime] refine fusion group (#49340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49340

This refines the fusion group to include only certain types of operations.  We cannot safely handle "canRunNatively" types, and the memonger pass causes regressions on some internal models, so it was disabled (to be revisited with proper memory optimization once tensor pools are implemented).

Test Plan:
```
buck test mode/no-gpu caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: ZolotukhinM

Differential Revision: D25520105

fbshipit-source-id: add61d103e4f8b4615f5402e760893ef759a60a9
2020-12-15 12:57:35 -08:00
Scott Wolchok
743a4ef0ae [PyTorch] Enable AutoNonVariableTypeMode in static runtime (#49199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49199

This should save us an extra round of dispatch for resize_,
resize_as_, detach_, and copy_, at the cost of disabling profiling and
tracing. I'm told that static runtime has its own per-op profiling and
we don't need tracing.
ghstack-source-id: 118348314
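
A sketch of the guard usage (at::AutoNonVariableTypeMode was the RAII guard of this era; the later commits above replace it with InferenceMode). The include paths are assumed:

```
#include <ATen/ATen.h>
#include <ATen/core/LegacyTypeDispatch.h>

at::Tensor run_one_inference(const at::Tensor& input) {
  // while the guard is alive, ops dispatch below the autograd layer, so
  // resize_/resize_as_/detach_/copy_ skip a dispatch round trip;
  // profiling and tracing are disabled as a side effect
  at::AutoNonVariableTypeMode guard(true);
  return input.relu();
}
```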

Test Plan:
Code review to confirm lack of need for profiling &
tracing, and that there isn't a different switch we should be using
instead.

Internal benchmarks -- seeing 11-12% improvement in overall runtime

Reviewed By: hlu1

Differential Revision: D25476819

fbshipit-source-id: 71e2c919b386b25c41084e2e4a54fe765a4f8f22
2020-12-10 21:51:59 -08:00
Bram Wasti
f4226b5c90 [static runtime] add static subgraph fusion pass (#49185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49185

This diff adds a fusion feature that will let us use static runtime for *parts* of the graph.  This will prove useful in cases where fully eliminating control flow is hard, etc.

TODO:
[x] factor out into separate fusion file
[x] add python test case
[x] add graph that isn't fully lowered test case
[x] add graph that has weird list/tuple outputs test case

the loop example looks quite good:
```
graph(%a.1 : Tensor,
      %b.1 : Tensor,
      %iters.1 : int):
  %12 : bool = prim::Constant[value=1]() # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:110:4
  %c.2 : Tensor = prim::StaticSubgraph_0(%a.1, %b.1)
  %c : Tensor = prim::Loop(%iters.1, %12, %c.2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:110:4
    block0(%i : int, %c.12 : Tensor):
      %c.10 : Tensor = prim::StaticSubgraph_1(%a.1, %c.12, %b.1)
      -> (%12, %c.10)
  return (%c)
with prim::StaticSubgraph_0 = graph(%0 : Tensor,
      %4 : Tensor):
  %5 : int = prim::Constant[value=2]()
  %6 : Tensor = aten::mul(%4, %5) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:109:12
  %2 : int = prim::Constant[value=1]()
  %c.2 : Tensor = aten::add(%0, %6, %2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:109:8
  return (%c.2)
with prim::StaticSubgraph_1 = graph(%1 : Tensor,
      %7 : Tensor,
      %8 : Tensor):
  %9 : int = prim::Constant[value=1]()
  %c.4 : Tensor = aten::add(%7, %8, %9) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:111:12
  %5 : int = prim::Constant[value=2]()
  %c.7 : Tensor = aten::mul_(%c.4, %5) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:112:8
  %2 : int = prim::Constant[value=1]()
  %c.10 : Tensor = aten::sub_(%c.7, %1, %2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:113:8
  return (%c.10)
```

(Note: this ignores all push blocking failures!)

Test Plan:
buck test mode/no-gpu //caffe2/benchmarks/static_runtime:static_runtime_cpptest

buck test mode/no-gpu caffe2/test:static_runtime

Reviewed By: bertmaher

Differential Revision: D25385702

fbshipit-source-id: 2f24af4f11d92a959167facd03fbd24f464a6098
2020-12-10 14:03:11 -08:00
Bram Wasti
274ce26fd8 [static runtime] Add Internal Ops to the registry (#48616)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48616

This adds a couple of _out variants and registers them in the registry.

I also added the concept of "canReuse{Input,Output}" so that we can annotate tensors that are not optimizable (specifically, non-float tensors).

In the future we can change this (see D25062301).

After removing `RecordFunction`, we see these results:

```
BS=20
 ---
caffe2:           0.651617 ~ 0.666354
static runtime:   0.753481
pytorch:          0.866658

BS=1
 ---
caffe2:           0.0858684 ~ 0.08633
static runtime:   0.209897
pytorch:          0.232694
```

Test Plan: standard internal test of ads model against caffe2 reference (see the scripts in this quip: https://fb.quip.com/ztERAYjuzdlr)

Reviewed By: hlu1

Differential Revision: D25066823

fbshipit-source-id: 25ca181c62209a4c4304f7fe73832b13e314df80
2020-12-08 09:32:38 -08:00
Ansha Yu
07978bd62e [static runtime] fuse inference ops (1) (#48948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48948

Fuse inference ops for the following inside static runtime:
ConcatAddMulReplaceNaNClip
CastedBatchOneHotLengths
ConcatBatchMatMulBatchGather

TODO:
1. add unit tests
2. add more restrictions on the graph transform (e.g. check inputs, check outputs not used elsewhere)

Test Plan:
Run adindexer model with static runtime and fusion; check ops
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adindexer/traced_precomputation2.pt --pt_inputs=/data/users/ansha/tmp/adindexer/merge/container_precomputation_bs1.pt --iters=3000 --warmup_iters=10000  --num_threads=1 --pred_net=/data/users/ansha/tmp/adindexer/precomputation_merge_net.pb --c2_inputs=/data/users/ansha/tmp/adindexer/merge/c2_inputs_precomputation_bs1.pb --c2_sigrid_transforms_opt=1 --c2_use_memonger=1 --c2_weights=/data/users/ansha/tmp/adindexer/merge/c2_weights_precomputation.pb --pt_enable_static_runtime
```
transformed model graph contains the fused ops: P151559641

Results before fusion: P151567611
Results after fusion: P151566783 (8% speedup for bs=20, 14% speedup for bs=1)

Reviewed By: hlu1

Differential Revision: D25224107

fbshipit-source-id: c8442e8ceb018879c61ce564367b1c1b9412601b
2020-12-08 05:54:49 -08:00
Scott Wolchok
55b93735ac [PyTorch] Save refcount decrements in StaticRuntime::deallocate_registers (#48859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48859

Code comment should explain what's going on. If not, please request changes.
ghstack-source-id: 117889942

Test Plan: Internal benchmarks

Reviewed By: hlu1

Differential Revision: D25288842

fbshipit-source-id: 6bddebb99c4744e2f7aceb279fdf995821404606
2020-12-04 21:47:00 -08:00
Scott Wolchok
0f9823d888 [PyTorch] Save some space in ProcessedNode (#48861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48861

`std::function` already has an empty state; no need to wrap
it in `c10::Optional`.
ghstack-source-id: 117891382
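
A tiny illustration of the point; nothing here is static-runtime-specific:

```
#include <functional>

void sketch() {
  std::function<void()> fn; // default-constructed: the empty state
  if (fn) {                 // contextual bool is false when empty
    fn();
  }
  // wrapping it in c10::optional stored a second, redundant "empty"
  // flag; dropping the wrapper shrinks ProcessedNode
}
```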

Reviewed By: hlu1

Differential Revision: D25296912

fbshipit-source-id: 8291bcf11735d49db17415b5de915591ee65f781
2020-12-04 14:42:20 -08:00
Hao Lu
4976208e73 [caffe2] Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator (#48161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48161

- Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator
- Use the AllocationArenaPool in both BlackBoxPredictor and StaticRuntime

Test Plan:
```
buck run //caffe2/caffe2/fb/predictor:black_box_predictor_test
buck run //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
AF canary:
https://www.internalfb.com/intern/ads/canary/431021257540238874/

Reviewed By: dzhulgakov

Differential Revision: D24977611

fbshipit-source-id: 33ba596b43c1e558c3ab237a0feeae93565b2d35
2020-11-30 15:03:34 -08:00
Bram Wasti
0984d3123a [static runtime] add more _out variants (#48260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48260

supporting a couple more operators

Test Plan:
use Ansha's test framework for e2e test

```
numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --pred_net=/home/bwasti/adindexer/precomputation_merge_net.pb --c2_inputs=/home/bwasti/adindexer/c2_inputs_precomputation_bs1.pb --c2_weights=/home/bwasti/adindexer/c2_weights_precomputation.pb --scripted_model=/home/bwasti/adindexer/traced_precomputation_partial_dper_fixes.pt --pt_inputs=/home/bwasti/adindexer/container_precomputation_bs1.pt --iters=30000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true --pt_cleanup_activations=true --pt_enable_out_variant=true --eps 1e-2
```

Reviewed By: hlu1

Differential Revision: D24767322

fbshipit-source-id: dce7f9bc0427632129f263bad509f0f00a21ccf3
2020-11-20 17:05:21 -08:00
Hao Lu
c5dae335e4 [PT][StaticRuntime] Move prim op impl to ops.cpp (#48210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48210

- Move prim op implementation from `ProcessedNode::run` to `getNativeOperation`
- Add out variant for `prim::listConstruct`

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test

buck run mode/dev //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1 --warmup_iters=1 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=true
```

Reviewed By: ajyu

Differential Revision: D24748947

fbshipit-source-id: 12caeeae87b69e60505a6cea31786bd96f5c8684
2020-11-18 23:07:39 -08:00
Bram Wasti
cb046f7bd2 [static runtime] Initial memonger (#47759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47759

Parity reached :)

*/0 -> no memonger
*/1 -> memonger on
We can see that the impact is large when activations don't all fit in cache (6x speedup on this microbenchmark):
```
BM_long_static_memory_optimization/2/0         8563 ns       8559 ns      86370
BM_long_static_memory_optimization/8/0         8326 ns       8322 ns      84099
BM_long_static_memory_optimization/32/0       11446 ns      11440 ns      56107
BM_long_static_memory_optimization/512/0    6116629 ns    6113108 ns        128
BM_long_static_memory_optimization/2/1         8151 ns       8149 ns      87000
BM_long_static_memory_optimization/8/1         7905 ns       7902 ns      85124
BM_long_static_memory_optimization/32/1       10652 ns      10639 ns      66055
BM_long_static_memory_optimization/512/1    1101415 ns    1100673 ns        641
```

TODO:
[x] implementation
[x] enable/disable flag
[x] statistics about memory saved
[x] additional models

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```

Reviewed By: yinghai

Differential Revision: D24824445

fbshipit-source-id: db1f5239f72cbd1a9444017e20d5a107c3b3f043
2020-11-17 13:55:49 -08:00
Hao Lu
996f444c00 [pt][static_runtime] Memory model (#46896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896

The idea of the memory model is quite similar to that of BlackBoxPredictor; however, it's more complicated in PyTorch due to 1) tensor views that share storage (with storage refcount bumps) but have different TensorImpls, 2) tensors that share the same TensorImpl and the same storage with no refcount bump of the StorageImpl, 3) data types such as TensorList and Tuple that contain Tensors, and 4) the need to support a mix of out and non-out variants while we move the aten ops to out variants.

As a result, I have to make the following adjustments:
1) remove tensors in output Tuples from the internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variants, extract the StorageImpls from those Tensors, dedup, remove the output tensors' StorageImpls, and arrive at the final list of blobs for memory planning (see the sketch after this list);
3) during the clean_up_memory pass, clean up memory held by the StorageImpls, as well as Tensors/Lists/Tuples in IValues that don't participate in memory planning, to reduce overall memory usage.
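
A hedged C++ sketch of adjustment 2): collect candidate StorageImpls, dedup, and exclude graph-output storages (all names assumed):

```
#include <ATen/ATen.h>
#include <unordered_set>
#include <vector>

std::vector<c10::StorageImpl*> collect_managed_storages(
    const std::vector<at::Tensor>& candidates,
    const std::unordered_set<c10::StorageImpl*>& output_storages) {
  std::unordered_set<c10::StorageImpl*> seen;
  std::vector<c10::StorageImpl*> managed;
  for (const at::Tensor& t : candidates) {
    c10::StorageImpl* s = t.storage().unsafeGetStorageImpl();
    // dedup: views can hand us the same StorageImpl many times
    if (seen.insert(s).second && output_storages.count(s) == 0) {
      managed.push_back(s);
    }
  }
  return managed;
}
```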

Risk:
The PyTorch team is planning to deprecate the current resize_output API, which we rely on. This is a pretty big risk.

https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23

Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```

|pt_cleanup_activations	|pt_enable_out_variant	|old ms/iter	|new ms/iter	|
|---	|---	|---	|---	|
|0	|0	|0.31873	|0.30228	|
|0	|1	|0.30018	|0.29184	|
|1	|0	|0.35246	|0.31895	|
|1	|1	|0.35742	|0.30417	|

Reviewed By: bwasti, raziel

Differential Revision: D24471854

fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
2020-11-03 23:47:59 -08:00
Hao Lu
d6519d4e9f [pt][static_runtime] Add option enable_out_variant (#46690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46690

- Add option enable_out_variant to Static Runtime
- Add gflags --pt_cleanup_activations and --pt_enable_out_variant to the benchmark script

Reviewed By: yinghai, houseroad

Differential Revision: D24438107

fbshipit-source-id: c1185c0fee93edc0118542b2faa8bc4ffdd19075
2020-10-22 15:00:23 -07:00
Hao Lu
1a3ea46dbf [StaticRuntime] Threading model (#46219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219

- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
  explicit InferenceModule(const torch::jit::Module& m);
  explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
  torch::jit::Module module;
  std::shared_ptr<torch::jit::Graph> graph;
  std::unique_ptr<c10::FunctionSchema> schema;

  std::unordered_map<Value*, size_t> value_to_reg;
  std::vector<size_t> input_regs; // inputs to the graph
  std::vector<size_t> output_regs; // outputs of the graph
  std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor as well as the static runtime, and shared across threads. This is what's left inside the Static Runtime:
```
  mutable std::vector<IValue> reg_;
  // The nodes we need to run
  std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations, and differs across threads at runtime. `nodes_` holds the op nodes and input/output registers, and is the same across threads for now. We could potentially put other stateful data structures in it, so I kept it inside the static runtime. It could easily be moved into the `InferenceModule` if we decide not to put anything else into `ProcessedNode`.

- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.

- Integration with PyTorchPredictor. Added a lock-free stack in the PyTorchPredictor to hold all the static runtime instances. A benchmark shows that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.

This diff focuses on threading model only. Benchmarks will be separate.

Reviewed By: bwasti

Differential Revision: D24237078

fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
2020-10-20 14:37:30 -07:00
Mikhail Zolotukhin
e5ed037529 [StaticRuntime] Add a 'speed of light' benchmark. (#46308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308

This PR adds a hand-optimized version of the DeepAndWide model with the goal
of estimating the overheads of static runtime. While static runtime is
currently much faster than the existing JIT interpreter, it is
useful to understand how close we are to an absolutely zero-overhead
system. Currently, this "ideal" implementation is 2x faster than the
static runtime at batch size 1.

Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1                         59518 ns      59500 ns      10909
BM_deep_wide_base/8                         74635 ns      74632 ns       9317
BM_deep_wide_base/20                        82186 ns      82147 ns       9119
BM_deep_wide_fast/1                         13851 ns      13851 ns      49825 << new
BM_deep_wide_fast/8                         22497 ns      22497 ns      32089 << new
BM_deep_wide_fast/20                        23868 ns      23841 ns      31184 << new
BM_deep_wide_jit_graph_executor/1           62786 ns      62786 ns      10835
BM_deep_wide_jit_graph_executor/8           76730 ns      76718 ns       7529
BM_deep_wide_jit_graph_executor/20          78886 ns      78883 ns       8769
BM_deep_wide_jit_profiling_executor/1       69504 ns      69490 ns      10309
BM_deep_wide_jit_profiling_executor/8       75718 ns      75715 ns       9199
BM_deep_wide_jit_profiling_executor/20      75364 ns      75364 ns       9010
BM_deep_wide_static/1                       40324 ns      40318 ns      17232
BM_deep_wide_static/8                       50327 ns      50319 ns      13335
BM_deep_wide_static/20                      53075 ns      53071 ns      12855
BM_deep_wide_static_threaded/threads:8       6258 ns      49873 ns      14008
```

PS: The implementation could probably be optimized even more.

Differential Revision: D24300702

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Pulled By: ZolotukhinM

fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
2020-10-19 23:35:55 -07:00
Hao Lu
ea4fbb2e5e [StaticRuntime] Replace hashtable based workspace with vector<IValue> (#45892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45892

Previously we were using a hashtable (`std::unordered_map` in OSS, `folly::F14FastMap` in fb) for the workspace, a container for all the IValues in the graph. Hashtable lookups can be expensive. This diff replaces the hashtable with `std::vector`, and extra bookkeeping is introduced to keep track of the indices of graph inputs/outputs in `StaticRuntime` and op inputs/outputs in `ProcessedNode` (see the sketch below).
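
A minimal sketch of the new bookkeeping, with member names assumed for illustration:

```
#include <ATen/core/ivalue.h>
#include <vector>

struct WorkspaceSketch {
  std::vector<c10::IValue> values;  // every IValue in the graph
  std::vector<size_t> input_regs;   // indices of graph inputs
  std::vector<size_t> output_regs;  // indices of graph outputs
  c10::IValue& reg(size_t i) {
    return values[i]; // O(1) indexed access, no hashing
  }
};
```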

Reviewed By: dzhulgakov

Differential Revision: D24098763

fbshipit-source-id: 337f835ee144985029b5fa2ab98f9bcc5e3606b6
2020-10-08 09:50:30 -07:00
Hao Lu
e8d8de32b4 [StaticRuntime] Implement StaticRuntime::benchmark (#45639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45639

`StaticRuntime::run_individual` mimics the caffe2 operator benchmark `SimpleNet::TEST_Benchmark`, so we can get accurate information on the per-operator breakdown. We found that the PyTorch AutogradProfiler adds a lot of overhead to small models such as the adindexer precomputation_merge net: 100% for batch_size 1, 33% for batch_size 20. This implementation adds very little overhead, as shown in the test plan.

Test Plan: Test results are fb internal only.

Reviewed By: yinghai, dzhulgakov

Differential Revision: D24012088

fbshipit-source-id: f32eb420aace93e2de421a15e4209fce6a3d90f0
2020-10-06 20:54:43 -07:00
Hao Lu
2b48dd168d [StaticRuntime] Integrate Static Runtime into PyTorchPredictor (#45640)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45640

Reviewed By: dzhulgakov

Differential Revision: D23996656

fbshipit-source-id: 63d88c89d1df61a04deadc472319607ed83867e5
2020-10-02 23:03:05 -07:00
Bram Wasti
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00