134 Commits

Author SHA1 Message Date
Don Jang
ad89d994c9 [Static Runtime] Support recordio format input for benchmark (#67530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67530

Currently `ptvsc2_predictor_bench` only uses the first input of a given recordio file even when the recordio file contains many inputs.

This change extends `StaticRuntime::benchmark` to accept multiple input entries so that we can benchmark more extensively and realistically using all the inputs in the recordio file.
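
As a rough illustration of benchmarking over every input entry instead of only the first, here is a minimal sketch. `benchmarkAll` and its parameters are hypothetical names chosen for this example; they are not the signature of `StaticRuntime::benchmark`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the StaticRuntime API): runs `model` over every
// input entry `iters` times and returns the mean latency per run in ms.
template <class Model, class Input>
double benchmarkAll(Model&& model, const std::vector<Input>& inputs, int iters) {
  using Clock = std::chrono::steady_clock;
  if (inputs.empty() || iters <= 0) {
    return 0.0;
  }
  const auto start = Clock::now();
  for (int i = 0; i < iters; ++i) {
    for (const auto& input : inputs) {
      model(input);  // every entry is exercised, not just inputs[0]
    }
  }
  const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
  const double runs = static_cast<double>(iters) * static_cast<double>(inputs.size());
  return elapsed.count() / runs;
}
```

Averaging over all entries keeps one unusually cheap or expensive input from dominating the reported latency.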

Test Plan:
Tested `ptvsc2_predictor_bench` with / without this change executing the following command:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/home/djang/ads/adfinder/ctr_mobilefeed/302008423/302008423_0.predictor.disagg.local  --recordio_inputs=/home/djang/ads/adfinder/ctr_mobilefeed/302008423/302008423.local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=0 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Reviewed By: hlu1

Differential Revision: D31947382

fbshipit-source-id: 4188271613aad201f8cad5f566e0dfed26680968
2021-10-29 14:38:14 -07:00
Scott Wolchok
9f01937caf [PyTorch][easy] Deduplicate memory planner creation code (#67265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67265

Avoid repeating this initialization code.
ghstack-source-id: 141585971

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31933368

fbshipit-source-id: 6342ae9bb82c4d152a427bad142470c3d162de69
2021-10-28 14:13:43 -07:00
Mike Iovine
8363da3f92 [SR][C2][easy] Benchmarks report # of ops (#67436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67436

This information is useful for comparing static runtime to c2

Reviewed By: d1jang

Differential Revision: D31991571

fbshipit-source-id: eb83bc4564b05d56fb9a550863eea3f6312f3f6c
2021-10-28 13:03:09 -07:00
Mike Iovine
72e25c9f4e [Static Runtime][DI] Add variadic grouped_accessor_op (#66289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66289

Add a variadic version of `grouped_accessor_op` to eliminate list construction overhead and associated refcount bumps in static runtime.

Test Plan:
Accuracy test with model 294738512_40: passes with 0 errors.
Accuracy test with model 296213501_65 (has V2 op): passes with 0 errors.

**Perf impact**

TW replayer test w/ 800 QPS (stacked with D31620408) shows ~5% CPU decrease for storage tier.
Results:

{F673610665}

Reviewed By: hlu1

Differential Revision: D31482816

fbshipit-source-id: 14393da122cefd094c3e4f423beb897c1d17b32c
2021-10-27 12:29:33 -07:00
Scott Wolchok
6ce14e7b51 [PyTorch][Static Runtime] Cleanup: add valueVecFromFastSet (#66996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66996

We do this conversion a few times, and further diffs (which I'm trying to keep as small as possible) will do it more.
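
A minimal sketch of what such a conversion helper could look like. The `FastSet` alias and the `Value` placeholder below are stand-ins assumed for this example; the real definitions in Static Runtime may differ.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Assumed stand-in for the runtime's FastSet; the real alias may wrap a
// different hash set implementation.
template <class T>
using FastSet = std::unordered_set<T>;

struct Value {};  // placeholder for torch::jit::Value

// Copies the set's pointers into a vector. The element order is unspecified
// because the source container is unordered.
std::vector<const Value*> valueVecFromFastSet(const FastSet<const Value*>& s) {
  std::vector<const Value*> result;
  result.reserve(s.size());
  for (const Value* v : s) {
    result.push_back(v);
  }
  return result;
}
```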
ghstack-source-id: 141496817

Test Plan: CI

Reviewed By: mikeiovine

Differential Revision: D31821037

fbshipit-source-id: 1d3b54cadaedd53189aec6a35ed1a126c6fe4824
2021-10-26 14:47:15 -07:00
Mike Iovine
83355f9537 [SR][easy] Alias for c10::Symbol::fromQualString (#67162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67162

It's a bit annoying/ugly to type `c10::Symbol::fromQualString` everywhere, and we can't do `using c10::Symbol::fromQualString` since it's a static class function.
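
The pattern can be sketched as a thin free-function wrapper. The `Symbol` class below is a hypothetical stand-in for `c10::Symbol`, and the wrapper's name is an assumption; the alias actually added by this change may be named differently.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for c10::Symbol, just enough to show the pattern.
class Symbol {
 public:
  static Symbol fromQualString(const std::string& s) { return Symbol(s); }
  const std::string& name() const { return name_; }

 private:
  explicit Symbol(std::string s) : name_(std::move(s)) {}
  std::string name_;
};

// A using-declaration at namespace scope cannot name a class member, so
// `using Symbol::fromQualString;` does not compile. A small inline wrapper
// gives the same short spelling.
inline Symbol fromQualString(const std::string& s) {
  return Symbol::fromQualString(s);
}
```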

Test Plan: CI

Reviewed By: d1jang

Differential Revision: D31887042

fbshipit-source-id: 073a56c72281c20284a9feef741aed96b58a921d
2021-10-26 06:09:17 -07:00
Hao Lu
0c1b7545b6 [Static Runtime] Add more debug info to verify_no_memory_overlap() (#67206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67206

The memory overlap check still checks the memory overlap for alias ops. It only skips the check for inplace ops. This needs to be fixed if we want to use the memory overlap check in prod.

This diff only adds more debug info. It doesn't fix the aforementioned problem.

Reviewed By: d1jang

Differential Revision: D31889866

fbshipit-source-id: 05a80ace3d404f66f21a8bbdc9678485ff76c8d3
2021-10-26 01:48:41 -07:00
Mike Iovine
a0495b3cdb [SR] Remove unused operator() overload (#67001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001

The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.

This new overload will probably take a forwarding reference, so to avoid problems with overloading on forwarding reference and simplify the interface, it's best to remove this unused one.

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

`buck test caffe2/test:static_runtime`

Reviewed By: hlu1

Differential Revision: D31821990

fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
2021-10-25 08:18:58 -07:00
Mike Iovine
364645cd9d [SR] Factor operator() implementation into separate function (#67125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67125

Using explicit template instantiations in D31659973 (f2582a59d0) was a bad idea. The problem is that the lvalue instantiation was for a `const` vector of `IValue`, meaning that if you tried to pass SR a non-const vector of arguments, the linker would fail to find the symbol.

We didn't catch this in D31659973 (f2582a59d0) because predictor always passes a `const` reference anyway. But we should fix this to prevent unexpected problems in the future.

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D31873406

fbshipit-source-id: 5ab5a03334bed925cec11facadcedf9bec9b90ad
2021-10-25 08:17:40 -07:00
Mike Iovine
f2582a59d0 [SR] Add rvalue overload for operator() (#66648)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648

Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:

1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```

The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
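
A minimal sketch of the forwarding pattern, using `std::string` in place of `IValue` and a hypothetical `Runtime` class. It only illustrates how `std::forward` selects between the two `set_inputs` overloads; it is not the actual Static Runtime code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a runtime that stores its inputs in slots.
class Runtime {
 public:
  // Lvalue overload: copies the arguments.
  void set_inputs(const std::vector<std::string>& args) {
    inputs_ = args;
    moved_ = false;
  }
  // Rvalue overload: takes over the caller's storage instead of copying.
  void set_inputs(std::vector<std::string>&& args) {
    inputs_ = std::move(args);
    moved_ = true;
  }
  // Forwarding reference: std::forward preserves the caller's value
  // category, so overload resolution picks the matching set_inputs.
  template <class ArgList>
  std::size_t operator()(ArgList&& args) {
    set_inputs(std::forward<ArgList>(args));
    return inputs_.size();
  }
  bool last_call_moved() const { return moved_; }

 private:
  std::vector<std::string> inputs_;
  bool moved_ = false;
};
```

Because `operator()` is a template defined in the class, it accepts const and non-const lvalues as well as rvalues without any explicit instantiations.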

Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.

Reviewed By: hlu1

Differential Revision: D31659973

fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
2021-10-22 10:51:47 -07:00
Mike Iovine
391eb1dbe3 [JIT] UseVariadicOp handles multiple lists (#66288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66288

This change makes it so `UseVariadicOp` can transform ops with many Tensor list inputs.

Input pattern:
```
%output : Type = op(%list_1, %arg_1, %list_2, %list_3)
```
Output pattern:
```
%output : Type = variadic_op(%list_11, ..., %list_1N, %arg_1, %list_21, ..., %list_2M, %list_31, ..., %list_3K, N, M, K)
```
The length of each list is passed at the end of the variadic op so that the op implementation can process the inputs appropriately. This also frees us from needing to update `hasVarArgs` in static runtime each time we add a variadic op.

This diff also makes `UseVariadicOp` more robust. Before, `list_idx` was passed as an argument. Now, `VariadicUpdater` determines `list_idx` from the node's schema.
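
To illustrate how an op implementation might use the trailing lengths, here is a simplified sketch. It assumes the flattened arguments are plain `int`s and that there are no interleaved non-list arguments such as `%arg_1`; `splitVariadicArgs` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Splits a flattened argument vector whose last `numLists` entries are the
// lengths of the lists that precede them. Non-list arguments are not modeled.
std::vector<std::vector<int>> splitVariadicArgs(
    const std::vector<int>& flat, std::size_t numLists) {
  if (flat.size() < numLists) {
    throw std::invalid_argument("missing list lengths");
  }
  const std::size_t lengthsBegin = flat.size() - numLists;
  std::vector<std::vector<int>> lists;
  lists.reserve(numLists);
  std::size_t pos = 0;
  for (std::size_t i = 0; i < numLists; ++i) {
    const int len = flat[lengthsBegin + i];
    if (len < 0 || pos + static_cast<std::size_t>(len) > lengthsBegin) {
      throw std::invalid_argument("list lengths exceed the argument count");
    }
    const auto first = flat.begin() + static_cast<std::ptrdiff_t>(pos);
    lists.emplace_back(first, first + len);
    pos += static_cast<std::size_t>(len);
  }
  return lists;
}
```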

Test Plan:
Existing variadic ops do not break:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31450811

fbshipit-source-id: 808fcc3ae8940b9e602586f38f8cf9154c9a6462
2021-10-22 10:22:33 -07:00
Don Jang
051ea5ccbf [Static Runtime] Bundle function & function_kind to carry them together (#66974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66974

`D31591785 (67e003f09b)` started carrying a function object to be executed and `FunctionKind` for the type of the function *separately*, and this caused a bug fixed by D31783028 (79803b199f).

This change bundles them again, as swolchok had them before, to reduce the chance of such a mistake in the future. They must always be carried together since `FunctionKind` identifies the type of the function object.

Note that `struct Function` is a POD type, so accessing its field (first, second) shouldn't cause an extra overhead in `ProcessedNode::run()`.
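
A minimal sketch of the bundling idea. The field names, the callable's signature, and the `Node` class are assumptions made for illustration and do not mirror the real `ProcessedNode`.

```cpp
#include <cassert>
#include <functional>
#include <utility>

enum class FunctionKind { kOutVariant, kNativeFunction, kInterpreterFallback };

// Keeps the callable and its kind in one object so that neither can be
// updated or copied without the other.
struct Function {
  std::function<void(int&)> fn;
  FunctionKind kind;
};

// Hypothetical node type holding the bundle as a single member.
class Node {
 public:
  explicit Node(Function f) : function_(std::move(f)) {}
  void run(int& state) const { function_.fn(state); }
  FunctionKind kind() const { return function_.kind; }

 private:
  Function function_;
};
```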

Test Plan:
Confirmed that the managed memory metrics remain the same before/after this diff on inline_cvr:

```
#AFTER
# inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
# inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)
# inline_cvr/remote_ro
First iter time: 12.0344 ms
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```

```
#BEFORE
#  inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

#inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)

# inline_cvr/remote_ro
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```

Reviewed By: mikeiovine

Differential Revision: D31798419

fbshipit-source-id: fd4301b6731e402be0820729654735c791511aba
2021-10-22 08:57:49 -07:00
Mike Iovine
ab1e4eac42 [Static Runtime] Add FuseListUnpackV2 (#66509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66509

Like `FuseListUnpack`, but instead of adding arguments to the fused node's outputs, inserts a new fused op.

By using a new fused op, we can avoid runtime `is_fused` checks. This will make the op implementations significantly cleaner. Eventually, we will migrate all ops to `V2` and delete the old pass.

`FuseListUnpackV2` also fixes the bug described in T103159043.

Test Plan: I've made some changes to D31550307 locally and verified that everything works.

Reviewed By: hlu1

Differential Revision: D31492017

fbshipit-source-id: 4f90fcbc17e4c70a3d65985bee836fabf868a22c
2021-10-20 16:39:32 -07:00
Don Jang
67e003f09b [Static Runtime] Determine function for ProcessedNode::run() statically (#66692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66692

Currently `ProcessedNode::run()` performs 2 dynamic dispatches to decide which function implementation to execute, depending on whether the function is an out variant, a native function, or an interpreter fallback. Note that this happens every time Static Runtime executes an operation.

This change makes *that* same decision during module loading time once so that we can remove 1 dynamic dispatch cost at runtime.

**size reduction**

Saving 4 bytes per `ProcessedNode`.

- Before: sizeof(c10::variant<OutVariant, NativeFunction, Operation>):40

- After: sizeof(std::function<void(ProcessedNode*)>): 32 + sizeof(FunctionKind):4 = 36

**latency optimization**

Expected to remove 2 memory loads & 1 conditional jump per `ProcessedNode::run()` execution (needs to be confirmed from compiled binary code).

Ran `ptvsc2_predictor_bench` with `inline_cvr` with 1000 iterations:
- local : 7.56026 -> 7.24794
- local_ro: 1.5799 -> 1.55504
- remote_ro: 10.6464 -> 10.3017
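
The load-time selection described above can be sketched roughly as follows. The registries, the `int(int)` callable signature, and the `Node` class are hypothetical; the sketch only shows the kind being decided once at construction rather than on every `run()`.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

enum class FunctionKind { kOutVariant, kNativeFunction, kInterpreterFallback };
using OpFn = std::function<int(int)>;

// Hypothetical registries keyed by op name.
struct Registries {
  std::map<std::string, OpFn> outVariants;
  std::map<std::string, OpFn> nativeFunctions;
};

class Node {
 public:
  // The lookup and the kind decision happen once, at construction.
  Node(const std::string& op, const Registries& r, OpFn fallback) {
    const auto out = r.outVariants.find(op);
    const auto native = r.nativeFunctions.find(op);
    if (out != r.outVariants.end()) {
      fn_ = out->second;
      kind_ = FunctionKind::kOutVariant;
    } else if (native != r.nativeFunctions.end()) {
      fn_ = native->second;
      kind_ = FunctionKind::kNativeFunction;
    } else {
      fn_ = std::move(fallback);
      kind_ = FunctionKind::kInterpreterFallback;
    }
  }
  // run() only invokes the preselected callable; it never inspects kind_.
  int run(int x) const { return fn_(x); }
  FunctionKind kind() const { return kind_; }

 private:
  OpFn fn_;
  FunctionKind kind_ = FunctionKind::kInterpreterFallback;
};
```

The call through `std::function` is still one indirect call, which matches the summary's claim of removing one of the two dispatches rather than both.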

Test Plan: Ran existing unittests

Reviewed By: swolchok

Differential Revision: D31591785

fbshipit-source-id: 5de83ca386af509381e08ecedf071ee4e9f0f0b0
2021-10-15 14:07:24 -07:00
Scott Wolchok
e88d1c4f10 [PyTorch] Add tuple inline storage (#64066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64066

I noticed a bunch of time being spent heap-allocating Tuples
in the unpickler. 1-, 2-, and 3-element Tuples are apparently common
enough that they get their own bytecode instructions, so I decided to
try also giving them their own representation. We store up to 3
IValues inline in `Tuple` rather than doing a second heap allocation
for a `std::vector<IValue>`.
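
A simplified sketch of the inline-storage idea, using `int` elements and a hypothetical `SmallInts` class. The real `TupleElements` stores `IValue`s and is laid out differently; this version keeps an empty `std::vector` member alongside the inline array purely for brevity, so it shows the allocation-avoiding branch rather than the space savings.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Holds up to three ints inline; larger inputs fall back to a heap vector.
// For three or fewer elements, heap_ stays empty and never allocates.
class SmallInts {
 public:
  explicit SmallInts(const std::vector<int>& values) : size_(values.size()) {
    if (size_ <= kInline) {
      for (std::size_t i = 0; i < size_; ++i) {
        inline_[i] = values[i];
      }
    } else {
      heap_ = values;
    }
  }
  bool isInline() const { return size_ <= kInline; }
  std::size_t size() const { return size_; }
  int operator[](std::size_t i) const {
    return isInline() ? inline_[i] : heap_[i];
  }

 private:
  static constexpr std::size_t kInline = 3;
  std::array<int, kInline> inline_{};
  std::vector<int> heap_;
  std::size_t size_;
};
```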
ghstack-source-id: 140695395

Test Plan:
Added automated tests for TupleElements.

Pixel 3 before: https://www.internalfb.com/intern/aibench/details/761596366576284
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/591414145082422
We went from 347 ms to 302 ms.

Reviewed By: dhruvbird

Differential Revision: D30592622

fbshipit-source-id: 93625c54c9dca5f765ef6d5c191944179cb281a8
2021-10-15 12:16:51 -07:00
Hao Lu
6310eb30d1 [SR] Clean up GetLivenessMap (#66606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66606

- Remove dead code (see comment for where)
- Add debug prints
- Small reorganization of the code to improve readability

Reviewed By: d1jang

Differential Revision: D31568219

fbshipit-source-id: 50240c325bf4fd012e1947ac931bb67c6f5dfafb
2021-10-13 23:55:40 -07:00
Hao Lu
6634570aef [SR] Fix bug in ValueGroup (#66470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66470

Reviewed By: d1jang

Differential Revision: D31566348

fbshipit-source-id: e0f634af77d893bbc8d66f214b2b8bdd6ab58cc3
2021-10-13 19:26:38 -07:00
Scott Wolchok
d30397d42a [PyTorch][Static Runtime] Don't use vector in ProcessedNode (#65429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429

The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
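
One way to picture the change is a fixed-length array that keeps only a pointer and a size, with no capacity field. `FixedArray` is a hypothetical name for this sketch and is not the type used in `ProcessedNode`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Fixed-length array: stores a pointer and a size. There is no capacity
// field because the length never changes after construction.
template <class T>
class FixedArray {
 public:
  // Value-initializes all n elements.
  explicit FixedArray(std::size_t n) : data_(new T[n]()), size_(n) {}
  T& operator[](std::size_t i) { return data_[i]; }
  const T& operator[](std::size_t i) const { return data_[i]; }
  std::size_t size() const { return size_; }

 private:
  std::unique_ptr<T[]> data_;
  std::size_t size_;
};
```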
ghstack-source-id: 140532722

Test Plan:
CI

I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.

Reviewed By: hlu1

Differential Revision: D31096042

fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
2021-10-13 19:13:20 -07:00
Don Jang
736fa09a9a [Static Runtime] Manage output tensors (#65515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515

This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:

- At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensor objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple).
- The 2nd and later iterations preallocate a slab of memory and all output tensors identified during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them. This means SR cannot be reentered until the client calls `deallocateOutputTensors`.
- In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camelCase for function names, so all newly introduced functions use camelCase despite the inconsistency with existing snake_case names. We can fix the inconsistencies gradually.
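
The client contract in the list above (release the outputs before running again, otherwise an exception is thrown) can be sketched with a small guard. The `Runtime` class, its `run()` method, and the `float` outputs are hypothetical simplifications, not the real `StaticRuntime` interface.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical runtime that hands out preallocated outputs and refuses to
// run again until the client releases them.
class Runtime {
 public:
  const std::vector<float>& run() {
    if (outputsOutstanding_) {
      throw std::runtime_error(
          "outputs from the previous run were not deallocated");
    }
    outputs_.assign(4, 1.0f);  // stands in for the managed output tensors
    outputsOutstanding_ = true;
    return outputs_;
  }
  // The client calls this once it has finished using the outputs.
  void deallocateOutputTensors() {
    outputs_.clear();
    outputsOutstanding_ = false;
  }

 private:
  std::vector<float> outputs_;
  bool outputsOutstanding_ = false;
};
```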

This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_predictor_bench` as a testbed.

Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.

- Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing testcases (e.g., `StaticRuntime.EmbeddingBag`).

Reviewed By: hlu1

Differential Revision: D31049221

fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
2021-10-11 09:50:54 -07:00
Scott Wolchok
5a67ffe0ad [PyTorch][Static Runtime] Combine ProcessedNode::{native_,}fn_ (#65414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65414

Saves 24 bytes (`sizeof(std::function) - 8`) per ProcessedNode.
ghstack-source-id: 139999909

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31085561

fbshipit-source-id: 70734b8319e805736ba41aedaaf7fa3d463400c9
2021-10-08 18:11:59 -07:00
Scott Wolchok
3ef69a4598 [static runtime] Pre-allocate hash tables (#65343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65343

No reason not to save a bit on re-hashing.
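
As a generic illustration of the technique (not the code touched by this change), reserving buckets up front lets a hash table absorb a known number of insertions without rehashing. `buildIndex` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Builds an index from key to position. reserve() sizes the bucket array
// up front, so inserting keys.size() elements does not trigger a rehash.
// For duplicate keys, emplace keeps the first position seen.
std::unordered_map<int, std::size_t> buildIndex(const std::vector<int>& keys) {
  std::unordered_map<int, std::size_t> index;
  index.reserve(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    index.emplace(keys[i], i);
  }
  return index;
}
```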
ghstack-source-id: 140052518

Test Plan:
CI

Static runtime startup seems to go from 5.9-6.0s to 5.8-6.0s; perf shows less time spent rehashing.

Reviewed By: mikeiovine

Differential Revision: D31027362

fbshipit-source-id: 39dd53ecd462693b518535856ddd92df78a4977b
2021-10-08 10:28:13 -07:00
Don Jang
416f593080 [Static Runtime] Group graph nodes into input aliases & output aliases (#65517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517

This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:

- input_aliases:  values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are created and consumed internally within the graph.

`output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance.

Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.

Reviewed By: hlu1

Differential Revision: D30940955

fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
2021-10-07 14:35:12 -07:00
Mike Iovine
057a01556c [Static Runtime] Do not use variadic_sigrid_transforms_torch_bind if out variant is disabled (#66221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66221

JIT doesn't have an implementation for this op, so we can only use it when out variants are enabled.

Reviewed By: hlu1

Differential Revision: D31445887

fbshipit-source-id: 4565ac4df751d8ee4052647574c43efa05ea1452
2021-10-07 06:57:17 -07:00
Mike Iovine
a5e6b2b2e3 [Static Runtime] Add variadic sigrid_transforms_torch_bind (#63960)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63960

Reviewed By: hlu1

Differential Revision: D30529880

fbshipit-source-id: 1c4be2f9c0944bbe1e1c146989588c96bfd14eda
2021-10-05 16:00:36 -07:00
Hao Lu
a6ad2b41ac [Static Runtime] Make module_ optional in StaticModule (#65882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65882

`torch::jit::Module` is refcounted. There is no need to wrap it in a `shared_ptr`.

Test Plan:
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: mikeiovine

Differential Revision: D31012222

fbshipit-source-id: 74d234bd85423e5ba0e396f24899631354a2c74b
2021-09-30 22:48:49 -07:00
Don Jang
4176afc4a0 [Static Runtime] Disable SigridTransform + ListUnpack fusion when outputs reachable from graph output (#62697)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62697

Reviewed By: hlu1

Differential Revision: D29979402

fbshipit-source-id: 913e8396a0530ce3617211112a2b1147ef2e9df9
2021-09-29 22:47:48 -07:00
Mike Iovine
b003b2a9c0 [Static Runtime] Add record functions (#64698)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64698

Reviewed By: hlu1

Differential Revision: D30747191

fbshipit-source-id: 7ded6ea9bd36b5e3343d1efa9f3c92e02ff6d7f8
2021-09-24 07:20:17 -07:00
Raghavan Raman
14307f7a56 [Static Runtime] Added logging to dump the model graphs (#65509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65509

With this change, we can get dumps of the model graphs by setting the env variable `PYTORCH_JIT_LOG_LEVEL=">>impl"` while running the model.

Test Plan: buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: mikeiovine

Differential Revision: D31125797

fbshipit-source-id: d8979a4e138047518140e0eaecb46e012891b17c
2021-09-23 10:06:13 -07:00
Raghavan Raman
31584d065e [Static Runtime] Added NNC implementation for signed log1p kernel. (#65387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387

Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.
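
For reference, a scalar sketch of the computation, assuming the fused op computes `sign(x) * log1p(|x|)` (the four unfused ops would then be sign, abs, log1p, and mul). This is plain C++ for illustration, not the NNC kernel, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference, under the stated assumption: sign(x) * log1p(|x|).
inline float signedLog1p(float x) {
  const float sign = static_cast<float>((x > 0.0f) - (x < 0.0f));
  return sign * std::log1p(std::fabs(x));
}

// Applies the formula in one pass over the input instead of running four
// separate elementwise ops that each materialize an intermediate result.
std::vector<float> signedLog1pAll(const std::vector<float>& in) {
  std::vector<float> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = signedLog1p(in[i]);
  }
  return out;
}
```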

Also, added a SR microbenchmark for this kernel which shows the performance improvement.

Without fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                             1953 ns       1953 ns     358746
BM_signed_log1p/64                             2049 ns       2049 ns     342145
BM_signed_log1p/512                            3291 ns       3291 ns     214342
BM_signed_log1p/4096                          15559 ns      15559 ns      44420
BM_signed_log1p/32768                        101936 ns     101935 ns       6843
BM_signed_log1p/65536                        194792 ns     194789 ns       3615
```

With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                              369 ns        369 ns    1896179
BM_signed_log1p/64                              497 ns        497 ns    1406995
BM_signed_log1p/512                            1618 ns       1618 ns     430209
BM_signed_log1p/4096                          11327 ns      11326 ns      61463
BM_signed_log1p/32768                         84099 ns      84086 ns       8325
BM_signed_log1p/65536                        166531 ns     166510 ns       4186
```

This shows a performance improvement for this kernel with NNC fusion, ranging from about 14% at size 65536 to about 81% at size 16.

On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
  without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
  with NNC fusion: `0.55%`

Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)

```
get 57220 prediction values
get 57220 prediction values
max_error:  0  total:  0
```

Reviewed By: hlu1

Differential Revision: D30609492

fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
2021-09-22 15:53:33 -07:00
Scott Wolchok
c0eb266c02 [Static runtime] Micro-optimization pass on GetLivenessMap (#65175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65175

More efficient use of map API, more efficient way to insert all pairs of inputs/outputs in liveness map
ghstack-source-id: 138547815

Test Plan: Time to enable static runtime down from ~8.7s to ~8.4s

Reviewed By: mikeiovine

Differential Revision: D30983897

fbshipit-source-id: fa6000bfd0fa0adfcd7c5922199ee32ada8c430e
2021-09-21 10:52:08 -07:00
Mike Iovine
99e4ab5d44 [Static Runtime] Implement and enable variadic tuple unpack (#64934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934

Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from:
```
%0, %1 = prim::TupleUnpack(%a)
%2, %3 = prim::TupleUnpack(%b)
```
into:
```
%0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b)
```

The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about.
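
What the fused op produces can be sketched as a simple concatenation. `Tuple` is a placeholder alias for this example (the real op works on tuples of `IValue`s), and `varTupleUnpack` is a hypothetical free function rather than the registered operator.

```cpp
#include <cassert>
#include <vector>

using Tuple = std::vector<int>;  // placeholder for a tuple of IValues

// Concatenates the elements of every input tuple, in order, into a single
// output list, mirroring what one variadic unpack node would produce.
std::vector<int> varTupleUnpack(const std::vector<Tuple>& tuples) {
  std::vector<int> outputs;
  for (const Tuple& t : tuples) {
    outputs.insert(outputs.end(), t.begin(), t.end());
  }
  return outputs;
}
```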

Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack`

Reviewed By: d1jang

Differential Revision: D30872109

fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9
2021-09-20 10:36:11 -07:00
Don Jang
7f8d622d70 [Static Runtime] Add perf metrics for number of managed tensors & unmanaged values (#64992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64992

This change lets Static Runtime print out number of managed tensors & unmanaged values as performance metrics during profile runs.

We will use/enhance these metrics to guide the effort of managing output tensors.

Test Plan:
Confirmed that a profile run prints out the added metric values on inline_cvr nets:
```
(inline_cvr/local)
...
Total number of managed tensors: 2754
Total number of unmanaged values: 3240
...
(inline_cvr/local_ro)
Total number of managed tensors: 1554
Total number of unmanaged values: 2966
...
(inline_cvr/remote_ro)
Total number of managed tensors: 1439
Total number of unmanaged values: 28
...
```

Reviewed By: hlu1

Differential Revision: D30926617

fbshipit-source-id: b86e071003ac941b9663db103eaa7c614466b4e0
2021-09-18 11:26:37 -07:00
Don Jang
ae00075ac7 [Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123

This change re-lands D30883290 (0e11454d19), which was reverted. D30883290 (0e11454d19) broke the OSS build because it implicitly removed the default move constructor of `StaticRuntime`.

```
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57   return torch::jit::StaticRuntime(*smod);
Sep 15 15:39:57          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57   std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57                                  ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57       unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57       ^
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57    auto sr = getStaticRuntime();
Sep 15 15:39:57         ^    ~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57   std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57                                  ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57       unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57       ^
Sep 15 15:39:57 2 errors generated.
```

This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine).
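
A minimal sketch of this kind of fix. It assumes the move constructor was suppressed by a user-declared destructor, which the summary does not state; the `Runtime` and `MemoryPlanner` types here are simplified stand-ins.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct MemoryPlanner {
  int managed = 0;
};

class Runtime {
 public:
  explicit Runtime(int managed) : planner_(new MemoryPlanner()) {
    planner_->managed = managed;
  }
  // Assumption for this sketch: a user-declared destructor is what
  // suppresses the implicit move constructor.
  ~Runtime() = default;
  // Request the move constructor explicitly. The copy constructor remains
  // deleted because std::unique_ptr is move-only.
  Runtime(Runtime&&) = default;

  int managed() const { return planner_ ? planner_->managed : -1; }

 private:
  std::unique_ptr<MemoryPlanner> planner_;
};

// Returning by value now has a usable move constructor to fall back on.
Runtime makeRuntime() {
  return Runtime(3);
}
```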

Original Summary:

This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.

`MemoryPlanner` performs an independent sub-task: statically analyzing a graph, creating a memory plan, and allocating/deallocating managed Tensors.

This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.

Test Plan: - Confirm that OSS build went well (See External Tests section).

Reviewed By: mikeiovine

Differential Revision: D30983292

fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116
2021-09-17 13:32:01 -07:00
Don Jang
8241193d76 [Static Runtime] Introduce static_runtime::dict_unpack (#64771)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64771

Test Plan:
- Added `StaticRuntime.RemoveImmutableInputDictLookupsWithImmutableInputDict`
- Added `StaticRuntime.RemoveImmutableInputDictLookupsWithMutableInputDict`
- TBD: Perf impact measurement

Reviewed By: mikeiovine

Differential Revision: D30685083

fbshipit-source-id: 050a92ef3b3ed0fdc0ab7a13a4b5dbfede9342a9
2021-09-16 23:25:13 -07:00
Scott Wolchok
f69cf3cf2f [Static Runtime] Use FastSet instead of std::set everywhere (#65114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65114

There doesn't seem to be any reason to use std::set for sets of pointers, right?
ghstack-source-id: 138198504

Reviewed By: hlu1

Differential Revision: D30978450

fbshipit-source-id: 4599c6249fda3a89959f839d3bf6400c5891f82c
2021-09-15 21:44:54 -07:00
Natalia Gimelshein
ec1af11c2e Revert D30883290: [Static Runtime] Move MemoryPlanner out into memory_planner.cpp
Test Plan: revert-hammer

Differential Revision:
D30883290 (0e11454d19)

Original commit changeset: a37570f8d943

fbshipit-source-id: 65c57a2b0d2e3c7006765195dd519e8cf2472f72
2021-09-15 15:40:34 -07:00
Don Jang
0e11454d19 [Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65011

This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.

`MemoryPlanner` performs an independent sub-task: statically analyzing a graph, creating a memory plan, and allocating/deallocating managed Tensors.

This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.

Test Plan: N/A

Reviewed By: mikeiovine

Differential Revision: D30883290

fbshipit-source-id: a37570f8d9430224a6987d2190bcf81cf875043d
2021-09-15 12:57:39 -07:00
Don Jang
3fb33b38b9 [Static Runtime] Check if outputs of a node do not overlap with each other (#63013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013

This change enhances the current memory overlapping check to include outputs: the enhancement enforces a constraint that all outputs of a node should NOT overlap with each other, since they are supposed to be updated by the node at the same time, each holding one of the node's outputs.

This check will detect a problem like T97393697 immediately in debug mode.
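
A minimal sketch of the kind of pairwise check described above, assuming each output can be summarized as a byte range. `ByteRange`, `rangesOverlap`, and `anyOutputsOverlap` are hypothetical names for illustration and are not the Static Runtime API:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the storage occupied by one output tensor.
struct ByteRange {
  std::uintptr_t begin; // address of the first byte
  std::size_t size;     // number of bytes; 0 means no storage is allocated
};

// True when two non-empty half-open ranges [begin, begin + size) intersect.
inline bool rangesOverlap(const ByteRange& a, const ByteRange& b) {
  if (a.size == 0 || b.size == 0) {
    return false; // an empty range cannot share bytes with anything
  }
  return a.begin < b.begin + b.size && b.begin < a.begin + a.size;
}

// True when any pair of outputs of a single node shares at least one byte.
inline bool anyOutputsOverlap(const std::vector<ByteRange>& outputs) {
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    for (std::size_t j = i + 1; j < outputs.size(); ++j) {
      if (rangesOverlap(outputs[i], outputs[j])) {
        return true;
      }
    }
  }
  return false;
}
```

The quadratic loop is acceptable here on the assumption that a node has only a handful of outputs and that the check runs in debug mode only.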

Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`

- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.

Reviewed By: hlu1

Differential Revision: D30211705

fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
2021-09-15 08:38:05 -07:00
Mike Iovine
369db8924f [Static Runtime] Add first iter metric (#64457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64457

The first iteration is special since it initializes the memory planner. This change logs and reports first iteration time during benchmarking. It also generates a FAI-PEP output when `generate_ai_pep_output` is set.

Test Plan:
Run any benchmark, and observe:
```
I0902 15:19:32.528977 2492358 impl.cpp:948] PyTorchObserver {"value":6.415958881378174,"unit":"ms","metric":"latency","type":"static_runtime_first_iter"}
...
First iter time: 6.41596 ms
```

Note that this metric is likely to have significantly more noise than the others since we don't have as many data points.
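
For readers who want to produce or parse such lines, here is a rough sketch of a formatter that mirrors the field layout of the log line in the test plan above. The helper name `formatLatencyObserverLine` is hypothetical, the field order is copied from that log line rather than from the FAI-PEP specification, and the `type` string is assumed to need no JSON escaping:

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper, not part of Static Runtime: builds one
// "PyTorchObserver {...}" line for a latency metric in milliseconds.
inline std::string formatLatencyObserverLine(
    const std::string& type,
    double value_ms) {
  std::ostringstream out;
  // Print enough digits for the double to round-trip.
  out << std::setprecision(std::numeric_limits<double>::max_digits10);
  out << "PyTorchObserver {\"value\":" << value_ms
      << ",\"unit\":\"ms\",\"metric\":\"latency\",\"type\":\"" << type
      << "\"}";
  return out.str();
}
```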

Unit tests: `buck test //caffe2/test:static_runtime`

Reviewed By: d1jang

Differential Revision: D30740619

fbshipit-source-id: 4dcfccd5629f4fa34254fd355073ef19e151245a
2021-09-07 15:00:30 -07:00
Mike Iovine
4aad366111 [Static Runtime] Make per-op latency readable by FAI-PEP (#64315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64315

Add a new flag `generate_ai_pep_output` to `StaticRuntime::benchmark`. If set, produces per-op-kind average total latency in milliseconds in a JSON format recognized by [Facebook AI performance evaluation platform (FAI-PEP)](https://github.com/facebook/FAI-PEP).

This is useful for observing the impact of changes that make a big difference for a specific op, but do not affect the overall SR latency by more than a few percent.

Reviewed By: hlu1

Differential Revision: D30679352

fbshipit-source-id: c847fa6ea20774aaf1e7949b11db4421d1f70b7e
2021-09-01 14:34:22 -07:00
Zhengxu Chen
ac99d63f83 [jit] Make operation call accept Stack& instead Stack* (#63414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414

The raw pointer is misused here, since the stack is never null.
ghstack-source-id: 136938318

Test Plan:
compiles.

Imported from OSS

Reviewed By: ejguan

Differential Revision: D30375410

fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
2021-08-30 11:49:20 -07:00
Mike Iovine
07c5cb8c48 [Static Runtime] Optimize memory planner initialization (#64101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64101

Checking `getOutOfPlaceOperation(n)` is a very expensive operation, especially in multithreaded environments, due to a lock acquisition when the NNC cache is queried. This slows down the memory planner initialization time, and by extension, the latency for the first static runtime inference.

There are two optimizations in this diff:
* Cache the result of `p_node->has_out_variant()` to avoid the call to `getOutOfPlaceOperation`. This speeds up calls to `canReuseInputOutputs`, which in turn speeds up `isOptimizableContainerType`
* Precompute all `isOptimizableContainerType` during static runtime initialization to avoid a pass over all of each node's inputs.
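
The first optimization is an instance of computing an expensive lookup once and caching the result on the node. A simplified sketch follows, in which `OutVariantRegistry` and `CachedNode` are hypothetical stand-ins for the real registry and for `ProcessedNode`, and a counter replaces the lock acquisition so that the saving is observable:

```cpp
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical stand-in for an expensive, lock-protected registry lookup.
class OutVariantRegistry {
 public:
  explicit OutVariantRegistry(std::unordered_set<std::string> ops)
      : ops_(std::move(ops)) {}

  // Each call counts as one "expensive" query.
  bool lookup(const std::string& op) const {
    ++num_lookups_;
    return ops_.count(op) > 0;
  }

  int numLookups() const {
    return num_lookups_;
  }

 private:
  std::unordered_set<std::string> ops_;
  mutable int num_lookups_ = 0;
};

// Queries the registry once at construction and caches the answer.
class CachedNode {
 public:
  CachedNode(const OutVariantRegistry& registry, const std::string& op)
      : has_out_variant_(registry.lookup(op)) {}

  // Cheap accessor: no registry access after construction.
  bool has_out_variant() const {
    return has_out_variant_;
  }

 private:
  bool has_out_variant_;
};
```

However many times `has_out_variant()` is called afterwards, the registry is queried exactly once per node.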

Test Plan: All unit tests pass: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: movefast1990

Differential Revision: D30595579

fbshipit-source-id: 70aaa7af9589c739c672788bf662f711731864f2
2021-08-27 17:40:43 -07:00
Don Jang
c90b3cb1da [Static Runtime] Manage temporary Tensors for aten::layer_norm (#64078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078

This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the static runtime manages the `tmp1` and `tmp2` Tensors.

Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
    at::Tensor mean = create_empty_from({M}, *X);
    at::Tensor rstd = create_empty_from({M}, *X);
```
that the static runtime misses an opportunity to manage.

This change puts them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime currently manages only output tensors.

Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.

- Confirmed that the new op gets activated during testing:

```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)

```

Reviewed By: hlu1

Differential Revision: D30486475

fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
2021-08-27 02:44:43 -07:00
Hao Lu
3c3bba4169 [Static Runtime] Use F14FastMap/F14FastSet (#63999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63999

Use folly::F14FastMap/F14FastSet instead of std::unordered_map/unordered_set in the Static Runtime code base. folly::F14FastMap/F14FastSet implement the same APIs as std::unordered_map/unordered_set but are faster. For details see https://github.com/facebook/folly/blob/master/folly/container/F14.md
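
Because the two families expose the same interface, one way to make such a switch is through alias templates, so that call sites do not name a concrete container. The sketch below is an illustration only: `FastMap`, `FastSet`, and `countUnique` are names chosen here, and `USE_FOLLY_F14` is a hypothetical build flag, not one taken from the PyTorch build. Without the flag it falls back to the standard containers, so it compiles without folly:

```cpp
#include <cstddef>
#include <string>
#include <vector>

#if defined(USE_FOLLY_F14) // hypothetical build flag
#include <folly/container/F14Map.h>
#include <folly/container/F14Set.h>
template <typename Key, typename Value>
using FastMap = folly::F14FastMap<Key, Value>;
template <typename Key>
using FastSet = folly::F14FastSet<Key>;
#else
#include <unordered_map>
#include <unordered_set>
template <typename Key, typename Value>
using FastMap = std::unordered_map<Key, Value>;
template <typename Key>
using FastSet = std::unordered_set<Key>;
#endif

// Call sites use only the API subset shared by both container families.
inline std::size_t countUnique(const std::vector<int>& values) {
  FastSet<int> seen;
  for (int v : values) {
    seen.insert(v);
  }
  return seen.size();
}
```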

Reviewed By: d1jang

Differential Revision: D30566149

fbshipit-source-id: 20a7fa2519e4dde96fb3fc61ef6c92bf6d759383
2021-08-27 01:40:41 -07:00
Mike Iovine
7774a4e95b [Static Runtime] Implement prim::VarStack out variant (#63579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579

Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`

Reviewed By: navahgar

Differential Revision: D30410525

fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
2021-08-24 09:44:29 -07:00
Mike Iovine
d96ef8c1b1 [Static Runtime] SR clones graph input (#63704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63704

Previously SR did not clone the graph. This was leading to subtle bugs in `testStaticRuntime`; static runtime would modify its graph, and the graph used by the JIT interpreter would change as well. The JIT interpreter would then crash if SR-only ops were added!

Cloning the graph is more consistent with the behavior of the `Module` ctor.

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D30463294

fbshipit-source-id: b771551a1f55f95fde79373b23babcf3e5ddf726
2021-08-23 18:45:41 -07:00
Mike Iovine
fc6dd0bc00 [JIT] Move UseVariadicCat internals (#63577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63577

Since other variadic ops will have an almost identical implementation, we can generalize the `UseVariadicCat` implementation and put it in a common folder.

Also moved some test utilities that other variadic op tests will likely need.

Test Plan: `buck test caffe2/test/cpp/jit:jit -- ConcatOptTest`

Reviewed By: navahgar

Differential Revision: D30409937

fbshipit-source-id: 925c11c27b58ce98cb8368d2a205e26ba66d3db9
2021-08-23 17:30:36 -07:00
Mike Iovine
779a3d47b0 [Static Runtime] Benchmark reports native nodes (#63346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63346

We have seen that we can get significant perf wins essentially for free by implementing native ops for ops that we cannot write out variants for (e.g. TupleUnpack D30306955 (078b8004a6), append D30326461 (9d9e7a8d72)). Therefore, whether or not SR is using a native implementation is valuable information. By capturing this in the benchmarking suite, we can hopefully avoid wasting time profiling/manually inspecting `native_ops.cpp`

Reviewed By: hlu1

Differential Revision: D30346752

fbshipit-source-id: 205b090513b6a5a6ce4cb92f75ab0395b15d08f9
2021-08-18 15:05:08 -07:00
Don Jang
075024b9a3 [Static Runtime] Fix a bug that assigns multiple outputs to single storage (#63012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63012

This change fixes a bug in which the static runtime's memory optimizer assigns multiple outputs of a node to the same storage. Fixing this bug enables the static runtime to run `inline_cvr` with its memory optimizer enabled.

A problematic line from `inline_cvr` was as follows:
```
  %7767 : Tensor, %getitem_6419.1 : Tensor = fb::gather_ranges(%tensor74.1, %7764)
```
where enabling the memory optimizer assigned `%7767` and `%getitem_6419.1` to the same storage, which corrupted their data during the 2nd iteration.

This change fixes the aforementioned bug by marking all inputs & outputs of a node as `alive` during our liveness analysis. By doing that, no inputs / outputs will collide with each other. I believe this is a fair assumption that most ops' implementations rely on, but it was missing from our analysis before this change.
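
A simplified sketch of the idea, not the actual Static Runtime analysis: every value read or written by a node is marked alive at that node's index, and two values may share storage only if their live ranges never intersect. `Node`, `LiveRange`, `computeLiveRanges`, and `canShareStorage` are hypothetical names, and values are plain integer ids:

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Hypothetical graph node: value ids it reads and value ids it writes.
struct Node {
  std::vector<int> inputs;
  std::vector<int> outputs;
};

// Closed interval [first, last] of node indices at which a value is alive.
using LiveRange = std::pair<int, int>;

// Marks every input and output of node i as alive at index i.
inline std::map<int, LiveRange> computeLiveRanges(
    const std::vector<Node>& nodes) {
  std::map<int, LiveRange> ranges;
  auto touch = [&ranges](int value, int index) {
    auto it = ranges.find(value);
    if (it == ranges.end()) {
      ranges.emplace(value, LiveRange{index, index});
    } else {
      it->second.first = std::min(it->second.first, index);
      it->second.second = std::max(it->second.second, index);
    }
  };
  for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
    for (int v : nodes[i].inputs) {
      touch(v, i);
    }
    for (int v : nodes[i].outputs) {
      touch(v, i);
    }
  }
  return ranges;
}

// Two values may share storage only if their live ranges are disjoint.
inline bool canShareStorage(
    const std::map<int, LiveRange>& ranges,
    int a,
    int b) {
  const LiveRange& ra = ranges.at(a);
  const LiveRange& rb = ranges.at(b);
  return ra.second < rb.first || rb.second < ra.first;
}
```

Because the intervals are closed, two outputs of the same node always intersect at that node's index, as do an input and an output of the same node, so neither pair can be assigned the same storage.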

Test Plan: - Added a unittest `StaticRuntime.ValuesShareSameStorageDoesNotContainOutputsFromSameNode` to cover the new code.

Reviewed By: hlu1

Differential Revision: D30202018

fbshipit-source-id: 10287a1bee9e86be16a5201e9a7cd7c7f046bab9
2021-08-16 16:52:02 -07:00
Hao Lu
aa63c0d9df [PyPer] Skip printing out per node time when do_profile is on (#63256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63256

This suppresses printing of the per-node time, which is very long when the net has many ops. It can easily be turned back on by setting `--pt_sr_print_per_node_time=1`.

Reviewed By: ajyu, mikeiovine

Differential Revision: D30298331

fbshipit-source-id: 32b3f93b3fe19d335654168311fda93331a1e706
2021-08-16 16:32:19 -07:00