Commit Graph

157 Commits

Donald Dong
d9f3feb5a2 [SR] Use std::vector::reserve for StaticModule constants (#68834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68834

This diff uses std::vector::reserve for constructing constants in StaticModule. We can also avoid two extra iterations over all the graph nodes.

This diff should make StaticModule construction slightly faster.
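As a rough illustration of the idiom (names and types here are hypothetical, not the actual StaticModule code), the point is to size the vector once up front instead of growing it repeatedly:

```
#include <cstddef>
#include <vector>

struct Node { bool is_constant; int payload; };  // stand-in for a graph node

// Illustrative only: reserve once, then push_back with no reallocations.
std::vector<int> collectConstants(const std::vector<Node>& nodes, std::size_t num_constants) {
  std::vector<int> constants;
  constants.reserve(num_constants);  // single allocation instead of repeated growth + copies
  for (const auto& n : nodes) {
    if (n.is_constant) {
      constants.push_back(n.payload);
    }
  }
  return constants;
}
```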

Test Plan: - [x] buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1

Reviewed By: mikeiovine

Differential Revision: D32628806

fbshipit-source-id: 99dd2a7a36e86899ca1fe5300f3aa90d30a43726
2021-11-23 18:00:04 -08:00
Mike Iovine
ee4cfaa286 [SR] Add utility class to determine tensor ranges (#68284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284

Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`.

Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: swolchok

Differential Revision: D32397207

fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6
2021-11-19 13:10:55 -08:00
Scott Wolchok
ced57eb490 [PyTorch][Static Runtime] Delete incorrect alias analysis code (#67075)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67075

Sharing storage when `mayAlias` holds is incorrect, as the old comment notes; sharing when `mustAlias` holds would be nice but, as the new comment notes, would not matter.
ghstack-source-id: 143749553

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31851893

fbshipit-source-id: 5bdc8de984d5919332c9010e8b0160211d96bc2f
2021-11-18 22:34:50 -08:00
Don Jang
aa9ee8d02a [Static Runtime] Avoid copying function objects per StaticRuntime instance (#68368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368

Currently, each instance of `StaticRuntime` has its own copy of `std::function` object wrapped in `ProcessedNode::Function` object, in order to invoke actual operation implementation.

However, all instances of `StaticRuntime` derived from the same `StaticModule` object invoke exactly the same op implementations, so this duplication is avoidable.

This change adds a `StaticModule::functions_` member variable to keep a list of unique `ProcessedFunction` instances. A newly constructed `StaticRuntime` takes pointers to these `ProcessedFunction` objects instead of copying the whole function objects. This can save a substantial amount of memory per `StaticRuntime` instance.

This comes with a small sacrifice in execution time: since a `ProcessedNode` instance now keeps a pointer to the function object, executing a node involves an extra pointer dereference. However, this cost proved to be negligible in local performance tests.

Thanks to hlu1 for proposing this non-intrusive improvement idea :D
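A rough sketch of the sharing pattern described above, with hypothetical simplified types (the real StaticModule/StaticRuntime members differ):

```
#include <functional>
#include <memory>
#include <vector>

struct ProcessedNode;
using ProcessedFunction = std::function<void(ProcessedNode*)>;

// Owned once per model, built when the module is loaded.
struct StaticModuleSketch {
  std::vector<std::unique_ptr<ProcessedFunction>> functions_;
};

// Each runtime instance stores only 8-byte pointers into the module's list
// instead of copying every std::function (32+ bytes each, plus any captured state).
struct StaticRuntimeSketch {
  std::vector<const ProcessedFunction*> node_fns_;

  explicit StaticRuntimeSketch(const StaticModuleSketch& m) {
    node_fns_.reserve(m.functions_.size());
    for (const auto& fn : m.functions_) {
      node_fns_.push_back(fn.get());  // one extra dereference at call time, per the note above
    }
  }
};
```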

Test Plan:
This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) (patched D32181666 to print the memory turnover from instantiating a StaticRuntime instance) for CMF/local ( & 8% for CMF/local_ro). No noticeable latency regression was observed.

==AFTER

* CMF/local
memory turnover: 393608
latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087

* CMF/local_ro
memory turnover:387288
latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101

==BEFORE

* CMF/local
memory turnover: 459888
latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18

* CMF/local_ro
memory turnover: 420832
latency: PyTorch run finished. Milliseconds per iter: 7.43756. Iters per second: 134.453

==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr:

==AFTER

Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)

Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)

==BEFORE

Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)

Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)

Reviewed By: swolchok

Differential Revision: D32337548

fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a
2021-11-16 20:28:48 -08:00
Hao Lu
75ccb07b26 [SR] LOG->VLOG (#68477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68477

We're printing a lot of unnecessary logs in prod. Change these from LOG(INFO) to VLOG(1) so you can easily flip them back for testing.
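The change amounts to the following glog-level swap (illustrative function; with glog, VLOG(1) messages are off by default and can be re-enabled with --v=1):

```
#include <glog/logging.h>

void logMemoryStats(long total_managed_bytes) {
  // Before: always printed in prod.
  // LOG(INFO) << "Total memory managed: " << total_managed_bytes << " bytes";

  // After: only printed when verbose logging is turned on (e.g. --v=1 or GLOG_v=1).
  VLOG(1) << "Total memory managed: " << total_managed_bytes << " bytes";
}
```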

Test Plan: CI

Reviewed By: ajyu, d1jang

Differential Revision: D32439776

fbshipit-source-id: 40fa57f4eeb6ca0b610008062cc94aed62fb6981
2021-11-16 17:09:52 -08:00
Scott Wolchok
10e9d80ad1 [PyTorch][Static Runtime] Don't track scalar ivalues (#67702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67702

This isn't a particularly large optimization and it does
nothing before select_tensor is introduced (I'm surprised that no
operators have optimizable outputs!), but it seems like we should probably get the savings.
ghstack-source-id: 143424918

Test Plan: CI; checked `--do_profile=1` output with the following diff, and we save tracking hundreds of values, as expected.

Reviewed By: hlu1

Differential Revision: D32112522

fbshipit-source-id: 1804b77992a73670bfc1e36af608b852b8261bd2
2021-11-16 11:05:42 -08:00
Scott Wolchok
639258499f [PyTorch][Static Runtime] Add & use "small array" for ProcessedNodeInputs (#67935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935

Rationale should be documented in code comments. In short, we
can avoid heap-allocating arrays of input indexes for operators with 5
or fewer inputs, at the cost of a tag bit check on access.
ghstack-source-id: 143429112
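A simplified sketch of the idea, with hypothetical names; the real code packs the discriminant into a pointer tag bit, whereas an explicit flag keeps this sketch short:

```
#include <cstdint>

class SmallInputIndexArray {
 public:
  explicit SmallInputIndexArray(uint16_t size) : size_(size), is_inline_(size <= kInlineCapacity) {
    if (is_inline_) {
      for (uint16_t i = 0; i < kInlineCapacity; ++i) repr_.inline_storage[i] = 0;
    } else {
      repr_.heap = new uint16_t[size]();  // only 6+ inputs pay for a heap allocation
    }
  }
  ~SmallInputIndexArray() {
    if (!is_inline_) delete[] repr_.heap;
  }
  SmallInputIndexArray(const SmallInputIndexArray&) = delete;
  SmallInputIndexArray& operator=(const SmallInputIndexArray&) = delete;

  // Access pays one extra check (the "tag bit check" above), but small input lists
  // avoid both the allocation and the pointer chase.
  uint16_t& operator[](uint16_t i) { return is_inline_ ? repr_.inline_storage[i] : repr_.heap[i]; }
  uint16_t size() const { return size_; }

 private:
  static constexpr uint16_t kInlineCapacity = 5;
  union {
    uint16_t inline_storage[kInlineCapacity];  // shares space with the heap pointer
    uint16_t* heap;
  } repr_;
  uint16_t size_;
  bool is_inline_;
};
```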

Test Plan:
Patched d1jang's D32181666, which prints static runtime memory usage.

Previous diff, local:

```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```

This diff, local:

```
I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064
```
4.5% savings (16144 bytes)

Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff.

Reviewed By: hlu1

Differential Revision: D32216573

fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c
2021-11-16 10:21:12 -08:00
Scott Wolchok
6acde23bec [PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934

This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode.
ghstack-source-id: 143429113
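A hedged sketch of the layout (hypothetical names): each node keeps a 2-byte offset and a 2-byte count into one array shared by the whole runtime, which caps a model at 2**16 - 1 values but shrinks every node:

```
#include <cstdint>
#include <vector>

struct IValueStub {};  // stand-in for c10::IValue

// All nodes' outputs live back-to-back in one shared array owned by the runtime.
struct ProcessedNodeSketch {
  uint16_t outputs_offset;  // where this node's outputs start in the shared array
  uint16_t num_outputs;     // 4 bytes total, vs. 24 bytes for a std::vector per node

  IValueStub& output(std::vector<IValueStub>& shared_values, uint16_t i) const {
    return shared_values[outputs_offset + i];
  }
};
```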

Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.

Previous diff, CMF local:

```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```

This diff, CMF local:

```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
72912 bytes (17%) savings

Perf looks neutral; see next diff (D32216573) test plan for details.

Reviewed By: hlu1

Differential Revision: D32190751

fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
2021-11-16 10:19:50 -08:00
Scott Wolchok
8678472ec8 [PyTorch][Static Runtime] Save 2 pointers in ProcessedNode (#67860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67860

We don't need 8-byte sizes for inputs and outputs, and we only need op names if profiling isn't disabled.
ghstack-source-id: 143429111

Test Plan:
Ran CMF local & local_ro with recordio inputs. I'm calling
the result inconclusive/neutral because I saw some noise (as you'll
see below), but that's fine with me since this is a clear memory win.

```
Nov4Stable, local_ro
========================================
I1104 09:53:08.875444 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.19925. Iters per second: 833.851
I1104 09:53:10.200443 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.1996. Iters per second: 833.608
I1104 09:53:11.524045 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.19746. Iters per second: 835.103
I1104 09:53:12.851861 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20479. Iters per second: 830.019
I1104 09:53:14.183387 505783 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20487. Iters per second: 829.964
I1104 09:53:14.183427 505783 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.2012, standard deviation: 0.00341762

re-ran stable in light of baffling regression (see next entry), and sure enough we still have some significant run-to-run-variation:

I1104 09:56:15.244969 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24956. Iters per second: 800.28
I1104 09:56:16.621292 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24776. Iters per second: 801.437
I1104 09:56:18.018808 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25247. Iters per second: 798.42
I1104 09:56:19.399660 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25054. Iters per second: 799.656
I1104 09:56:20.781828 524012 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.25052. Iters per second: 799.664
I1104 09:56:20.781878 524012 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.25017, standard deviation: 0.00171396

Nov4SaveTwoWordsInProcessedNode, local_ro
========================================
I1104 09:53:42.070139 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.2411. Iters per second: 805.736
I1104 09:53:43.438390 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24102. Iters per second: 805.788
I1104 09:53:44.773303 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.20682. Iters per second: 828.621
I1104 09:53:46.110538 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.21216. Iters per second: 824.973
I1104 09:53:47.448279 508309 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.21265. Iters per second: 824.639
I1104 09:53:47.448334 508309 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22275, standard deviation: 0.0168698

early runs look like a glitch, rerunning

I1104 09:54:20.999117 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24558. Iters per second: 802.841
I1104 09:54:22.376780 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24436. Iters per second: 803.623
I1104 09:54:23.738584 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23176. Iters per second: 811.845
I1104 09:54:25.113063 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.24938. Iters per second: 800.395
I1104 09:54:26.476349 511022 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23552. Iters per second: 809.377
I1104 09:54:26.476395 511022 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.24132, standard deviation: 0.00737197

Nov4Stable, local
========================================

I1104 09:57:56.854537 533814 PyTorchPredictorBenchLib.cpp:346] memory turnover after getPredictor: 177885632
I1104 09:58:02.829813 533814 PrepareModelInputs.cpp:190] Loaded 696 records.
I1104 09:58:03.010681 533814 PyTorchPredictorBenchLib.cpp:353] memory turnover before benchmarking: 4590507056
I1104 09:58:03.010710 533814 PyTorchPredictorBenchLib.cpp:154] PyTorch predictor: number of prediction threads 1
I1104 09:58:58.839010 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0567. Iters per second: 49.8586
I1104 09:59:54.797755 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.1007. Iters per second: 49.7494
I1104 10:00:50.696525 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0657. Iters per second: 49.8363
I1104 10:01:46.514736 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0696. Iters per second: 49.8265
I1104 10:02:42.378270 533814 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 20.0641. Iters per second: 49.8402
I1104 10:02:42.378316 533814 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 20.0714, standard deviation: 0.0170605
I1104 10:02:42.378325 533814 PyTorchPredictorBenchLib.cpp:366] memory turnover after benchmarking: 4591882400

Nov4SaveTwoWordsInProcessedNode, local
========================================
I1104 10:38:15.543320 733514 PyTorchPredictorBenchLib.cpp:346] memory turnover after getPredictor: 177721792
I1104 10:38:21.224673 733514 PrepareModelInputs.cpp:190] Loaded 696 records.
I1104 10:38:21.382973 733514 PyTorchPredictorBenchLib.cpp:353] memory turnover before benchmarking: 4590343216
I1104 10:38:21.382992 733514 PyTorchPredictorBenchLib.cpp:154] PyTorch predictor: number of prediction threads 1
I1104 10:39:17.005359 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9498. Iters per second: 50.1257
I1104 10:40:12.545269 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9279. Iters per second: 50.1808
I1104 10:41:08.138119 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.999. Iters per second: 50.0026
I1104 10:42:03.686841 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.9115. Iters per second: 50.2222
I1104 10:42:55.137498 733539 Proxy2Connection.cpp:343] Received NotRegisteredException from Configerator Proxy2.
I1104 10:42:55.138715 733539 ReadOnlyConnectionIf.h:91] Mark connection as healthy.
I1104 10:42:55.384534 733514 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6297. Iters per second: 50.9433
I1104 10:42:55.384579 733514 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.8836, standard deviation: 0.14571
I1104 10:42:55.384588 733514 PyTorchPredictorBenchLib.cpp:366] memory turnover after benchmarking: 4591711760
```

Reviewed By: d1jang

Differential Revision: D32177531

fbshipit-source-id: 267e38a151d2dbab34fd648135d173cfbee1c22e
2021-11-16 10:12:53 -08:00
Mike Iovine
9ed49449b3 [SR] Add net level record functions (#68091)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68091

Add record functions for recording perf stats on the entire network.

Note that this is backed by the same pre-sampling mechanism as the op record functions, so net level stats get logged relatively infrequently. (If this is not acceptable, we can stop using pre-sampling at the cost of a small amount of perf: every inference would then require an RNG call.)

Reviewed By: hlu1

Differential Revision: D32296756

fbshipit-source-id: 09ff16c942f3bfc8f4435d6cca2be4a6b8dc6091
2021-11-15 15:39:08 -08:00
Andres Suarez
0bd0a67c4f [lint][fbcode/caffe2] CLANGFORMAT
Test Plan:
Proof of coverage:

```
$ hg files fbcode/caffe2 |
  arc linttool debugfilterpaths --take CLANGFORMAT --path-match-only > ~/before.txt

$ hg up this_diff

$ hg files fbcode/caffe2 |
  arc linttool debugfilterpaths --take CLANGFORMAT --path-match-only > ~/after.txt

$ comm -3 ~/before.txt ~/after.txt | pastry
P467377980: https://www.internalfb.com/intern/paste/P467377980/
```

These files lost coverage:

- `fbcode/caffe2/torch/abi-check.cpp`
- `fbcode/caffe2/torch/custom_class.h`
- `fbcode/caffe2/torch/custom_class_detail.h`
- `fbcode/caffe2/torch/deploy.h`
- `fbcode/caffe2/torch/extension.h`
- `fbcode/caffe2/torch/library.h`
- `fbcode/caffe2/torch/script.h`

Everything else in P467377980 gained coverage.

Reviewed By: suo

Differential Revision: D32364856

fbshipit-source-id: 9b3ba3350ecdb50038412a24af5e0da0fe4d69b8
2021-11-12 05:12:39 -08:00
Don Jang
9cb65df79f [Static Runtime] Fallback to disabling manage_output_tensors instead of crashing when wrong API is used (#67939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939

With `manage_output_tensor` enabled, a client of `StaticRuntime` is required to call it via `PyTorchPredictor::predict_managed_result`. If the client instead uses `PyTorchPredictor::operator()`, it will crash (intended behavior, so that the memory of managed output tensors is not leaked). Such a mistake (introduced by a gatekeeper, config change, etc.) could cause a catastrophic failure in production.

Considering the complexity in how `PyTorchPredictor` is used in different settings, the chances that this bug hits production are non-zero.

This change introduces `StaticRuntime::disableManageOutputTensor` to disable the `manage_output_tensor` feature, instead of crashing, when a client mistakenly uses `PyTorchPredictor::operator()`. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can safely return non-managed output tensors to the client.

A slight perf degradation is expected from forcefully disabling `manage_output_tensors`, but the robustness gain outweighs the risk of a catastrophic failure from crashing at a high rate.

Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code.

Reviewed By: swolchok

Differential Revision: D32219731

fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf
2021-11-11 17:31:07 -08:00
Hao Lu
47bc47f2b9 [SR] Add runtime check to correct bad schema alias info (#67825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825

The comment explains how it works.

Test Plan:
A small regression to local and local_ro if we only enable it for fallback ops.
```
## local_ro
# before
I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247
I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319
I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759
I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619

# after
# # only enable for fall back ops: 0.7%
I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481
I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262
I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51
I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424

# enable for all (no clone): 4.7%
I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005
I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14
I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. Iters per second: 781.506
I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765

# local
# before
I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233
I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641
I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735
I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206
# after
# # only enable for fall back ops: 0.1%
I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349
I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601
I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659
I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658

# enable for all (no clone): 0.9%
I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783
I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611
I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. Iters per second: 50.3037
I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498
```

Reviewed By: d1jang

Differential Revision: D32124812

fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa
2021-11-10 19:35:11 -08:00
Mike Iovine
5e19fb61fd [SR] Release reference to JIT module if possible (#67911)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67911

If we can remove `self` from the graph inputs, there is no need for `StaticModule` to hold onto its `Module` reference anymore.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D32190755

fbshipit-source-id: 9c4649a63b6e68c7d2e47395a23572985d2babb1
2021-11-09 10:35:31 -08:00
Mike Iovine
5bc89275dd [SR] Eliminate no-ops (#67437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437

Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this.
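A hedged sketch of what a pass like this can look like on a TorchScript graph; the function name and the single-op check are illustrative, not necessarily what the diff adds:

```
#include <torch/csrc/jit/ir/ir.h>

#include <memory>
#include <vector>

// Rewire every use of a no-op's output to its first input, then delete the node.
void eliminateNoOpsSketch(const std::shared_ptr<torch::jit::Graph>& graph) {
  std::vector<torch::jit::Node*> to_remove;
  for (torch::jit::Node* node : graph->nodes()) {
    if (node->kind() == c10::Symbol::fromQualString("aten::detach")) {
      node->output()->replaceAllUsesWith(node->input(0));
      to_remove.push_back(node);  // don't destroy while iterating
    }
  }
  for (torch::jit::Node* node : to_remove) {
    node->destroy();
  }
}
```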

Test Plan: `buck test caffe2/test:jit -- test_freezing`

Reviewed By: hlu1

Differential Revision: D31980843

fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0
2021-11-08 08:42:33 -08:00
Scott Wolchok
b0c05297f9 [Static Runtime] Arena allocate StorageImpls for managed tensors (#66130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66130

We're reusing backing storage for these tensors, which is only safe because they have non-overlapping lifetimes. Accordingly, it seems that they can also share their StorageImpl.

ghstack-source-id: 142427752
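Not the actual MemoryPlanner change, but a minimal sketch of the arena idea under stand-in types: one buffer holds all the impl objects, constructed in place, so tensors that already share backing storage also get co-located metadata:

```
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

struct StorageImplStub {  // stand-in for c10::StorageImpl
  void* data = nullptr;
  std::size_t nbytes = 0;
};

std::vector<StorageImplStub*> makeArenaOfImpls(std::size_t n, std::unique_ptr<std::byte[]>& arena) {
  arena = std::make_unique<std::byte[]>(n * sizeof(StorageImplStub));
  std::vector<StorageImplStub*> impls;
  impls.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    // Placement-new each impl into the shared buffer instead of a separate heap allocation.
    impls.push_back(new (arena.get() + i * sizeof(StorageImplStub)) StorageImplStub());
  }
  return impls;
}
```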

Test Plan:
benchmarked ctr_mobile_feed local and local_ro:

Using recordio inputs for model 302008423_0

```
swolchok@devbig032 ~/f/fbcode> env MKL_NUM_THREADS=1 OMP_NUM_THREADS=1  > environment^C
swolchok@devbig032 ~/f/fbcode> sudo ~/fbsource2/fbcode/scripts/bertrand/noise/denoise-env.sh \
                                 /tmp/ptvsc2_predictor_benchNov1ArenaAllocateStorageImpls \
                               --scripted_model=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.predictor.disagg.local \
                               --method_name=local.forward --pt_cleanup_activations=1 \
                               --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=2 --warmup_iters=2 \
                                      --num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 --repetitions=5 --recordio_use_ivalue_format=1 --recordio_inputs=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.local.inputs.recordio

Stable
========================================
I1101 14:19:16.473964 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0131. Iters per second: 49.9673
I1101 14:20:12.193130 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0155. Iters per second: 49.9612
I1101 14:21:07.761898 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9751. Iters per second: 50.0624
I1101 14:22:03.218066 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9104. Iters per second: 50.2249
I1101 14:22:58.723256 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.956. Iters per second: 50.1102
I1101 14:22:58.723306 2748837 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.974, standard deviation: 0.043643

ArenaAllocateStorageImpls
========================================
I1101 14:08:57.070914 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9771. Iters per second: 50.0572
I1101 14:09:52.605121 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.924. Iters per second: 50.1907
I1101 14:10:48.098287 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9353. Iters per second: 50.1624
I1101 14:11:43.645395 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9723. Iters per second: 50.0694
I1101 14:12:39.171636 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9673. Iters per second: 50.0819
I1101 14:12:39.171685 2695478 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.9552, standard deviation: 0.0239318

difference: 0.0188 (0.09%), which is less than 1 standard deviation

Stable, local_ro
========================================
I1101 14:26:10.796161 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25991. Iters per second: 793.708
I1101 14:26:12.194727 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26862. Iters per second: 788.26
I1101 14:26:13.591312 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26549. Iters per second: 790.207
I1101 14:26:14.982439 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25943. Iters per second: 794.01
I1101 14:26:16.377033 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25995. Iters per second: 793.68
I1101 14:26:16.377094 2787930 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.26268, standard deviation: 0.00414788

ArenaAllocateStorageImpls, local_ro
========================================
I1101 14:26:45.875073 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20987. Iters per second: 826.536
I1101 14:26:47.207271 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20827. Iters per second: 827.633
I1101 14:26:48.533766 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20023. Iters per second: 833.174
I1101 14:26:49.850610 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19206. Iters per second: 838.884
I1101 14:26:51.172356 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19958. Iters per second: 833.622
I1101 14:26:51.172411 2790009 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.202, standard deviation: 0.00722754

Difference: 0.06 ms/iter (4.8%), which is much more than 1 standard deviation

```

we can see that this is a large relative improvement on local_ro, but no effect on local.

Reviewed By: hlu1

Differential Revision: D31357486

fbshipit-source-id: 229c003677da76e89c659d0e0639002accced76e
2021-11-04 15:43:39 -07:00
Mike Iovine
0eaa01ead1 [SR] Add EliminateTrivialEquallySplit graph pass (#67166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67166

This optimization is not really the same thing as `FuseListUnpack`, and mixing the logic in that pass is confusing and error-prone. It should really be its own pass.

It's slower since we have to do another pass over the graph, but this is not perf critical code; readability is more important.

Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D31887458

fbshipit-source-id: 289e281d512435861fccfe19f017751ad015688c
2021-11-03 12:57:05 -07:00
Scott Wolchok
510336499b [PyTorch][Static Runtime] Separate overlap checks for easier debugging (#66637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66637

We can give more information when verify_no_memory_overlap would fail by separating the DCHECK.
ghstack-source-id: 142226105

Test Plan: fitsships

Reviewed By: d1jang

Differential Revision: D31517151

fbshipit-source-id: 8cbc324c27f6b4db4489d1bd469d37b1d8ae6ce1
2021-11-02 23:59:04 -07:00
Don Jang
e86a5a3a1a [Static Runtime] Add PyTorchPredictor::predict_managed_result to return managed output tensors (#65598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65598

This change adds `PyTorchPredictor::predict_managed_result` to enable Static Runtime to return managed output tensors, allocated and owned by Static Runtime to accelerate inference workloads.

- `PyTorchPredictor::predict_managed_result` only does meaningful work in the overridden `PyTorchStaticRuntimePredictor::predict_managed_result`. For other subclasses, it returns a simple object that just wraps the returned `IValue`.

- When `manage_output_tensors` is enabled, a `StaticRuntime` cannot be reentered until its return value gets deallocated by calling `StaticRuntime::deallocateOutputTensors`. Currently an instance of `StaticRuntime` gets immediately pushed back to `static_runtime_pool` to be reentered again, and this cannot be done when `manage_output_tensors` is enabled. `PyTorchStaticRuntimePredictorManagedResult` makes sure to delay pushing a `StaticRuntime` instance back to the pool until after `StaticRuntime::deallocateOutputTensors` is called on the runtime instance.

- When `manage_output_tensors` is enabled, `PyTorchStaticRuntimePredictor::predict_managed_result` returns the prediction result, whose backing memory is managed by an instance of `StaticRuntime`. The lifetime of any value reachable from `PyTorchStaticRuntimePredictorManagedResult.get()` is expected to end before `PyTorchStaticRuntimePredictorManagedResult` gets destructed. As explained above, `PyTorchPredictorManagedResult`'s destruction pushes the runtime instance that returned the result back to `static_runtime_pool` to be reused again.

- The current API design of adding `predict_managed_result` instead of forcing `operator()` to return `PyTorchPredictorManagedResult` was motivated by the fact that `manage_output_tensors` will be selectively enabled just for a few models. In case `manage_output_tensors` becomes a commonly used feature we should revisit this API design to merge them together.

Reviewed By: hlu1

Differential Revision: D31149323

fbshipit-source-id: 5ca026188077232d6a49a46759124a978439d7b2
2021-11-02 22:10:26 -07:00
Mike Iovine
2644725937 [SR] Migrate gather_ranges_to_dense to new FuseListUnpack (#67164)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67164

Migrated both the variadic and non-variadic versions.

This diff is part of the effort to migrate all ops used in `FuseListUnpack` to `FuseListUnpackV2`. The original version of `FuseListUnpack` is problematic for a few reasons:

* You have to complicate the op implementation with an `is_fused` check, resulting in messier code. It is easier to reason about two ops, fused (out variant) and unfused (native).
* The original version of `FuseListUnpack` is buggy. It assumes that the `ListUnpack` node occurs immediately after the fusion candidate, which is not necessarily true.

This diff finishes the migration, so the original version of `FuseListUnpack` is removed.

Test Plan:
Unit tests: `buck test caffe2/benchmarks/static_runtime/...`

**Accuracy Test**
Done at the top of this diff stack.

Reviewed By: hlu1

Differential Revision: D31887386

fbshipit-source-id: 9d44c813667a75bce13dce62bf98e6109edea6ba
2021-11-02 11:04:59 -07:00
Scott Wolchok
82f7f8d471 [PyTorch] Adopt IValue::toTupleRef() where obvious (#65505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65505

Generated with

`fastmod -m 'toTuple\(\)(\s*)->' 'toTupleRef()${1}.'`

, followed by

`fastmod '(std::move\(.*)toTupleRef\(\).' '${1}toTuple()->'`

to unbreak 2 callsites.
ghstack-source-id: 142065835

Test Plan: CI

Reviewed By: gchanan

Differential Revision: D31131025

fbshipit-source-id: 54457ae5bbeb38db9c7f196d469b98521c3d3f34
2021-11-02 10:22:18 -07:00
Scott Wolchok
7cd62621fb [PyTorch] Adopt faster Tuple::create (#65381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65381

The previous diff adds a way to make Tuples of size 3 or less
more efficiently. This diff makes it easier to hit that path and
updates a bunch of callsites to hit it.
ghstack-source-id: 142065832

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D31069538

fbshipit-source-id: d04da3709594ed68ab1c0a1471f8cffd8d001628
2021-11-02 10:10:31 -07:00
Mike Iovine
0d7cf825fc [SR] Drop support for aten::__is__ and aten::__isnot__ (#67550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67550

`aten::__is__` and `aten::__isnot__` are extremely problematic for a large number of SR graph optimizations.

Some examples:

- Removing ops that are no-ops in the forward pass like `aten::detach`. This would normally be trivial, but `is` introduces corner cases like this:
```
def forward(x):
    y = x.detach()
    return x is y
```
We get `False` before optimizations. But after optimizations, the test becomes `x is x`, and we get `True`.

- `ReplaceWithCopy`: the pass that replaces ops like `aten::to` with an out variant that copies its input. The following graph returns `True` before optimizations, but `False` afterwards
```
def forward(x):
    y = x.to(x.dtype)
    return x is y
```

- And many more, `FuseListUnpack` can break too

Since the ops are not used by 99.99% of users, rejecting them so we don't have to think about this is not a big deal.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D32022584

fbshipit-source-id: d135938edb2299c9b8f9511afac2bf568578879e
2021-11-01 04:45:14 -07:00
Don Jang
ad89d994c9 [Static Runtime] Support recordio format input for benchmark (#67530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67530

Currently `ptvsc2_predictor_bench` only uses the first input of a given recordio file even when the recordio file contains many inputs.

This change extends `StaticRuntime::benchmark` to accept multiple input entries so that we can benchmark more extensively and realistically using all the inputs in the recordio file.

Test Plan:
Tested `ptvsc2_predictor_bench` with / without this change executing the following command:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/home/djang/ads/adfinder/ctr_mobilefeed/302008423/302008423_0.predictor.disagg.local  --recordio_inputs=/home/djang/ads/adfinder/ctr_mobilefeed/302008423/302008423.local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=0 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Reviewed By: hlu1

Differential Revision: D31947382

fbshipit-source-id: 4188271613aad201f8cad5f566e0dfed26680968
2021-10-29 14:38:14 -07:00
Scott Wolchok
9f01937caf [PyTorch][easy] Deduplicate memory planner creation code (#67265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67265

Avoid repeating this initialization code.
ghstack-source-id: 141585971

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31933368

fbshipit-source-id: 6342ae9bb82c4d152a427bad142470c3d162de69
2021-10-28 14:13:43 -07:00
Mike Iovine
8363da3f92 [SR][C2][easy] Benchmarks report # of ops (#67436)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67436

This information is useful for comparing static runtime to c2

Reviewed By: d1jang

Differential Revision: D31991571

fbshipit-source-id: eb83bc4564b05d56fb9a550863eea3f6312f3f6c
2021-10-28 13:03:09 -07:00
Mike Iovine
72e25c9f4e [Static Runtime][DI] Add variadic grouped_accessor_op (#66289)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66289

Add a variadic version of `grouped_accessor_op` to eliminate list construction overhead and associated refcount bumps in static runtime.

Test Plan:
Accuracy test with model 294738512_40: passes with 0 errors.
Accuracy test with model 296213501_65 (has V2 op): passes with 0 errors.

**Perf impact**

TW replayer test w/ 800 QPS (stacked with D31620408) shows ~5% CPU decrease for storage tier.
Results:

{F673610665}

Reviewed By: hlu1

Differential Revision: D31482816

fbshipit-source-id: 14393da122cefd094c3e4f423beb897c1d17b32c
2021-10-27 12:29:33 -07:00
Scott Wolchok
6ce14e7b51 [PyTorch][Static Runtime] Cleanup: add valueVecFromFastSet (#66996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66996

We do this conversion a few times, and further diffs (which I'm trying to keep as small as possible) will do it more.
ghstack-source-id: 141496817

Test Plan: CI

Reviewed By: mikeiovine

Differential Revision: D31821037

fbshipit-source-id: 1d3b54cadaedd53189aec6a35ed1a126c6fe4824
2021-10-26 14:47:15 -07:00
Mike Iovine
83355f9537 [SR][easy] Alias for c10::Symbol::fromQualString (#67162)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67162

It's a bit annoying/ugly to type `c10::Symbol::fromQualString` everywhere, and we can't do `using c10::Symbol::fromQualString` since it's a static class function.
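A plausible shape for such an alias (hedged; the exact helper introduced by the diff may be named or scoped differently):

```
#include <ATen/core/interned_strings.h>

#include <string>

namespace torch::jit {

// `using c10::Symbol::fromQualString;` is ill-formed for a static member function,
// so a thin free-function wrapper provides the shorthand instead.
inline c10::Symbol fromQualString(const std::string& name) {
  return c10::Symbol::fromQualString(name);
}

} // namespace torch::jit
```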

Test Plan: CI

Reviewed By: d1jang

Differential Revision: D31887042

fbshipit-source-id: 073a56c72281c20284a9feef741aed96b58a921d
2021-10-26 06:09:17 -07:00
Hao Lu
0c1b7545b6 [Static Runtime] Add more debug info to verify_no_memory_overlap() (#67206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67206

The memory overlap check still checks the memory overlap for alias ops. It only skips the check for inplace ops. This needs to be fixed if we want to use the memory overlap check in prod.

This diff only adds more debug info. It doesn't fix the aforementioned problem.

Reviewed By: d1jang

Differential Revision: D31889866

fbshipit-source-id: 05a80ace3d404f66f21a8bbdc9678485ff76c8d3
2021-10-26 01:48:41 -07:00
Mike Iovine
a0495b3cdb [SR] Remove unused operator() overload (#67001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001

The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.

This new overload will probably take a forwarding reference, so to avoid problems with overloading on a forwarding reference and to simplify the interface, it's best to remove this unused one.

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

`buck test caffe2/test:static_runtime`

Reviewed By: hlu1

Differential Revision: D31821990

fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
2021-10-25 08:18:58 -07:00
Mike Iovine
364645cd9d [SR] Factor operator() implementation into separate function (#67125)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67125

Using explicit template instantiations in D31659973 (f2582a59d0) was a bad idea. The problem is that the lvalue instantiation was for a `const` vector of `IValue`, meaning that if you tried to pass SR a non-const vector of arguments, the linker would fail to find the symbol.

The reason we didn't catch this in D31659973 (f2582a59d0) is that predictor always passes a `const` reference anyway. But we should fix this to prevent unexpected problems in the future.
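A self-contained illustration of the failure mode (stand-in types, not the SR code itself):

```
#include <vector>

struct IValue {};  // stand-in for c10::IValue

template <class IValueList>
void run(IValueList&& args);  // definition lives in a .cpp, as the old operator() did

// Suppose the .cpp only provides this explicit instantiation (const lvalue reference):
//   template void run<const std::vector<IValue>&>(const std::vector<IValue>&);

void caller() {
  std::vector<IValue> args;
  run(args);  // deduces IValueList = std::vector<IValue>& (non-const): never instantiated,
              // so the linker cannot find the symbol
}
```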

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D31873406

fbshipit-source-id: 5ab5a03334bed925cec11facadcedf9bec9b90ad
2021-10-25 08:17:40 -07:00
Mike Iovine
f2582a59d0 [SR] Add rvalue overload for operator() (#66648)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648

Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:

1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```

The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
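A hedged sketch of the forwarding described above (member and function names are approximate):

```
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct IValue {};  // stand-in for c10::IValue

struct RuntimeSketch {
  void set_inputs(const std::vector<IValue>& args) { inputs_ = args; }        // copies (refcount bumps)
  void set_inputs(std::vector<IValue>&& args) { inputs_ = std::move(args); }  // moves (no bumps)

  template <class IValueList>
  IValue operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs) {
    (void)kwargs;
    // std::forward preserves the caller's value category, so lvalue callers hit the
    // copying overload and rvalue callers hit the moving overload of set_inputs.
    set_inputs(std::forward<IValueList>(args));
    return run();
  }

  IValue run() { return IValue{}; }

  std::vector<IValue> inputs_;
};
```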

Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.

Reviewed By: hlu1

Differential Revision: D31659973

fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
2021-10-22 10:51:47 -07:00
Mike Iovine
391eb1dbe3 [JIT] UseVariadicOp handles multiple lists (#66288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66288

This change makes it so `UseVariadicOp` can transform ops with many Tensor list inputs.

Input pattern:
```
%output : Type = op(%list_1, %arg_1, %list_2, %list_3)
```
Output pattern:
```
%output : Type = variadic_op(%list_11, ..., %list_1N, %arg_1, %list_21, ..., %list_2M, %list_31, ..., %list_3K, N, M, K)
```
The length of each list is passed at the end of the variadic op so that the op implementation can process the inputs appropriately. This also frees us from needing to update `hasVarArgs` in static runtime each time we add a variadic op.

This diff also makes `UseVariadicOp` more robust. Before, `list_idx` was passed as an argument. Now, `VariadicUpdater` determines `list_idx` from the node's schema.

Test Plan:
Existing variadic ops do not break:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31450811

fbshipit-source-id: 808fcc3ae8940b9e602586f38f8cf9154c9a6462
2021-10-22 10:22:33 -07:00
Don Jang
051ea5ccbf [Static Runtime] Bundle function & function_kind to carry them together (#66974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66974

`D31591785 (67e003f09b)` started carrying a function object to be executed and `FunctionKind` for the type of the function *separately*, and this caused a bug fixed by D31783028 (79803b199f).

This change bundles them together again, as swolchok had done before, to reduce the chance of such a mistake in the future. They always need to be carried together since `FunctionKind` identifies the type of the function object.

Note that `struct Function` is a POD type, so accessing its fields (first, second) shouldn't cause extra overhead in `ProcessedNode::run()`.

Test Plan:
Confirmed that the managed memory metrics remain the same before/after this diff on inline_cvr:

```
#AFTER
# inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
# inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)
# inline_cvr/remote_ro
First iter time: 12.0344 ms
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```

```
#BEFORE
#  inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

#inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)

#inline_cvr_remote_ro
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```

Reviewed By: mikeiovine

Differential Revision: D31798419

fbshipit-source-id: fd4301b6731e402be0820729654735c791511aba
2021-10-22 08:57:49 -07:00
Mike Iovine
ab1e4eac42 [Static Runtime] Add FuseListUnpackV2 (#66509)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66509

Like `FuseListUnpack`, but instead of adding arguments to the fused node's outputs, inserts a new fused op.

By using a new fused op, we can avoid runtime `is_fused` checks. This will make the op implementations significantly cleaner. Eventually, we will migrate all ops to `V2` and delete the old pass.

`FuseListUnpackV2` also fixes the bug described in T103159043.
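Conceptually the rewrite turns a pattern like this (op and value names are illustrative, not the exact ones from the pass):

```
%list : Tensor[] = fb::some_op(%inputs)
%a : Tensor, %b : Tensor = prim::ListUnpack(%list)
```

into a dedicated fused op that produces the unpacked outputs directly, with no `is_fused` branch left in the original op's implementation:

```
%a : Tensor, %b : Tensor = static_runtime::some_op_fused(%inputs)
```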

Test Plan: I've made some changes to D31550307 locally and verified that everything works.

Reviewed By: hlu1

Differential Revision: D31492017

fbshipit-source-id: 4f90fcbc17e4c70a3d65985bee836fabf868a22c
2021-10-20 16:39:32 -07:00
Don Jang
67e003f09b [Static Runtime] Determine function for ProcessedNode::run() statically (#66692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66692

Currently `ProcessedNode::run()` performs 2 dynamic dispatches to decide which function implementation to execute, depending on whether the function is an out variant, a native function, or an interpreter fallback. Note that this happens dynamically every time Static Runtime executes an operation.

This change makes *that* same decision once, at module load time, so that we can remove one dynamic dispatch cost at runtime.
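A hedged sketch of the difference (hypothetical member names; the real ProcessedNode layout differs):

```
#include <functional>

struct ProcessedNodeSketch;
using Fn = std::function<void(ProcessedNodeSketch*)>;

enum class FunctionKind { kOutVariant, kNativeFunction, kInterpreterFallback };

struct ProcessedNodeSketch {
  // Before: pick the implementation on every run().
  FunctionKind kind;
  Fn out_variant, native_fn, fallback;
  void run_dispatching() {
    switch (kind) {
      case FunctionKind::kOutVariant:          out_variant(this); break;
      case FunctionKind::kNativeFunction:      native_fn(this);   break;
      case FunctionKind::kInterpreterFallback: fallback(this);    break;
    }
  }

  // After: the choice is made once at load time and stored; run() is a single indirect call.
  Fn fn;  // set to the right implementation when the node is created
  void run_preselected() { fn(this); }
};
```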

**size reduction**

Saving 4 bytes per `ProcessedNode`.

- Before: sizeof(c10::variant<OutVariant, NativeFunction, Operation>):40

- After: sizeof(std::function<void(ProcessedNode*)>): 32 + sizeof(FunctionKind):4 = 36

**latency optimization**

Expected to remove 2 memory loads & 1 conditional jump per `ProcessedNode::run()` execution (needs to be confirmed from compiled binary code).

Ran `ptvsc2_predictor_bench` with `inline_cvr` with 1000 iterations:
- local : 7.56026 -> 7.24794
- local_ro: 1.5799 -> 1.55504
- remote_ro: 10.6464 -> 10.3017

Test Plan: Ran existing unittests

Reviewed By: swolchok

Differential Revision: D31591785

fbshipit-source-id: 5de83ca386af509381e08ecedf071ee4e9f0f0b0
2021-10-15 14:07:24 -07:00
Scott Wolchok
e88d1c4f10 [PyTorch] Add tuple inline storage (#64066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64066

I noticed a bunch of time being spent heap-allocating Tuples
in the unpickler. 1-, 2-, and 3-element Tuples are apparently common
enough that they get their own bytecode instructions, so I decided to
try also giving them their own representation. We store up to 3
IValues inline in `Tuple` rather than doing a second heap allocation
for a `std::vector<IValue>`.
ghstack-source-id: 140695395
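A simplified sketch of the small-buffer idea with a stand-in element type; the real `TupleElements` overlays the two representations and handles moves, iteration, and conversion back to a vector:

```
#include <cstdint>
#include <utility>
#include <vector>

struct IValueStub {};  // stand-in for c10::IValue

class TupleElementsSketch {
 public:
  explicit TupleElementsSketch(std::vector<IValueStub> elems) {
    if (elems.size() <= kInlineCapacity) {
      size_ = static_cast<uint32_t>(elems.size());
      for (uint32_t i = 0; i < size_; ++i) inline_[i] = elems[i];  // no second heap allocation
    } else {
      size_ = kOnHeap;
      heap_ = std::move(elems);  // rare case: 4+ elements fall back to a vector
    }
  }

  IValueStub& operator[](uint32_t i) { return size_ == kOnHeap ? heap_[i] : inline_[i]; }

 private:
  static constexpr uint32_t kInlineCapacity = 3;
  static constexpr uint32_t kOnHeap = UINT32_MAX;
  uint32_t size_ = 0;
  IValueStub inline_[kInlineCapacity];
  std::vector<IValueStub> heap_;
};
```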

Test Plan:
Added automated tests for TupleElements.

Pixel 3 before: https://www.internalfb.com/intern/aibench/details/761596366576284
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/591414145082422
We went from 347 ms to 302 ms.

Reviewed By: dhruvbird

Differential Revision: D30592622

fbshipit-source-id: 93625c54c9dca5f765ef6d5c191944179cb281a8
2021-10-15 12:16:51 -07:00
Hao Lu
6310eb30d1 [SR] Clean up GetLivenessMap (#66606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66606

- Remove dead code (see comment for where)
- Add debug prints
- Small reorganization of the code to improve readability

Reviewed By: d1jang

Differential Revision: D31568219

fbshipit-source-id: 50240c325bf4fd012e1947ac931bb67c6f5dfafb
2021-10-13 23:55:40 -07:00
Hao Lu
6634570aef [SR] Fix bug in ValueGroup (#66470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66470

Reviewed By: d1jang

Differential Revision: D31566348

fbshipit-source-id: e0f634af77d893bbc8d66f214b2b8bdd6ab58cc3
2021-10-13 19:26:38 -07:00
Scott Wolchok
d30397d42a [PyTorch][Static Runtime] Don't use vector in ProcessedNode (#65429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429

The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
ghstack-source-id: 140532722

Test Plan:
CI

I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.

Reviewed By: hlu1

Differential Revision: D31096042

fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
2021-10-13 19:13:20 -07:00
Don Jang
736fa09a9a [Static Runtime] Manage output tensors (#65515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515

This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:

- At the creation of `StaticModule`, it gathers a set of candidate output tensors (& their aliases) to manage. This is done by the `ValueGroup` introduced in the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensor objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple).
- The 2nd+ iterations preallocate a slab of memory for all output tensors identified during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them (see the usage sketch after this list). This mandates that SR cannot be reentered until `deallocateOutputTensors` is called by the client.
- In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camelCase for function names, so all newly introduced functions use camelCase despite inconsistencies with snake_case. We can gradually fix the inconsistencies.
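A hedged usage sketch of this protocol (construction and option plumbing are omitted; `deallocateOutputTensors` is the call the client must not forget):

```
#include <torch/csrc/jit/runtime/static/impl.h>

#include <vector>

// Assumes `runtime` was built from a StaticModule with output-tensor management enabled.
void runOnce(torch::jit::StaticRuntime& runtime, const std::vector<c10::IValue>& args) {
  c10::IValue out = runtime(args, {});

  // ... consume `out` here; the output tensors' backing memory is owned by `runtime` ...

  // Must be called before this runtime instance is reentered (or handed back to a pool);
  // if the client forgets, the next invocation throws instead of silently leaking.
  runtime.deallocateOutputTensors();
}
```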

This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_predictor_bench` as a testbed.

Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.

- Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing test cases (e.g., `StaticRuntime.EmbeddingBag`).

Reviewed By: hlu1

Differential Revision: D31049221

fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
2021-10-11 09:50:54 -07:00
Scott Wolchok
5a67ffe0ad [PyTorch][Static Runtime] Combine ProcessedNode::{native_,}fn_ (#65414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65414

Saves 24 bytes (`sizeof(std::function) - 8`) per ProcessedNode.
ghstack-source-id: 139999909

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D31085561

fbshipit-source-id: 70734b8319e805736ba41aedaaf7fa3d463400c9
2021-10-08 18:11:59 -07:00
Scott Wolchok
3ef69a4598 [static runtime] Pre-allocate hash tables (#65343)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65343

No reason not to save a bit on re-hashing.
ghstack-source-id: 140052518

Test Plan:
CI

Static runtime startup seems to go from 5.9-6.0s to 5.8s-6.0s, perf shows less time spent rehashing

Reviewed By: mikeiovine

Differential Revision: D31027362

fbshipit-source-id: 39dd53ecd462693b518535856ddd92df78a4977b
2021-10-08 10:28:13 -07:00
Don Jang
416f593080 [Static Runtime] Group graph nodes into input aliases & output aliases (#65517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517

This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:

- input_aliases:  values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are internally created and consumed within the graph.

`output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance.

Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.

Reviewed By: hlu1

Differential Revision: D30940955

fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
2021-10-07 14:35:12 -07:00
Mike Iovine
057a01556c [Static Runtime] Do not use variadic_sigrid_transforms_torch_bind if out variant is disabled (#66221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66221

JIT doesn't have an implementation for this op, so we can only use it when out variants are enabled.

Reviewed By: hlu1

Differential Revision: D31445887

fbshipit-source-id: 4565ac4df751d8ee4052647574c43efa05ea1452
2021-10-07 06:57:17 -07:00
Mike Iovine
a5e6b2b2e3 [Static Runtime] Add variadic sigrid_transforms_torch_bind (#63960)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63960

Reviewed By: hlu1

Differential Revision: D30529880

fbshipit-source-id: 1c4be2f9c0944bbe1e1c146989588c96bfd14eda
2021-10-05 16:00:36 -07:00
Hao Lu
a6ad2b41ac [Static Runtime] Make module_ optional in StaticModule (#65882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65882

`torch::jit::Module` is refcounted. There is no need to wrap it in a `shared_ptr`.

Test Plan:
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: mikeiovine

Differential Revision: D31012222

fbshipit-source-id: 74d234bd85423e5ba0e396f24899631354a2c74b
2021-09-30 22:48:49 -07:00
Don Jang
4176afc4a0 [Static Runtime] Disable SigridTransform + ListUnpack fusion when outputs reachable from graph output (#62697)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62697

Reviewed By: hlu1

Differential Revision: D29979402

fbshipit-source-id: 913e8396a0530ce3617211112a2b1147ef2e9df9
2021-09-29 22:47:48 -07:00
Mike Iovine
b003b2a9c0 [Static Runtime] Add record functions (#64698)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64698

Reviewed By: hlu1

Differential Revision: D30747191

fbshipit-source-id: 7ded6ea9bd36b5e3343d1efa9f3c92e02ff6d7f8
2021-09-24 07:20:17 -07:00