Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58829
- Delete copying and moving of MemoryPlanner.
- Remove `inline` in some of the member functions because member functions implemented in classes are inline by default.
- Clean up and update comments.
- Reorganize some code.
Reviewed By: edvgha
Differential Revision: D28555476
fbshipit-source-id: 7ea8efc0e2ed93a6788a742470b9e753a85df677
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57972
Allow static runtime to be on when glow is on. This should be fine as long as glow AOT has already been run.
Test Plan: Test on the replayer with the remote_other net. D28291326 fixes the remaining issue of removing loops from the remote_other model. Need to test on the regenerated model.
Reviewed By: hlu1
Differential Revision: D28275514
fbshipit-source-id: ee78972660dfdc3fcfb9af2cf7ebb19ee745a4f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521
When an op is added to static runtime, we manually check the schema (not with the JIT schema check, but with IValue::isTensor()/isInt(), etc.) to make sure it's one we support. If the schema doesn't match, SR throws an exception via TORCH_CHECK, which makes the entire graph invalid for SR.
This diff makes an op with an unsupported schema use the fallback path and go through the dispatcher instead:
```
if (node->kind() != prim::ListConstruct &&
    node->kind() != prim::TupleConstruct &&
    node->kind() != prim::DictConstruct &&
    node->kind() != prim::ListUnpack) {
  const Operator& op = node->getOperator();
  TORCH_CHECK(op.hasOperation());
  op_ = op.getOperation(node);
  VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
}
```
The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg variants are supported), can now run in static runtime via the fallback mode.
(Note: this ignores all push blocking failures!)
Reviewed By: ajyu
Differential Revision: D27531447
fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812
fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split will have as many outputs as ListUnpack.
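A minimal sketch of the fusion precondition, assuming standard JIT IR APIs; the helper name and the exact check are illustrative, not the actual fusion pass:
```
#include <torch/csrc/jit/ir/ir.h>

// Illustrative only: fb::equally_split can absorb ListUnpack when its single
// list output feeds exactly one ListUnpack node; after fusion the split node
// produces the unpacked tensors directly.
bool canFuseWithListUnpack(torch::jit::Node* split) {
  torch::jit::Value* list_out = split->output();
  return list_out->uses().size() == 1 &&
      list_out->uses()[0].user->kind() == c10::prim::ListUnpack;
}
```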
Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27974999
fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56839
Enable check_for_memory_leak at the end of StaticRuntime::benchmark so this code is exercised more often.
Test Plan: Checked with adindexer merge net model
Reviewed By: edvgha
Differential Revision: D27417911
fbshipit-source-id: 5248942dc439fcc7301ffb0005da76374939fa96
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56565
fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split.
As a result, fb::equally_split will have as many outputs as ListUnpack.
Test Plan:
buck test caffe2/torch/fb/sparsenn:fb_operators_test
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27902824
fbshipit-source-id: 7855047c3bd46bbb74b7346ac384c70b6a3e1f46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447
MemoryPlanner shouldn't manage StorageImpls; instead, it should manage TensorImpls, because the StorageImpl inside a Tensor can change.
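For illustration, a minimal libtorch sketch (not the MemoryPlanner code) of why caching StorageImpl pointers is fragile: the TensorImpl is stable, but the StorageImpl it points to can be swapped out, e.g. by `set_()`:
```
#include <torch/torch.h>

int main() {
  at::Tensor t = torch::empty({4});
  auto* impl_before = t.unsafeGetTensorImpl();
  auto* storage_before = t.storage().unsafeGetStorageImpl();

  t.set_(torch::empty({8}));  // rebinds t to a different StorageImpl

  TORCH_CHECK(impl_before == t.unsafeGetTensorImpl());                // same TensorImpl
  TORCH_CHECK(storage_before != t.storage().unsafeGetStorageImpl());  // new StorageImpl
}
```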
Test Plan: CI
Reviewed By: ajyu
Differential Revision: D27840361
fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55811
- Added manage_graph_output_memory flag to opts (default false)
- Added checks for the flag dependencies among enable_out_variant, optimize_graph_output_memory, and optimize_memory (see the sketch after this list)
- Minor refactoring for readability
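A hedged sketch of the dependency check; the struct and the exact rules below are assumptions for illustration, not the actual StaticRuntime options code:
```
#include <c10/util/Exception.h>

// Illustrative only: memory optimizations only make sense when out variants
// are enabled, because only out-variant outputs are owned by the MemoryPlanner.
struct StaticRuntimeOptsSketch {
  bool enable_out_variant = true;
  bool optimize_memory = true;
  bool manage_graph_output_memory = false;  // new flag, default false
};

void validateOpts(const StaticRuntimeOptsSketch& opts) {
  TORCH_CHECK(
      !opts.optimize_memory || opts.enable_out_variant,
      "optimize_memory requires enable_out_variant");
  TORCH_CHECK(
      !opts.manage_graph_output_memory ||
          (opts.enable_out_variant && opts.optimize_memory),
      "manage_graph_output_memory requires enable_out_variant and optimize_memory");
}
```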
Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'
Reviewed By: hlu1
Differential Revision: D27573780
fbshipit-source-id: 28698657f686f27b8ad60e1276cdf17402d2cf91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438
The August 1x model has DictConstruct ops in the graph (P331168321).
These can easily be removed with a JIT pass, but to measure the improvement
and run the replayer with the model in the meantime, enable DictConstruct in static runtime.
Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```
```
TEST trace
Total num requests 100
Num exceptions 0
Latency us avg 180965
Latency us p25 89785
Latency us p50 131240
Latency us p75 146621
Latency us p90 158378
Latency us p95 166628
Latency us p99 1886680
Latency us p100 3803252
Server latency us avg 91554
Server latency us p25 51447
Server latency us p50 86371
Server latency us p75 95229
Server latency us p90 102706
Server latency us p95 116023
Server latency us p99 557017
Server latency us p100 716319
Num rankUnits avg 28
```
Reviewed By: hlu1
Differential Revision: D27236682
fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920
Fusing SigridTransforms + ListUnpack allows enabling the out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.
The speedup comes from three parts: 1) getting rid of memory allocation inside SigridTransforms itself, 2) the memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) getting rid of ListUnpack. However, for 3) we still need to pay the cost of constructing a `vector<Tensor>` for the outputs and a round of refcount bumps for all the output TensorImpls.
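For illustration, a small libtorch example (assumed, not the SigridTransforms code) of that leftover cost: packing outputs into a `vector<Tensor>` still costs one refcount bump per output, even though no tensor data is copied:
```
#include <torch/torch.h>
#include <vector>

int main() {
  at::Tensor out0 = torch::empty({1});
  TORCH_CHECK(out0.use_count() == 1);

  // Building the output vector copies the Tensor handle: one atomic refcount
  // bump per output, while the underlying storage is shared.
  std::vector<at::Tensor> outputs = {out0};
  TORCH_CHECK(out0.use_count() == 2);
}
```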
Reviewed By: ajyu
Differential Revision: D26220546
fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111
If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.
Reviewed By: edvgha
Differential Revision: D27036077
fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799
Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two-step graph rewrite process. If step 2 doesn't happen after step 1, we're stuck with a graph containing a `fb::placeholder` op that can't run. Step 3 is added to revert step 1, so the original graph is restored if any `fb::placeholder` op is left (see the sketch after this list).
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph and fusion can happen right afterwards.
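A hedged sketch of the step-3 revert, assuming the standard `torch::jit::SubgraphRewriter` utility; the pattern strings and argument counts are illustrative, not the real op signatures:
```
#include <torch/csrc/jit/passes/subgraph_rewrite.h>

void revertPlaceholders(std::shared_ptr<torch::jit::Graph>& graph) {
  // If step 2 never fired, rewrite any leftover fb::placeholder node back to
  // the original op so the graph remains runnable.
  torch::jit::SubgraphRewriter revert;
  revert.RegisterRewritePattern(
      R"IR(
        graph(%a, %b):
          %out = fb::placeholder(%a, %b)
          return (%out))IR",
      R"IR(
        graph(%a, %b):
          %out = fb::ClipRangesGatherRangesX2SigridHash(%a, %b)
          return (%out))IR");
  revert.runOnGraph(graph);
}
```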
Reviewed By: ajyu
Differential Revision: D26920008
fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564
Constructor logic was spread across InferenceModule and StaticRuntime. This diff unifies the two. After a lot of discussion on D25961626, it became apparent that `clone` is uglier than a cheap StaticRuntime.
This means StaticRuntime is effectively StaticModule and the only code in the new StaticRuntime is the `run` functions.
```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```
Changelist:
- Rename InferenceModule to StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D25592967
fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684
With alias analysis we get much more powerful registration, and we can start removing "native" and fallback interpreted implementations. `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations. Ideally, every node will run out of place every time; as far as I know, there's never a reason to disable it, and we may want to remove that functionality.
This diff does introduce a "leak" in the memory management: containers are not cleaned up. This only happens when out variants are enabled.
Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled
Reviewed By: maratsubkhankulov, hlu1
Differential Revision: D26515801
fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060
Aliasing is currently mishandled in SR.
This diff fixes that issue entirely and allows us to avoid hard-coded "view" registration. I'll remove the macro in a follow-up diff.
However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias." Some care will need to be taken to actually make sure this is enforced going forward.
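For illustration, a small libtorch example of the "sometimes alias" behavior that the new assumption rules out: `contiguous()` returns an alias when its input is already contiguous and a fresh copy otherwise:
```
#include <torch/torch.h>

int main() {
  at::Tensor x = torch::rand({2, 3});

  // Already contiguous: contiguous() is a no-op and aliases x.
  TORCH_CHECK(x.contiguous().is_alias_of(x));

  // Transposed (non-contiguous) input: contiguous() materializes a copy.
  TORCH_CHECK(!x.t().contiguous().is_alias_of(x));
}
```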
Benchmarks with this diff:
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04
$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: eellison
Differential Revision: D25581156
fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52237
Redo D26331506 (4c58be4573). Get rid of `nodiscard` which broke OSS CI.
- Clean up references of outputs, including Tuples/Lists, by using move semantics (see the sketch below)
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for corner case of Tuple/List element being inputs.
- Modify unit tests to check for use_counts of outputs
- Clean up dead code. A bit of overlap with D25592967, but it shouldn't be a problem.
This diff does not try to fix the alias problem with the MemoryPlanner.
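A minimal sketch (illustrative, not the actual StaticRuntime code) of what the move-semantics cleanup buys: moving an output IValue out of the runtime's slot hands the reference to the caller instead of leaving an extra owner behind:
```
#include <torch/torch.h>

int main() {
  at::Tensor t = torch::rand({2, 2});
  c10::IValue slot = t;               // the runtime's internal slot: use_count == 2
  TORCH_CHECK(t.use_count() == 2);

  c10::IValue copied = slot;          // copying out would leave three owners
  TORCH_CHECK(t.use_count() == 3);
  copied = c10::IValue();             // drop the copy again

  c10::IValue output = std::move(slot);  // moving transfers the slot's reference
  TORCH_CHECK(t.use_count() == 2);       // no stale owner kept inside the runtime
}
```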
Reviewed By: swolchok
Differential Revision: D26432539
fbshipit-source-id: e08990e4066c1ce69ad5274860851d012b7be411
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51991
- Clean up references of outputs, including Tuples/Lists, by using move semantics
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for corner case of Tuple/List element being inputs.
- Modify unit tests to check for use_counts of outputs
- Clean up dead code. A bit of overlap with D25592967, but it shouldn't be a problem.
This diff does not try to fix the alias problem with the MemoryPlanner.
(Note: this ignores all push blocking failures!)
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```
Reviewed By: bwasti
Differential Revision: D26333953
fbshipit-source-id: cadc0595ad6ab754c4f1f7a5a3733b2c16b3102f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51952
StaticRuntime should not hold owning refs of inputs after inference is finished. This diff adds a pass to clean them up and unit tests to enforce the check.
Will clean up output tensors in separate diffs.
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```
Reviewed By: bwasti
Differential Revision: D26331506
fbshipit-source-id: d395a295ada9de3033d0ea05d1dbab62d879a03b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug in the MemoryPlanner with regard to view ops with out variants.
```
def forward(self, a: Tensor, shape: List[int]):
    b = a.reshape(shape)
    return b + b
```
In this case, if we replace reshape with its out variant, b would be managed by the MemoryPlanner, and its storage would be set to nullptr right after inference if opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.
To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.
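A minimal libtorch illustration of the aliasing hazard (not the MemoryPlanner code itself): because `b` shares `a`'s storage, freeing `b`'s storage after inference would also invalidate `a`:
```
#include <torch/torch.h>

int main() {
  at::Tensor a = torch::arange(6);
  at::Tensor b = a.reshape({2, 3});  // a true view here: no copy is made

  // b aliases a, so the MemoryPlanner must leave b unmanaged; resetting b's
  // storage would silently invalidate the caller-owned input a.
  TORCH_CHECK(b.is_alias_of(a));
}
```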
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249
- Add out variants for reshape and flatten. reshape and flatten only create tensor views when they can; in cases where they can't, they do a copy. The out variant reuses the TensorImpl in both cases; the difference is that the TensorImpl is a view in the first case but a normal TensorImpl in the second (see the sketch after this list).
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage; however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.
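A short libtorch illustration (assumed, not the registry code) of the view-vs-copy distinction the registry has to account for:
```
#include <torch/torch.h>

int main() {
  at::Tensor x = torch::arange(6).reshape({2, 3});

  at::Tensor view = x.reshape({3, 2});   // contiguous input: a view, shares storage
  TORCH_CHECK(view.is_alias_of(x));

  at::Tensor copy = x.t().reshape({6});  // transposed input: cannot be a view
  TORCH_CHECK(!copy.is_alias_of(x));     // so the planner manages this output
}
```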
Reviewed By: ajyu
Differential Revision: D25992202
fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50050
Every node will now own its outputs.
I don't expect any big perf improvements from this diff; the only eliminated code is from deallocate_registers.
Largely, this is to enable more optimizations going forward.
Test Plan:
buck test mode/dev //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/test:static_runtime
Reviewed By: hlu1
Differential Revision: D25571181
fbshipit-source-id: 91fcfbd5cd968af963ba89c45656997650ca6d18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49340
This refines the fusion group to include only certain types of operations. We cannot safely handle "canRunNatively" types, and the memonger pass causes regressions on some internal models, so it was disabled (to be revisited with proper memory optimization once Tensor pools are implemented).
Test Plan:
```
buck test mode/no-gpu caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ZolotukhinM
Differential Revision: D25520105
fbshipit-source-id: add61d103e4f8b4615f5402e760893ef759a60a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49199
This should save us an extra round of dispatch for resize_,
resize_as_, detach_, and copy_, at the cost of disabling profiling and
tracing. I'm told that static runtime has its own per-op profiling and
we don't need tracing.
ghstack-source-id: 118348314
Test Plan:
Code review to confirm lack of need for profiling &
tracing, and that there isn't a different switch we should be using
instead.
Internal benchmarks -- seeing 11-12% improvement in overall runtime
Reviewed By: hlu1
Differential Revision: D25476819
fbshipit-source-id: 71e2c919b386b25c41084e2e4a54fe765a4f8f22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48616
This adds a couple of _out variants and registers them in the registry.
I also added the concept of "canReuse{Input,Output}" so that we can annotate tensors that are not optimizable (specifically, non-float tensors).
In the future we can change this (with D25062301).
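For context, a generic ATen illustration (not one of the ops added here) of what an _out variant changes: the functional form allocates a fresh output on every call, while the out form writes into a caller-provided tensor that a memory planner can reuse:
```
#include <torch/torch.h>

int main() {
  at::Tensor a = torch::rand({4});
  at::Tensor b = torch::rand({4});

  at::Tensor fresh = at::add(a, b);    // functional: allocates a new output each call

  at::Tensor out = torch::empty({4});  // preallocated once, e.g. by a memory planner
  at::add_out(out, a, b);              // out variant: writes in place, no new allocation
}
```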
After removing `RecordFunction`, we see these results:
```
BS=20
---
caffe2: 0.651617 ~ 0.666354
static runtime: 0.753481
pytorch: 0.866658
BS=1
---
caffe2: 0.0858684 ~ 0.08633
static runtime: 0.209897
pytorch: 0.232694
```
Test Plan: standard internal test of ads model against caffe2 reference (see the scripts in this quip: https://fb.quip.com/ztERAYjuzdlr)
Reviewed By: hlu1
Differential Revision: D25066823
fbshipit-source-id: 25ca181c62209a4c4304f7fe73832b13e314df80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48861
`std::function` already has an empty state; no need to wrap
it in `c10::Optional`.
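A tiny standalone illustration of the point (not the static runtime code): `std::function`'s own empty state already covers what the optional was used for.
```
#include <functional>
#include <iostream>

int main() {
  std::function<void()> fn;   // default-constructed: empty, no c10::optional needed
  if (!fn) {
    std::cout << "no op registered\n";
  }

  fn = [] { std::cout << "running op\n"; };
  if (fn) {
    fn();                     // the bool conversion is the "has value" check
  }
}
```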
ghstack-source-id: 117891382
Reviewed By: hlu1
Differential Revision: D25296912
fbshipit-source-id: 8291bcf11735d49db17415b5de915591ee65f781