Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438
The August 1x model has DictConstruct in the graph (P331168321).
These can easily be removed with a JIT pass, but to measure the improvement
and run the replayer with the model in the meantime, enable DictConstruct in Static Runtime.
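For context, here is a minimal sketch of what handling `prim::DictConstruct` natively amounts to at run time; the ProcessedNode plumbing and the actual Static Runtime registration are omitted, and the free-function shape and names are illustrative.
```
#include <ATen/core/ivalue.h>
#include <vector>

// A DictConstruct node takes alternating (key, value) inputs and produces a
// dict IValue; key_type/value_type come from the node's output DictType in
// the real implementation.
c10::IValue dict_construct(
    const std::vector<c10::IValue>& inputs,
    const c10::TypePtr& key_type,
    const c10::TypePtr& value_type) {
  auto dict = c10::impl::GenericDict(key_type, value_type);
  dict.reserve(inputs.size() / 2);
  for (size_t i = 0; i + 1 < inputs.size(); i += 2) {
    dict.insert(inputs[i], inputs[i + 1]);
  }
  return dict;
}
```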
Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```
```
TEST trace
Total num requests 100
Num exceptions 0
Latency us avg 180965
Latency us p25 89785
Latency us p50 131240
Latency us p75 146621
Latency us p90 158378
Latency us p95 166628
Latency us p99 1886680
Latency us p100 3803252
Server latency us avg 91554
Server latency us p25 51447
Server latency us p50 86371
Server latency us p75 95229
Server latency us p90 102706
Server latency us p95 116023
Server latency us p99 557017
Server latency us p100 716319
Num rankUnits avg 28
```
Reviewed By: hlu1
Differential Revision: D27236682
fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920
Fusing SigridTransforms + ListUnpack allows for enabling out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.
The speedup comes from three parts: 1) getting rid of memory allocation inside SigridTransforms itself, 2) avoiding the memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) getting rid of ListUnpack. However, for 3) we still pay the cost of constructing a `vector<Tensor>` for the outputs and a round of refcount bumps for all the output TensorImpls.
Reviewed By: ajyu
Differential Revision: D26220546
fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111
If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.
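A minimal sketch of this gating, assuming the `ReplaceWithCopy` pass declared in the Static Runtime passes header; the wrapper function name is illustrative.
```
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/runtime/static/passes.h>

// Only rewrite ops with their copy variants when out variants are enabled;
// then no default (non-out) implementation has to be registered for them.
void maybe_replace_with_copy(
    std::shared_ptr<torch::jit::Graph>& graph,
    bool enable_out_variant) {
  if (enable_out_variant) {
    torch::jit::ReplaceWithCopy(graph);
  }
}
```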
Reviewed By: edvgha
Differential Revision: D27036077
fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799
Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two-step graph rewrite process. If step 2 doesn't happen after step 1, we're stuck with a graph containing an `fb::placeholder` op that can't run. Step 3 is added to revert step 1, so we restore the original graph if any `fb::placeholder` op is left.
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph so that fusion can happen right afterwards, as sketched below.
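A hedged sketch of that pass ordering; the wrapper function name is illustrative and the fusion call itself is only indicated in a comment.
```
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/passes/constant_pooling.h>
#include <torch/csrc/jit/passes/constant_propagation.h>

void prepare_for_fusion(std::shared_ptr<torch::jit::Graph>& graph) {
  // After splitting SigridHash, fold and dedup constants so all Constant
  // nodes end up at the front of the graph ...
  torch::jit::ConstantPropagation(graph);
  torch::jit::ConstantPooling(graph);
  // ... which lets the ClipRangesGatherRangesX2SigridHashPrecompute pattern
  // match right afterwards (the actual fusion pass is not shown here).
}
```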
Reviewed By: ajyu
Differential Revision: D26920008
fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564
Constructor logic was spread throughout InferenceModule and StaticRuntime. This diff unifies the two. After a lot of discussion on D25961626 it became apparent that `clone` is uglier than a cheap StaticRuntime.
This means the old StaticRuntime effectively becomes StaticModule, and the only code in the new StaticRuntime is the `run` functions.
```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```
Changelist:
- Rename InferenceModule to StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule); see the sketch after this list
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).
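A rough sketch of the resulting split; the member names are simplified and illustrative, not the actual class definitions.
```
#include <ATen/core/ivalue.h>
#include <torch/csrc/jit/ir/ir.h>
#include <memory>
#include <vector>

// StaticModule: everything computed once at construction time, shared and
// immutable across threads.
struct StaticModuleSketch {
  std::shared_ptr<torch::jit::Graph> graph;
  // schema, value-to-index maps, processed-op list, options, ...
};

// StaticRuntime: a cheap per-thread object that only owns mutable run state
// (the IValue registers and, in the real code, the memory planner).
struct StaticRuntimeSketch {
  explicit StaticRuntimeSketch(const StaticModuleSketch& m) : module_(&m) {}
  const StaticModuleSketch* module_;
  std::vector<c10::IValue> values_;
};
```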
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D25592967
fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684
With alias analysis we get much more powerful registration and we can start removing "native" and fallback interpreted implementations. `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations. Ideally every node will run out of place every time. Afaik, there's never a reason to disable it and we may want to remove that functionality.
This diff does introduce a "leak" in the memory management: containers are not cleaned up. This only happens when out variants are enabled.
Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled
Reviewed By: maratsubkhankulov, hlu1
Differential Revision: D26515801
fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50060
Aliasing is currently mishandled in SR.
This diff fixes that issue entirely and allows us to avoid hard coded "view" registration. I'll remove the macro in a follow up diff.
However, this diff introduces a subtle assumption when memory optimization is turned on: operators cannot "sometimes alias." Some care will need to be taken to actually make sure this is enforced going forward.
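To make that assumption concrete, here is a small ATen illustration of a "sometimes aliases" op: reshape returns a view when the requested shape is expressible over the input's strides, and a copy otherwise.
```
#include <ATen/ATen.h>

void sometimes_alias_example() {
  at::Tensor a = at::rand({2, 3});
  at::Tensor v = a.reshape({6});      // contiguous input -> view, shares storage with a
  at::Tensor c = a.t().reshape({6});  // non-contiguous input -> fresh storage (copy)
  TORCH_CHECK(v.is_alias_of(a));
  TORCH_CHECK(!c.is_alias_of(a));
}
```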
Benchmark results for this diff:
```
$ batch=20 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.512114. Iters per second: 1952.69
PyTorch run finished. Milliseconds per iter: 0.51176. Iters per second: 1954.04
$ batch=20 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.511402. Iters per second: 1955.41
PyTorch run finished. Milliseconds per iter: 0.506493. Iters per second: 1974.36
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=false |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0562877. Iters per second: 17765.9
PyTorch run finished. Milliseconds per iter: 0.0667712. Iters per second: 14976.5
$ batch=1 iters=100000 ./run.sh --pt_optimize_memory=true |& grep "finished"
C2 run finished. Milliseconds per iter: 0.0561829. Iters per second: 17799
PyTorch run finished. Milliseconds per iter: 0.0665069. Iters per second: 15036
```
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: eellison
Differential Revision: D25581156
fbshipit-source-id: 41e68119d53e687a9c32d966ed420b270aea4b5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52237
Redo D26331506 (4c58be4573). Get rid of `nodiscard` which broke OSS CI.
- Clean up references of outputs, including Tuples/Lists, by using move semantics (see the sketch below)
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for corner case of Tuple/List element being inputs.
- Modify unit tests to check for use_counts of outputs
- Clean up dead code. There is a bit of overlap with D25592967, but it shouldn't be a problem.
This diff does not try to fix the alias problem with the MemoryPlanner.
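For illustration, a minimal sketch of the move-based cleanup from the first bullet; the names are illustrative, not the actual MemoryPlanner API.
```
#include <ATen/core/ivalue.h>
#include <vector>

// Moving the output IValue out of the runtime's internal slot leaves the slot
// as None, so the runtime stops holding an owning reference once the caller
// has the result.
c10::IValue take_output(std::vector<c10::IValue>& values, size_t output_index) {
  return std::move(values[output_index]);
}
```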
Reviewed By: swolchok
Differential Revision: D26432539
fbshipit-source-id: e08990e4066c1ce69ad5274860851d012b7be411
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51991
- Clean up references of outputs, including Tuples/Lists, by using move semantics
- Clean up references of elements in output Tuples/Lists by adding them to `unmanaged_values_` in MemoryPlanner. Check for corner case of Tuple/List element being inputs.
- Modify unit tests to check for use_counts of outputs
- Clean up dead code. There is a bit of overlap with D25592967, but it shouldn't be a problem.
This diff does not try to fix the alias problem with the MemoryPlanner.
(Note: this ignores all push blocking failures!)
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```
Reviewed By: bwasti
Differential Revision: D26333953
fbshipit-source-id: cadc0595ad6ab754c4f1f7a5a3733b2c16b3102f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51952
StaticRuntime should not hold owning refs of inputs after inference is finished. This diff adds a pass to clean them up and unit tests to enforce the check.
Will clean up output tensors in separate diffs.
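A minimal sketch of the idea, with an illustrative name for the input slots.
```
#include <ATen/core/ivalue.h>
#include <vector>

// After inference, reset the input slots so the runtime no longer holds
// owning refs to the caller's input tensors.
void release_input_refs(std::vector<c10::IValue>& input_slots) {
  for (auto& slot : input_slots) {
    slot = c10::IValue();  // None; drops the refcount on the previous value
  }
}
```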
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test
```
Reviewed By: bwasti
Differential Revision: D26331506
fbshipit-source-id: d395a295ada9de3033d0ea05d1dbab62d879a03b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug with the MemoryPlanner with regard to view ops with out variant.
```
def forward(self, a: Tensor, shape: List[int]):
    b = a.reshape(shape)
    return b + b
```
In this case, if we replace reshape with its out variant, b would be managed by the MemoryPlanner, and its storage would be set to nullptr right after inference when opts.cleanup_activations is true. Because b is a view of a, the storage of a would also be set to nullptr, which violates the API's promise that a is const.
To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249
- Add out variants for reshape and flatten. reshape and flatten only create tensor views when they can; in cases where they can't, they do a copy. The out variant reuses the TensorImpl in both cases. The difference is that the TensorImpl is a view in the first case but a normal TensorImpl in the second case.
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage, however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.
Reviewed By: ajyu
Differential Revision: D25992202
fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50050
Every node will now own its outputs.
I don't expect any big perf improvements from this diff; the only eliminated code is from deallocate_registers.
Largely, this is to enable more optimizations going forward.
Test Plan:
buck test mode/dev //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/test:static_runtime
Reviewed By: hlu1
Differential Revision: D25571181
fbshipit-source-id: 91fcfbd5cd968af963ba89c45656997650ca6d18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49340
This refines the fusion group to include only certain types of operations. We cannot safely handle "canRunNatively" types, and the memonger pass causes regressions on some internal models, so it was disabled (to be revisited with proper memory optimization once tensor pools are implemented).
Test Plan:
```
buck test mode/no-gpu caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ZolotukhinM
Differential Revision: D25520105
fbshipit-source-id: add61d103e4f8b4615f5402e760893ef759a60a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49199
This should save us an extra round of dispatch for resize_,
resize_as_, detach_, and copy_, at the cost of disabling profiling and
tracing. I'm told that static runtime has its own per-op profiling and
we don't need tracing.
ghstack-source-id: 118348314
Test Plan:
Code review to confirm lack of need for profiling &
tracing, and that there isn't a different switch we should be using
instead.
Internal benchmarks -- seeing 11-12% improvement in overall runtime
Reviewed By: hlu1
Differential Revision: D25476819
fbshipit-source-id: 71e2c919b386b25c41084e2e4a54fe765a4f8f22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48616
This adds a couple of _out variants and then registers them to the registry.
I also added the concept of "canReuse{Input,Output}" so that we can annotate tensors that are not optimizable (specifically, non-float tensors).
In the future we can change this (with D25062301).
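As a concrete, hedged illustration of what an out variant buys here:
```
#include <ATen/ATen.h>

// The out variant writes into a preallocated tensor instead of allocating a
// fresh output, which is what lets the memory planner own and reuse buffers.
void out_variant_example() {
  at::Tensor a = at::rand({8});
  at::Tensor b = at::rand({8});
  at::Tensor out = at::empty({8});  // buffer that could be reused across runs
  at::add_out(out, a, b);           // no output allocation inside the op
}
```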
After removing `RecordFunction`, we see these results:
```
BS=20
---
caffe2: 0.651617 ~ 0.666354
static runtime: 0.753481
pytorch: 0.866658
BS=1
---
caffe2: 0.0858684 ~ 0.08633
static runtime: 0.209897
pytorch: 0.232694
```
Test Plan: standard internal test of ads model against caffe2 reference (see the scripts in this quip: https://fb.quip.com/ztERAYjuzdlr)
Reviewed By: hlu1
Differential Revision: D25066823
fbshipit-source-id: 25ca181c62209a4c4304f7fe73832b13e314df80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48861
`std::function` already has an empty state; no need to wrap
it in `c10::Optional`.
ghstack-source-id: 117891382
Reviewed By: hlu1
Differential Revision: D25296912
fbshipit-source-id: 8291bcf11735d49db17415b5de915591ee65f781
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48161
- Register BlackBoxPredictor AllocationArenaPool as CPUCachingAllocator
- Use the AllocationArenaPool in both BlackBoxPredictor and StaticRuntime
Test Plan:
```
buck run //caffe2/caffe2/fb/predictor:black_box_predictor_test
buck run //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
AF canary:
https://www.internalfb.com/intern/ads/canary/431021257540238874/
Reviewed By: dzhulgakov
Differential Revision: D24977611
fbshipit-source-id: 33ba596b43c1e558c3ab237a0feeae93565b2d35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896
The idea of the memory model is quite similar to that of BlackBoxPredictor; however, it's more complicated in PyTorch due to 1) tensor views that share storage (with storage refcount bumps) but have different TensorImpls, 2) tensors sharing the same TensorImpl and the same storage with no refcount bump of the StorageImpl, 3) data types such as TensorLists and Tuples that have Tensors in them, and 4) the need to support a mix of out and non-out variants while we move the aten ops to out variants.
As a result, I have to make the following adjustments:
1) remove tensors in output Tuples from internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variants, extract the StorageImpls from those Tensors, dedup them, remove the StorageImpls of output tensors, and arrive at the final list of blobs for memory planning (see the sketch after this list);
3) during the clean_up_memory pass, clean up memory held by the StorageImpls as well as Tensors/Lists/Tuples in IValues that don't participate in memory planning to reduce overall memory usage
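A hedged sketch of the dedup step from 2); the names are illustrative, not the actual MemoryPlanner code.
```
#include <ATen/ATen.h>
#include <unordered_set>
#include <vector>

// Views share a StorageImpl, so dedup by StorageImpl* and drop the storages
// that belong to graph outputs; what remains are the blobs the planner may
// manage and reuse.
std::vector<c10::StorageImpl*> managed_storages(
    const std::vector<at::Tensor>& out_variant_outputs,
    const std::unordered_set<c10::StorageImpl*>& graph_output_storages) {
  std::unordered_set<c10::StorageImpl*> seen;
  std::vector<c10::StorageImpl*> managed;
  for (const at::Tensor& t : out_variant_outputs) {
    c10::StorageImpl* s = t.storage().unsafeGetStorageImpl();
    if (seen.insert(s).second && graph_output_storages.count(s) == 0) {
      managed.push_back(s);
    }
  }
  return managed;
}
```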
Risk:
The PyTorch team is planning to deprecate the current resize_output API, which we rely on. This is a pretty big risk.
https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23
Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```
|pt_cleanup_activations |pt_enable_out_variant |old ms/iter |new ms/iter |
|--- |--- |--- |--- |
|0 |0 |0.31873 |0.30228 |
|0 |1 |0.30018 |0.29184 |
|1 |0 |0.35246 |0.31895 |
|1 |1 |0.35742 |0.30417 |
Reviewed By: bwasti, raziel
Differential Revision: D24471854
fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219
- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
  explicit InferenceModule(const torch::jit::Module& m);
  explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
  torch::jit::Module module;
  std::shared_ptr<torch::jit::Graph> graph;
  std::unique_ptr<c10::FunctionSchema> schema;
  std::unordered_map<Value*, size_t> value_to_reg;
  std::vector<size_t> input_regs; // inputs to the graph
  std::vector<size_t> output_regs; // outputs of the graph
  std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor, as well as the static runtime, and shared across threads. Then this is what's left inside the Static Runtime:
```
mutable std::vector<IValue> reg_;
// The nodes we need to run
std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations, and differs across threads at run time. `nodes_` holds the op nodes and their input/output registers, and is the same across threads for now. We could potentially put other stateful data structures in it, so I kept it inside the static runtime. It could easily be moved into `InferenceModule` if we decide not to put anything else into `ProcessedNode`.
- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.
- Integration with PyTorchPredictor. Added a lock-free stack in the PyTorchPredictor to hold all the static runtime instances. Benchmarks show that the `push` and `pop` combo takes about 80 ns, which is quite acceptable. A simplified sketch of the pooling pattern is below.
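A simplified sketch of that pooling pattern; the real integration uses a lock-free stack, while a mutex-guarded vector is used here only to keep the sketch short.
```
#include <memory>
#include <mutex>
#include <vector>

template <typename Runtime>
class RuntimePool {
 public:
  // Pop a pooled runtime, or return nullptr so the caller constructs a fresh one.
  std::unique_ptr<Runtime> acquire() {
    std::lock_guard<std::mutex> g(mu_);
    if (pool_.empty()) {
      return nullptr;
    }
    auto rt = std::move(pool_.back());
    pool_.pop_back();
    return rt;
  }
  // Return the runtime to the pool after inference.
  void release(std::unique_ptr<Runtime> rt) {
    std::lock_guard<std::mutex> g(mu_);
    pool_.push_back(std::move(rt));
  }

 private:
  std::mutex mu_;
  std::vector<std::unique_ptr<Runtime>> pool_;
};
```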
This diff focuses on threading model only. Benchmarks will be separate.
Reviewed By: bwasti
Differential Revision: D24237078
fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308
This PR adds a hand-optimized version of the DeepAndWide model with the goal of estimating the overheads of static runtime. While static runtime is currently much faster than the existing JIT interpreter, it would be useful to understand how close we are to an absolutely zero-overhead system. Currently, this "ideal" implementation is 2x faster than the static runtime at batch size 1.
Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 4096K (x24)
L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1 59518 ns 59500 ns 10909
BM_deep_wide_base/8 74635 ns 74632 ns 9317
BM_deep_wide_base/20 82186 ns 82147 ns 9119
BM_deep_wide_fast/1 13851 ns 13851 ns 49825 << new
BM_deep_wide_fast/8 22497 ns 22497 ns 32089 << new
BM_deep_wide_fast/20 23868 ns 23841 ns 31184 << new
BM_deep_wide_jit_graph_executor/1 62786 ns 62786 ns 10835
BM_deep_wide_jit_graph_executor/8 76730 ns 76718 ns 7529
BM_deep_wide_jit_graph_executor/20 78886 ns 78883 ns 8769
BM_deep_wide_jit_profiling_executor/1 69504 ns 69490 ns 10309
BM_deep_wide_jit_profiling_executor/8 75718 ns 75715 ns 9199
BM_deep_wide_jit_profiling_executor/20 75364 ns 75364 ns 9010
BM_deep_wide_static/1 40324 ns 40318 ns 17232
BM_deep_wide_static/8 50327 ns 50319 ns 13335
BM_deep_wide_static/20 53075 ns 53071 ns 12855
BM_deep_wide_static_threaded/threads:8 6258 ns 49873 ns 14008
```
PS: The implementation could probably be optimized even more.
Differential Revision: D24300702
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Pulled By: ZolotukhinM
fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45892
Previously we were using a hashtable (`std::unordered_map` in OSS, `folly::F14FastMap` in fb) for the workspace, a container for all the IValues in the graph. Hashtable-based lookups can be expensive. This diff replaces the hashtable with `std::vector`, and extra bookkeeping is introduced to keep track of the indices of graph inputs/outputs in `StaticRuntime` and of op inputs/outputs in `ProcessedNode`.
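A hedged sketch of that bookkeeping, with simplified, illustrative names.
```
#include <ATen/core/ivalue.h>
#include <cstddef>
#include <vector>

// All IValues live in one flat vector owned by the runtime; each node stores
// the indices of its inputs/outputs instead of keys into a hash map.
struct ProcessedNodeSketch {
  std::vector<size_t> inputs;   // indices into the runtime's value table
  std::vector<size_t> outputs;
};

struct ValueTableSketch {
  std::vector<c10::IValue> values;
  const c10::IValue& input_of(const ProcessedNodeSketch& n, size_t i) const {
    return values[n.inputs[i]];  // O(1) indexed lookup, no hashing
  }
};
```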
Reviewed By: dzhulgakov
Differential Revision: D24098763
fbshipit-source-id: 337f835ee144985029b5fa2ab98f9bcc5e3606b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45639
`StaticRuntime::run_individual` mimics the caffe2 operator benchmark `SimpleNet::TEST_Benchmark`, so we can get accurate information on the operator breakdown. We found that the PyTorch AutogradProfiler adds a lot of overhead to small models such as the adindexer precomputation_merge net: 100% for batch_size 1, 33% for batch_size 20. This implementation adds very little overhead, as shown in the test plan.
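A hedged sketch of what this style of per-op timing boils down to; the node type and its `run()` method are illustrative.
```
#include <chrono>
#include <cstddef>
#include <vector>

// Time each node across many iterations with plain steady_clock reads,
// accumulating per-node totals in microseconds.
template <typename Node>
std::vector<double> time_per_node(std::vector<Node>& nodes, int iters) {
  std::vector<double> total_us(nodes.size(), 0.0);
  for (int it = 0; it < iters; ++it) {
    for (size_t i = 0; i < nodes.size(); ++i) {
      auto t0 = std::chrono::steady_clock::now();
      nodes[i].run();
      auto t1 = std::chrono::steady_clock::now();
      total_us[i] += std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
  }
  return total_us;
}
```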
Test Plan: Test results are fb internal only.
Reviewed By: yinghai, dzhulgakov
Differential Revision: D24012088
fbshipit-source-id: f32eb420aace93e2de421a15e4209fce6a3d90f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647
Nothing fancy, just a basic implementation of the graph executor without using stack machine.
Reviewed By: bwasti
Differential Revision: D23208413
fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43182
We should avoid using `deepcopy` on the module because it involves copying the weights.
Comparing the implementation of `c10::ivalue::Object::copy()` vs `c10::ivalue::Object::deepcopy()`, the only difference is `deepcopy` copies the attributes (slots) while `copy` does not.
Reviewed By: bwasti
Differential Revision: D23171770
fbshipit-source-id: 3cd711c6a2a19ea31d1ac1ab2703a0248b5a4ef3
Summary:
This PR whitelists and simplifies graphs to help with development later on. Key to note in this PR is the use of both pattern substitution and the registration of custom operators. This will likely be one of the main optimization types done in this folder.
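As a hedged illustration of that combination: the `my_ops::add_relu` op, its schema, and the add+relu pattern below are made up for the example and are not the patterns used in this PR.
```
#include <ATen/ATen.h>
#include <torch/csrc/jit/passes/subgraph_rewrite.h>
#include <torch/library.h>

// 1) Register a custom fused operator that the rewritten graph can call.
at::Tensor add_relu(
    const at::Tensor& a,
    const at::Tensor& b,
    const c10::Scalar& alpha) {
  return at::relu(at::add(a, b, alpha));
}

TORCH_LIBRARY(my_ops, m) {
  m.def("add_relu(Tensor a, Tensor b, Scalar alpha) -> Tensor", add_relu);
}

// 2) Substitute the matching subgraph with a call to the fused op.
void fuse_add_relu(std::shared_ptr<torch::jit::Graph>& graph) {
  const std::string pattern = R"IR(
    graph(%a, %b, %alpha):
      %x = aten::add(%a, %b, %alpha)
      %y = aten::relu(%x)
      return (%y))IR";
  const std::string fused = R"IR(
    graph(%a, %b, %alpha):
      %y = my_ops::add_relu(%a, %b, %alpha)
      return (%y))IR";
  torch::jit::SubgraphRewriter rewriter;
  rewriter.RegisterRewritePattern(pattern, fused);
  rewriter.runOnGraph(graph);
}
```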
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43024
Reviewed By: hlu1
Differential Revision: D23114262
Pulled By: bwasti
fbshipit-source-id: e25aa3564dcc8a2b48cfd1561b3ee2a4780ae462