Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414
Fixes a misuse of a raw pointer here: the stack is never nullable.
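For illustration, the usual fix for a never-null argument is to take a reference instead of a pointer; a minimal, self-contained sketch (the type names below are stand-ins, not the real static runtime API):
```
#include <vector>

// Stand-ins for illustration only; not the actual Stack/IValue types.
using IValue = int;
using Stack = std::vector<IValue>;

// Before: a raw pointer suggests the stack could be null, which it never is.
void run_old(Stack* stack) { stack->push_back(42); }

// After: a reference documents and enforces that a valid stack is required.
void run_new(Stack& stack) { stack.push_back(42); }

int main() {
  Stack s;
  run_new(s);
  return 0;
}
```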
ghstack-source-id: 136938318
Test Plan:
compiles.
Imported from OSS
Reviewed By: ejguan
Differential Revision: D30375410
fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64101
Calling `getOutOfPlaceOperation(n)` is very expensive, especially in multithreaded environments, because it acquires a lock when querying the NNC cache. This slows down memory planner initialization and, by extension, the latency of the first static runtime inference.
There are two optimizations in this diff:
* Cache the result of `p_node->has_out_variant()` to avoid the call to `getOutOfPlaceOperation`. This speeds up calls to `canReuseInputOutputs`, which in turn speeds up `isOptimizableContainerType` (see the sketch after this list).
* Precompute all `isOptimizableContainerType` during static runtime initialization to avoid a pass over all of each node's inputs.
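For illustration, a rough, self-contained sketch of the caching pattern from the first bullet (names are stand-ins, not the actual `ProcessedNode` code):
```
#include <chrono>
#include <thread>

// Stand-in for the expensive, lock-guarded NNC cache query.
bool queryOutOfPlaceOperation() {
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
  return true;
}

class ProcessedNodeSketch {
 public:
  // Pay the expensive query once, at construction time.
  ProcessedNodeSketch() : has_out_variant_(queryOutOfPlaceOperation()) {}

  // Callers such as canReuseInputOutputs / isOptimizableContainerType now
  // read a cached bool instead of re-querying the cache.
  bool has_out_variant() const { return has_out_variant_; }

 private:
  bool has_out_variant_;
};
```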
Test Plan: All unit tests pass: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: movefast1990
Differential Revision: D30595579
fbshipit-source-id: 70aaa7af9589c739c672788bf662f711731864f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078
This change converts `aten::layer_norm -> output Tensor` into `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the `tmp1` and `tmp2` Tensors are managed by the static runtime.
Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
at::Tensor mean = create_empty_from({M}, *X);
at::Tensor rstd = create_empty_from({M}, *X);
```
that the static runtime misses an opportunity to manage.
This change turns them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime currently chooses to manage only output tensors.
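For illustration, a hedged sketch of what registering such a placeholder schema could look like; the exact schema string and registration code are assumptions, not taken from this diff:
```
#include <torch/library.h>

TORCH_LIBRARY_FRAGMENT(static_runtime, m) {
  // Same arguments as aten::layer_norm, but with two extra Tensor outputs
  // (the mean/rstd temporaries) so the memory planner can manage them.
  m.def(
      "layer_norm(Tensor input, int[] normalized_shape, Tensor? weight=None, "
      "Tensor? bias=None, float eps=1e-05, bool cudnn_enable=True) "
      "-> (Tensor, Tensor, Tensor)");
}
```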
Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.
- Confirmed that the new op gets activated during testing:
```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)
```
Reviewed By: hlu1
Differential Revision: D30486475
fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63999
Use folly::F14FastMap/F14FastSet instead of std::unordered_map/unordered_set in the Static Runtime code base. folly::F14FastMap/F14FastSet implement the same APIs as std::unordered_map/unordered_set but are faster. For details, see https://github.com/facebook/folly/blob/master/folly/container/F14.md
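A minimal sketch of the swap, assuming folly is available as a dependency (the keys and values here are just examples):
```
#include <folly/container/F14Map.h>
#include <folly/container/F14Set.h>
#include <string>

int main() {
  // Same API surface as std::unordered_map / std::unordered_set,
  // so call sites only need the type to change.
  folly::F14FastMap<std::string, int> value_to_index;
  folly::F14FastSet<int> managed_values;

  value_to_index["x"] = 0;
  managed_values.insert(42);
  return (value_to_index.count("x") == 1 && managed_values.count(42) == 1) ? 0 : 1;
}
```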
Reviewed By: d1jang
Differential Revision: D30566149
fbshipit-source-id: 20a7fa2519e4dde96fb3fc61ef6c92bf6d759383
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579
Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`
Reviewed By: navahgar
Differential Revision: D30410525
fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63704
Previously SR did not clone the graph. This led to subtle bugs in `testStaticRuntime`: static runtime would modify its graph, and the graph used by the JIT interpreter would change as well. The JIT interpreter would then crash if SR-only ops were added!
Cloning the graph is more consistent with the behavior of the `Module` ctor.
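For illustration, a hedged sketch of the idea (the function name and placement are illustrative, not the actual constructor code):
```
#include <memory>
#include <torch/csrc/jit/ir/ir.h>

// Take a deep copy so static runtime's graph rewrites cannot leak into the
// graph the JIT interpreter is still using.
std::shared_ptr<torch::jit::Graph> prepareOwnedGraph(
    const std::shared_ptr<torch::jit::Graph>& g) {
  return g->copy();
}
```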
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D30463294
fbshipit-source-id: b771551a1f55f95fde79373b23babcf3e5ddf726
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63577
Since other variadic ops will have an almost identical implementation, we can generalize the `UseVariadicCat` implementation and put it in a common folder.
Also moved some test utilities that other variadic op tests will likely need.
Test Plan: `buck test caffe2/test/cpp/jit:jit -- ConcatOptTest`
Reviewed By: navahgar
Differential Revision: D30409937
fbshipit-source-id: 925c11c27b58ce98cb8368d2a205e26ba66d3db9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63346
We have seen that we can get significant perf wins essentially for free by implementing native ops for ops that we cannot write out variants for (e.g. TupleUnpack D30306955 (078b8004a6), append D30326461 (9d9e7a8d72)). Therefore, whether or not SR is using a native implementation is valuable information. By capturing it in the benchmarking suite, we can hopefully avoid wasting time profiling or manually inspecting `native_ops.cpp`.
Reviewed By: hlu1
Differential Revision: D30346752
fbshipit-source-id: 205b090513b6a5a6ce4cb92f75ab0395b15d08f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63012
This change fixes a bug where the static runtime's memory optimizer assigns multiple outputs of a node to the same storage. Fixing this bug enables the static runtime to run `inline_cvr` with its memory optimizer enabled.
A problematic line from `inline_cvr` was as follows:
```
%7767 : Tensor, %getitem_6419.1 : Tensor = fb::gather_ranges(%tensor74.1, %7764)
```
where enabling the memory optimizer assigned `%7767` and `%getitem_6419.1` to the same storage, which corrupted their data during the second iteration.
This change fixes the bug by marking all inputs and outputs of a node as `alive` during liveness analysis, so that a node's inputs and outputs can never collide with each other. This is a fair assumption that most op implementations already rely on, but it was missing from our analysis before this change.
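For illustration, a toy, self-contained sketch of that liveness tweak (not the actual memory planner code): every input and output of a node is treated as simultaneously live, so no two of them can be assigned to the same storage.
```
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct ToyNode {
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// live_with[v] = values that must never share storage with v.
using LivenessMap =
    std::unordered_map<std::string, std::unordered_set<std::string>>;

void markNodeValuesAlive(const ToyNode& node, LivenessMap& live_with) {
  std::vector<std::string> all = node.inputs;
  all.insert(all.end(), node.outputs.begin(), node.outputs.end());
  // Every pair of distinct values at this node conflicts, including the
  // output/output pairs that the old analysis missed.
  for (const auto& a : all) {
    for (const auto& b : all) {
      if (a != b) {
        live_with[a].insert(b);
      }
    }
  }
}
```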
Test Plan: - Added a unittest `StaticRuntime.ValuesShareSameStorageDoesNotContainOutputsFromSameNode` to cover the new code.
Reviewed By: hlu1
Differential Revision: D30202018
fbshipit-source-id: 10287a1bee9e86be16a5201e9a7cd7c7f046bab9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63256
This suppresses printing the per-node time, which gets very long when the net has many ops. It can easily be turned back on by setting `--pt_sr_print_per_node_time=1`.
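For illustration, a hedged sketch of the kind of switch this adds; the flag name comes from this summary, but the registration macro and surrounding code are assumptions:
```
#include <c10/util/Flags.h>
#include <iostream>

C10_DEFINE_bool(
    pt_sr_print_per_node_time,
    false,
    "If true, print the per-node timing breakdown in SR benchmarks");

void maybePrintPerNodeTime() {
  if (FLAGS_pt_sr_print_per_node_time) {
    std::cout << "per-node timing breakdown..." << std::endl;
  }
}
```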
Reviewed By: ajyu, mikeiovine
Differential Revision: D30298331
fbshipit-source-id: 32b3f93b3fe19d335654168311fda93331a1e706
Summary:
Replace explicit for loops with `c10::irange` loops. Also fix some unused-variable warnings in range-loop cases.
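A small sketch of the pattern, using `c10::irange` from `c10/util/irange.h`:
```
#include <c10/util/irange.h>
#include <cstdint>
#include <vector>

int64_t sumSizes(const std::vector<std::vector<int>>& vs) {
  int64_t total = 0;
  // Before: for (size_t i = 0; i < vs.size(); ++i) { ... }
  for (const auto i : c10::irange(vs.size())) {
    total += static_cast<int64_t>(vs[i].size());
  }
  return total;
}
```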
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62928
Reviewed By: driazati
Differential Revision: D30171904
Pulled By: malfet
fbshipit-source-id: 1b437a0f7e3515f4a2e324f3450e93312f1933ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61301
This change adds a `DCHECK` to ensure that outputs do not overlap with immutable inputs.
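For illustration, a hedged sketch of the invariant being checked, not the exact code from this diff:
```
#include <ATen/ATen.h>
#include <c10/util/Logging.h>

// An output produced by an out-variant op must not alias an input that the
// op treats as immutable.
void verifyNoOverlap(const at::Tensor& immutable_input,
                     const at::Tensor& output) {
  DCHECK(!output.is_alias_of(immutable_input));
}
```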
Test Plan:
Added unittests as follows:
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithImmutableArguments`
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithMutableArguments`
Reviewed By: hlu1
Differential Revision: D29564158
fbshipit-source-id: bf14b4978ab544af79010cf724ed28202b4521cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61506
Separate the logic of `GetAlwaysAliveValues` out of `GetLivenessMap` to simplify the code structure. With this split, `GetLivenessMap` no longer needs to run when optimize_memory is turned off.
Reviewed By: ajyu
Differential Revision: D29423534
fbshipit-source-id: dbdeeb10f7bcad86a24aa12f741f7c9ab946bb3b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505
The handling of `self` in static runtime was previously incorrect. This diff fixes that issue; `self` is essential to `prim::GetAttr`/`prim::SetAttr`, since most of the time we are getting and setting attributes on `self`, the TorchScript module.
Reviewed By: ajyu
Differential Revision: D29350173
fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60669
Test Plan: Added unit test to check for nested outputs.
Reviewed By: ajyu
Differential Revision: D29322025
fbshipit-source-id: a3c8d3c5f0bb7cf7fda4bc5f579adb8fa7bc3724
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58829
- Delete copying and moving of MemoryPlanner (see the sketch below).
- Remove `inline` from some member functions, since member functions defined inside the class body are inline by default.
- Clean up and update comments.
- Reorganize some code.
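A minimal sketch of the first bullet (illustrative only, not the real class):
```
class MemoryPlannerSketch {
 public:
  MemoryPlannerSketch() = default;

  // Deleted copy and move operations, matching the change described above.
  MemoryPlannerSketch(const MemoryPlannerSketch&) = delete;
  MemoryPlannerSketch& operator=(const MemoryPlannerSketch&) = delete;
  MemoryPlannerSketch(MemoryPlannerSketch&&) = delete;
  MemoryPlannerSketch& operator=(MemoryPlannerSketch&&) = delete;
};
```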
Reviewed By: edvgha
Differential Revision: D28555476
fbshipit-source-id: 7ea8efc0e2ed93a6788a742470b9e753a85df677
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57972
Allow static runtime to be on when glow is on. This should be fine as long as glow AOT has already been run.
Test Plan: Tested on the replayer with the remote_other net. D28291326 fixes the remaining issue of removing loops from the remote_other model; still need to test on the regenerated model.
Reviewed By: hlu1
Differential Revision: D28275514
fbshipit-source-id: ee78972660dfdc3fcfb9af2cf7ebb19ee745a4f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521
When an op is added to static runtime, we manually check the schema (not with the JIT schema check, but with `IValue::isTensor()`/`isInt()` etc.) and make sure it is one we support. If the schema doesn't match, SR throws an exception with TORCH_CHECK, which makes the entire graph invalid for SR.
This diff makes ops with unsupported schemas use the fallback path and go through the dispatcher instead:
```
if (node->kind() != prim::ListConstruct &&
    node->kind() != prim::TupleConstruct &&
    node->kind() != prim::DictConstruct &&
    node->kind() != prim::ListUnpack) {
  const Operator& op = node->getOperator();
  TORCH_CHECK(op.hasOperation());
  op_ = op.getOperation(node);
  VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
}
```
The 2-arg `torch.norm`, which the SR `torch.norm` implementation doesn't support (only the 3-, 4-, and 5-arg variants are supported), can now run in static runtime in fallback mode.
(Note: this ignores all push blocking failures!)
Reviewed By: ajyu
Differential Revision: D27531447
fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os
def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812
`fb::equally_split` gets fused with `ListUnpack`, and all outputs from `ListUnpack` get attached to `fb::equally_split`.
As a result, `fb::equally_split` will have as many outputs as `ListUnpack`.
Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27974999
fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56839
Enable check_for_memory_leak at the end of StaticRuntime::benchmark so this code is exercised more often.
Test Plan: Checked with adindexer merge net model
Reviewed By: edvgha
Differential Revision: D27417911
fbshipit-source-id: 5248942dc439fcc7301ffb0005da76374939fa96
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56565
`fb::equally_split` gets fused with `ListUnpack`, and all outputs from `ListUnpack` get attached to `fb::equally_split`.
As a result, `fb::equally_split` will have as many outputs as `ListUnpack`.
Test Plan:
buck test caffe2/torch/fb/sparsenn:fb_operators_test
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27902824
fbshipit-source-id: 7855047c3bd46bbb74b7346ac384c70b6a3e1f46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447
MemoryPlanner shouldn't manage StorageImpls; instead, it should manage the TensorImpls because the StorageImpl in Tensors can change.
Test Plan: CI
Reviewed By: ajyu
Differential Revision: D27840361
fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55811
- Added manage_graph_output_memory flag to opts (default false)
- Added checks for the flag dependencies among enable_out_variant, optimize_graph_output_memory, and optimize_memory
- Minor refactoring for readability
Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'
Reviewed By: hlu1
Differential Revision: D27573780
fbshipit-source-id: 28698657f686f27b8ad60e1276cdf17402d2cf91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438
The August 1x model has DictConstruct in the graph (P331168321).
These could be removed with a JIT pass, but to easily measure the improvement
and run the replayer with the model in the meantime, enable DictConstruct in static runtime.
Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```
```
TEST trace
Total num requests 100
Num exceptions 0
Latency us avg 180965
Latency us p25 89785
Latency us p50 131240
Latency us p75 146621
Latency us p90 158378
Latency us p95 166628
Latency us p99 1886680
Latency us p100 3803252
Server latency us avg 91554
Server latency us p25 51447
Server latency us p50 86371
Server latency us p75 95229
Server latency us p90 102706
Server latency us p95 116023
Server latency us p99 557017
Server latency us p100 716319
Num rankUnits avg 28
```
Reviewed By: hlu1
Differential Revision: D27236682
fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53920
Fusing SigridTransforms + ListUnpack allows for enabling out variant for SigridTransforms so that the output tensors can be managed by the MemoryPlanner in Static Runtime.
The speedup comes from three parts: 1) getting rid of memory allocation inside SigridTransforms itself, 2) saving memory deallocation cost (outside SigridTransforms, inside the MemoryPlanner), and 3) getting rid of ListUnpack. However, for 3) we still need to pay the cost of constructing a `vector<Tensor>` for the outputs and a round of refcount bumps for all the output TensorImpls.
Reviewed By: ajyu
Differential Revision: D26220546
fbshipit-source-id: 651bdfb850225511c43b8f50083b13e8dec46bcc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54111
If we only run the ReplaceWithCopy pass when enable_out_variant is true, there is no need to register a default op implementation.
Reviewed By: edvgha
Differential Revision: D27036077
fbshipit-source-id: f615f5d8b84629044af1c554421ea5e505e93239
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53799
Fix two issues with ClipRangesGatherRangesX2SigridHash and ClipRangesGatherRangesX2SigridHashPrecompute:
- The first issue is with the two-step graph rewrite process. If step 2 doesn't happen after step 1, then we're stuck with a graph containing a `fb::placeholder` op that can't run. Step 3 is added to revert step 1, so the original graph is restored if any `fb::placeholder` op is left.
- The second issue is with `SigridHashPrecompute`. The coupling with `freeze_module` is not ideal and limits its use to Static Runtime only. By running `ConstantPropagation` and `ConstantPooling` after splitting SigridHash, we can move all the Constant ops to the front of the graph and fusion can happen right afterwards.
Reviewed By: ajyu
Differential Revision: D26920008
fbshipit-source-id: e4bc67c7a15181bac5dbbfbb95d861849652bddf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564
Constructor logic was spread throughout InferenceModule and StaticRuntime. This diff unifies the two. After a lot of discussion on D25961626, it became apparent that `clone` is uglier than a cheap StaticRuntime.
This means StaticRuntime is effectively StaticModule and the only code in the new StaticRuntime is the `run` functions.
```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```
Changelist:
- Rename InferenceModule to StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D25592967
fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f