This PR does two things:
1. It moves some Windows warning suppressions from various CMake files into the main CMakeLists.txt, following the conventions already used for gcc and clang.
2. It fixes some Windows warnings in the source code. Most importantly, it fixes many dll-linkage warnings by changing C10_API to TORCH_API or TORCH_PYTHON_API (see the sketch below). Some dll warnings remain because some TORCH_API functions are actually built as part of libtorch_python.
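For illustration, here is a minimal sketch of the macro mechanics behind those warnings (simplified from the shape of `c10/macros/Export.h`; the `is_enabled` declaration is hypothetical):
```
// On Windows, the export macros expand to dllexport while building the owning
// library and to dllimport everywhere else (simplified):
#ifdef C10_BUILD_MAIN_LIB
#define C10_API __declspec(dllexport)
#else
#define C10_API __declspec(dllimport)
#endif

// A function defined in libtorch but declared with C10_API gets inconsistent
// dll linkage, and MSVC warns. Switching the declaration to the macro of the
// library that actually defines the symbol fixes the mismatch:
//   before: C10_API bool is_enabled();
//   after:  TORCH_API bool is_enabled();
```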
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927
Approved by: https://github.com/malfet
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76473
Avoid some extra heap allocations by using DimVector
ghstack-source-id: 155569314
Test Plan: Existing unit tests
Reviewed By: navahgar, huiguoo
Differential Revision: D35972439
fbshipit-source-id: 971998d6bcaaf9bb598772f1e2ca6b13f29f92a4
(cherry picked from commit f2b70c38fffe6355cd8b2f0eb36f299c0d50e5d8)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450
This change uses `SROperator` as the operators' function type.
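For context, a hedged sketch of the alias in question (this matches how SR's headers spell it, modulo exact location):
```
#include <functional>

namespace torch::jit {

struct ProcessedNode; // SR's per-node execution record

// One uniform functor type for all SR operator implementations.
using SROperator = std::function<void(ProcessedNode*)>;

} // namespace torch::jit
```
Using the alias keeps registrations and lookup tables on a single function type instead of each call site spelling out `std::function<void(ProcessedNode*)>`.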
Test Plan: N/A
Reviewed By: mikeiovine
Differential Revision: D34483246
fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f
(cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247
Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
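A small runnable illustration of the difference (variable names invented; this is not the diff itself):
```
#include <ATen/ATen.h>
#include <vector>

int main() {
  at::Tensor t = at::rand({2, 3, 4});
  // Heap-allocating copy of the shape:
  std::vector<int64_t> v = t.sizes().vec();
  // DimVector is a SmallVector sized for typical tensor ranks, so common
  // shapes stay on the stack:
  at::DimVector d(t.sizes().begin(), t.sizes().end());
  return v.size() == d.size() ? 0 : 1;
}
```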
ghstack-source-id: 146933314
Test Plan: CI -- if we think DimVector is good in general, then this change should be good too.
Reviewed By: mikeiovine
Differential Revision: D33556198
fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69274
`impl.h` is the main header file that defines the interface of Static Runtime to its clients.
However, it is currently filled with implementation details that should not be leaked to our clients: 1) this unnecessarily exposes our internals, which can make them hard to change later, and 2) it causes needless merge conflicts when multiple people are touching this enormous impl.cpp file.
To alleviate the situation, this change moves the implementation details from impl.h into a new file, internal.h, which is kept internal without leaking the details to our clients.
This change will be followed by another one renaming `impl.h` to `runtime.h` (or something better), since `impl.h` is currently not about implementation but about SR's interface.
Note that this change is NOT complete, since the remaining declarations in impl.h still contain a lot of implementation details. Therefore, we should keep working on minimizing the interface to prevent our API from bloating unnecessarily. We also need to modularize our implementations into separate files in the near future.
Test Plan: Existing unittests
Reviewed By: donaldong
Differential Revision: D32780415
fbshipit-source-id: 119b7aedbf563b195641c5674572a9348732145f
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64075
Test Plan:
Before:
`I0826 17:17:54.165174 1064079 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.66724. Iters per second: 149.987`
After:
`I0826 17:13:07.464485 1040300 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.46362. Iters per second: 154.712`
Profile after: P453143683
Accuracy tested by comparing against the jit interpreter; no differences above 1e-3 (nnc ops turned on) https://www.internalfb.com/intern/diff/view-version/136824794/
======
With 100-request recordio inputs (211 inputs)
Before:
`I1101 12:43:13.558375 742187 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.7882. Iters per second: 84.8309`
After:
`I1101 13:50:41.087644 1126186 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.6763. Iters per second: 85.6438`
Profile after: P465977010
Constituent ops before (total is 0.5646):
```
0.187392 ms. 1.61737%. fb::clip_ranges_gather (309 nodes, out variant)
0.174101 ms. 1.50266%. fb::lengths_to_offsets (464 nodes, out variant)
0.203126 ms. 1.75317%. static_runtime::to_copy (805 nodes, out variant)
```
Constituent ops after (total is 0.4985):
```
0.376559 ms. 3.25614%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.0614349 ms. 0.531235%. fb::lengths_to_offsets (159 nodes, out variant)
0.0573315 ms. 0.495751%. static_runtime::to_copy (195 nodes, out variant)
0.00325543 ms. 0.0281501%. fb::gather_ranges (4 nodes, out variant)
```
Compare with jit interpreter inside benchmark:
`I1101 13:55:53.013602 1149446 PtVsBlackBoxPredictorBenchLib.cpp:175] Finished comparing PT static runtime and jit interpreter results`
======
Casting on the fly:
a. Static runtime off
```
Static runtime ms per iter: 11.4658. Iters per second: 87.2159
0.220367 ms. 1.94726%. static_runtime::to_copy (805 nodes, out variant)
0.172585 ms. 1.52504%. fb::clip_ranges_gather (309 nodes, out variant)
0.157836 ms. 1.39471%. fb::lengths_to_offsets (464 nodes, out variant)
```
b. Casting on the fly, using explicit allocation+to_copy (which has the fast pass for certain cases, but we'll always call empty):
```
I1115 09:08:35.711972 1925508 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 11.6732. Iters per second: 85.6662
0.599439 ms. 5.25098%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.0552475 ms. 0.483958%. fb::lengths_to_offsets (159 nodes, out variant)
0.0576032 ms. 0.504593%. static_runtime::to_copy (195 nodes, out variant)
0.00299026 ms. 0.0261941%. fb::gather_ranges (4 nodes, out variant)
```
c. Casting on the fly with native::to (no explicit allocation, but no fast pass):
```
Static runtime ms per iter: 11.5627. Iters per second: 86.4849
0.454356 ms. 3.9652%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.06315 ms. 0.551115%. static_runtime::to_copy (195 nodes, out variant)
0.0590741 ms. 0.515544%. fb::lengths_to_offsets (159 nodes, out variant)
0.00359182 ms. 0.031346%. fb::clip_ranges_gather (4 nodes, out variant)
```
d. Removal of the to() call in question from the fusion pattern:
```
Static runtime ms per iter: 11.3658. Iters per second: 87.9836
0.29591 ms. 2.6479%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.154612 ms. 1.38352%. static_runtime::to_copy (500 nodes, out variant)
0.0567151 ms. 0.507505%. fb::lengths_to_offsets (159 nodes, out variant)
0.0051115 ms. 0.0457394%. fb::clip_ranges_gather (4 nodes, out variant)
```
Reviewed By: hlu1
Differential Revision: D30515441
fbshipit-source-id: 53acee10619ac2be7dc8982e929e3210c4bb6d21
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64101
Checking `getOutOfPlaceOperation(n)` is a very expensive operation, especially in multithreaded environments, due to a lock acquisition when the NNC cache is queried. This slows down the memory planner initialization time, and by extension, the latency for the first static runtime inference.
There are two optimizations in this diff:
* Cache the result of `p_node->has_out_variant()` to avoid the call to `getOutOfPlaceOperation`. This speeds up calls to `canReuseInputOutputs`, which in turn speeds up `isOptimizableContainerType` (see the sketch below).
* Precompute `isOptimizableContainerType` for every node during static runtime initialization to avoid a pass over each node's inputs.
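A minimal sketch of the first optimization, with member and constructor details assumed for illustration:
```
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/runtime/static/ops.h> // getOutOfPlaceOperation

// Hypothetical: cache the expensive registry lookup once at construction so
// hot paths like canReuseInputOutputs() never re-acquire the NNC cache lock.
class ProcessedNode {
 public:
  explicit ProcessedNode(torch::jit::Node* node)
      : node_(node),
        has_out_variant_(torch::jit::getOutOfPlaceOperation(node) != nullptr) {}

  bool has_out_variant() const {
    return has_out_variant_; // cheap cached read, no lock acquisition
  }

 private:
  torch::jit::Node* node_;
  bool has_out_variant_;
};
```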
Test Plan: All unit tests pass: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: movefast1990
Differential Revision: D30595579
fbshipit-source-id: 70aaa7af9589c739c672788bf662f711731864f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62335
This change ensures that unittests only use out variants or native ops.
- Our unittests currently assume that a graph fed to the static runtime correctly replaces interpreter ops with their corresponding out variants / native ops, but this was never actually checked. This change ensures that it is (see the sketch after this list).
- We relied on manual inspection of log messages to see if an out variant is used for a specific workload even for unittesting. This change frees us from doing that.
- `aten::add` is excluded from this check since it's only enabled for an internal workload. Also some unittests are excluded by using `expect_interpreter_op = true` since they are written to use interpreter ops by design.
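A rough sketch of the kind of assertion this adds (the helper and accessor names are assumptions, not the actual test code):
```
#include <gtest/gtest.h>

// Hypothetical post-setup check over the prepared graph:
void checkOpsWereReplaced(StaticRuntime& runtime, bool expect_interpreter_op) {
  if (expect_interpreter_op) {
    return; // some tests exercise interpreter ops by design
  }
  for (const auto& pnode : runtime.nodes()) {
    // Every node must have an out variant or an SR native implementation
    // (aten::add excluded, as noted above).
    EXPECT_TRUE(pnode.has_out_variant() || isNativeOp(pnode.node()));
  }
}
```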
Test Plan: Ran `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest` successfully.
Reviewed By: mikeiovine, hlu1
Differential Revision: D29952381
fbshipit-source-id: e60e70b80ccf45e91c6654b4ad53f92ffd5ab702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505
The handling of `self` in static runtime was previously incorrect. This diff fixes that, which matters because `self` is essential to prim::GetAttr/SetAttr: most of the time we're getting and setting attributes on `self`, the TorchScript module.
Reviewed By: ajyu
Differential Revision: D29350173
fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59603
D28698997 (10345010f7) was reverted because I forgot to replace the
```
VLOG(1) << "Found schema mismatch";
n->schema().dump();
```
block in `aten::clamp_min` with `LogAndDumpSchema(n)`, and that caused the bazel build to fail. I don't know why it makes the bazel build fail, though.
Test Plan: OSS CI.
Reviewed By: ajyu
Differential Revision: D28950177
fbshipit-source-id: 9bb1c6619e6b68415a3349f04933c2fcd24cc9a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58067
- Use expect_contiguous in layer_norm to avoid unnecessary refcount bumps when the tensors are contiguous
- Clean up some leftovers from the hacky wrapper removal: use c10::MaybeOwned<Tensor> for bias tensors
- Skip dispatcher for at::empty in the layer_norm impl in Static Runtime
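A small sketch of the `expect_contiguous` pattern (the wrapping function is invented; the mechanism is real):
```
#include <ATen/ATen.h>

at::Tensor use_contiguous(const at::Tensor& input) {
  // MaybeOwned borrows `input` when it is already contiguous (no refcount
  // bump); otherwise it owns a freshly materialized contiguous tensor.
  c10::MaybeOwned<at::Tensor> contig = input.expect_contiguous();
  return contig->mul(2); // dereference like a pointer to use the tensor
}
```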
Test Plan: CI
Reviewed By: swolchok
Differential Revision: D28214298
fbshipit-source-id: 73150fa62d5c18f41a2264f8e56bbe5e377ad045
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58100
aten::clone has a second arg, memory_format, which was not previously supported.
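For reference, the schema in question and a usage sketch (the 4-D tensor is just to make the memory format meaningful):
```
// aten::clone(Tensor self, *, MemoryFormat? memory_format=None) -> Tensor
#include <ATen/ATen.h>

int main() {
  at::Tensor t = at::rand({1, 3, 8, 8});
  at::Tensor c = t.clone(at::MemoryFormat::ChannelsLast); // the second arg
  return c.is_contiguous(at::MemoryFormat::ChannelsLast) ? 0 : 1;
}
```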
Reviewed By: ajyu
Differential Revision: D28347171
fbshipit-source-id: e083cc24c3228048429bba3497326415bc3d1f5a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521
When an op is added to static runtime, we manually check the schema (not with the jit schema check, but with IValue::isTensor()/isInt() etc.) and make sure it's one we support. If the schema doesn't match, SR throws an exception via TORCH_CHECK, which makes the entire graph invalid for SR.
This diff makes ops with unsupported schemas take the fallback path and go through the dispatcher instead:
```
if (node->kind() != prim::ListConstruct &&
node->kind() != prim::TupleConstruct &&
node->kind() != prim::DictConstruct && node->kind() != prim::ListUnpack) {
const Operator& op = node->getOperator();
TORCH_CHECK(op.hasOperation());
op_ = op.getOperation(node);
VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
}
```
The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg forms are supported), can now run in static runtime in fallback mode.
Reviewed By: ajyu
Differential Revision: D27531447
fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57282
Added support for `fb::expand_dims` to SR.
Test Plan:
buck test caffe2/torch/fb/sparsenn:gpu_test -- test_expand_dims
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
Reviewed By: hlu1
Differential Revision: D28043049
fbshipit-source-id: 01f59db7b507f027b220f044d6ff23602adbdb06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447
MemoryPlanner shouldn't manage StorageImpls; instead, it should manage TensorImpls, because a Tensor's StorageImpl can change.
Test Plan: CI
Reviewed By: ajyu
Differential Revision: D27840361
fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55337
`static_runtime::permute_copy` lives in an fb-only folder. Because `caffe2/test/test_static_runtime.py` is in OSS, we can't load the fb-only operator library there. The workaround is to check at runtime whether the op is registered.
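A hedged C++ mirror of the idea (the actual workaround lives in the Python test; this sketch is illustrative only):
```
#include <torch/csrc/jit/runtime/operator.h>

// True iff any overload of the fb-only op is registered in this binary.
bool isPermuteCopyRegistered() {
  const auto ops = torch::jit::getAllOperatorsFor(
      c10::Symbol::fromQualString("static_runtime::permute_copy"));
  return !ops.empty();
}
```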
Test Plan:
This fixed two of the broken tests:
```
✓ Pass: caffe2/test:static_runtime - test_multihead_attention_layer (test_static_runtime.TestStaticModule) (10.316)
✓ Pass: caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule) (16.134)
```
Reviewed By: ajyu
Differential Revision: D27577066
fbshipit-source-id: ac87dcde71f0d5140ccde448bb49aaebbbb5908a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53588
Remove `SRViewOperatorRegistry` and related code now that it's no longer needed.
Reviewed By: swolchok
Differential Revision: D26901367
fbshipit-source-id: fa73501cd785d4b89466cda81481aea892f8241f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564
Constructor logic was spread across InferenceModule and StaticRuntime; this diff unifies the two. After a lot of discussion on D25961626, it became apparent that `clone` is uglier than a cheap StaticRuntime.
This means the old StaticRuntime is effectively StaticModule, and the only code in the new StaticRuntime is the `run` functions.
```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```
Changelist:
- Rename InferenceModule to StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).
Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D25592967
fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53333
- Add more variants to `create_empty_from` to take more args, like dtype/layout/device.
- Clean up stray at::empty uses, mostly in the out variants.
Reviewed By: ajyu
Differential Revision: D26799900
fbshipit-source-id: 6676d8043fead63208913ef3a28cabbae76e46bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53216
- at::native::empty_cpu calls at::detail::empty_cpu without any changes to the arguments, so we can call at::detail::empty_cpu directly.
- There is no need to create a TensorOptions object first since we can get all the relevant information from the tensor directly.
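A sketch of the resulting call shape (treat the exact overload and argument order as assumptions; `self` stands in for the tensor whose properties we copy):
```
// Instead of building a TensorOptions and going through at::native::empty_cpu,
// read the fields off the existing tensor and call the detail helper directly:
at::Tensor out = at::detail::empty_cpu(
    /*size=*/{0},
    self.scalar_type(),
    self.layout(),
    self.device(),
    /*pin_memory=*/c10::nullopt,
    /*memory_format=*/c10::nullopt);
```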
Reviewed By: bertmaher, swolchok
Differential Revision: D26792255
fbshipit-source-id: 7a4e368a19cea79e136e34dab854cb1d37dbeb58
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684
With alias analysis we get much more powerful registration and we can start removing "native" and fallback interpreted implementations. `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations. Ideally every node will run out of place every time. Afaik, there's never a reason to disable it and we may want to remove that functionality.
This diff does introduce a "leak" in the memory management: containers are not cleaned up. This only happens when out variants are enabled.
Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled
Reviewed By: maratsubkhankulov, hlu1
Differential Revision: D26515801
fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug with the MemoryPlanner with regard to view ops with out variant.
```
def forward(self, a: Tensor, shape: List[int]):
b = a.reshape(shape)
return b + b
```
In this case, if we replace reshape with its out variant, `b` is managed by the MemoryPlanner, and its storage is set to nullptr right after inference when opts.cleanup_activations is true. Because `b` is a view of `a`, the storage of `a` is also set to nullptr, which violates the API's promise that `a` is const.
To fix this bug, I changed the MemoryPlanner so that it puts `b` in the unmanaged part.
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249
- Add out variants for reshape and flatten. These ops only create tensor views when they can; when they can't, they do a copy. The out variant reuses the TensorImpl in both cases; the difference is that the TensorImpl is a view in the first case and a normal TensorImpl in the second.
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage; however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor. An illustration of the underlying view-or-copy behavior follows.
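For illustration, reshape's view-or-copy behavior in plain ATen (this is the behavior the out variant mirrors, not the out variant itself):
```
#include <ATen/ATen.h>

int main() {
  at::Tensor a = at::rand({2, 3});
  at::Tensor v = a.reshape({6}); // contiguous input: reshape returns a view
  at::Tensor t = a.t();          // transposed, non-contiguous
  at::Tensor c = t.reshape({6}); // no compatible strides: reshape copies
  bool viewed = v.data_ptr() == a.data_ptr();
  bool copied = c.data_ptr() != t.data_ptr();
  return (viewed && copied) ? 0 : 1;
}
```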
Reviewed By: ajyu
Differential Revision: D25992202
fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
Summary:
We do a lot of resize_({0}) to force `out` operators to properly
resize their results, and `resize_` does a fair bit of extraneous work
(e.g. trip through dispatch, checks for memory_format and named tensors, etc.).
If we strip it down to the bare minimum it's just setting the sizes to 0, so
let's do that directly.
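A minimal sketch of the stripped-down version (helper name assumed):
```
#include <ATen/ATen.h>

// Skips dispatch and resize_'s memory_format / named-tensor checks: for the
// "force the out op to resize" idiom we only need the sizes set to zero.
inline void fastResizeToZero(at::Tensor& t) {
  t.unsafeGetTensorImpl()->set_sizes_contiguous({0});
}
```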
Test Plan:
Perf results suggest maybe a 1% win:
```
batch 20: P163138256 (large win, 1.7%, mostly in fb_fc_out)
batch 1: P163139591 (smaller win, 0.88%, mostly in resize_)
```
Reviewed By: swolchok
Differential Revision: D25932595
fbshipit-source-id: d306a0a15c0e1be12fde4a7f149e3ed35665e3c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50395
There's no need for these to be `std::function`.
ghstack-source-id: 119684828
Test Plan: CI
Reviewed By: hlu1
Differential Revision: D25874187
fbshipit-source-id: e9fa3fbc0dca1219ed13904ca704670ce24f7cc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50050
Every node will now own its outputs.
I don't expect any big perf improvements from this diff; the only eliminated code is from deallocate_registers.
Largely, this is to enable more optimizations going forward.
Test Plan:
buck test mode/dev //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/test:static_runtime
Reviewed By: hlu1
Differential Revision: D25571181
fbshipit-source-id: 91fcfbd5cd968af963ba89c45656997650ca6d18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48616
This adds a couple of _out variants and registers them in the registry.
I also added the concept of "canReuse{Input,Output}" so that we can annotate tensors that are not optimizable (specifically, non-float tensors).
In the future we can change this (see D25062301).
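A hypothetical sketch of the annotation's shape (signatures assumed for illustration):
```
// Only float tensors participate in the planner's buffer reuse.
bool canReuseInput(const at::Tensor& t) {
  return t.scalar_type() == at::kFloat;
}
bool canReuseOutput(const at::Tensor& t) {
  return t.scalar_type() == at::kFloat;
}
```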
After removing `RecordFunction`, we see these results:
```
BS=20
---
caffe2: 0.651617 ~ 0.666354
static runtime: 0.753481
pytorch: 0.866658
BS=1
---
caffe2: 0.0858684 ~ 0.08633
static runtime: 0.209897
pytorch: 0.232694
```
Test Plan: standard internal test of ads model against caffe2 reference (see the scripts in this quip: https://fb.quip.com/ztERAYjuzdlr)
Reviewed By: hlu1
Differential Revision: D25066823
fbshipit-source-id: 25ca181c62209a4c4304f7fe73832b13e314df80