Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug in the MemoryPlanner with regard to view ops that have out variants.
```
def forward(self, a: Tensor, shape: List[int]):
b = a.reshape(shape)
return b + b
```
In this case, if we replace reshape with the out variant, b is managed by the MemoryPlanner, and its storage is set to nullptr right after inference when opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.
To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged set of tensors.
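To make the classification concrete, here is a minimal sketch of the idea, using made-up names (`ManagedLists`, `plan_output`, `is_view_out_variant`) rather than the actual MemoryPlanner internals:
```
#include <ATen/ATen.h>
#include <vector>

struct ManagedLists {
  std::vector<at::Tensor> managed;    // storage may be reset after inference
  std::vector<at::Tensor> unmanaged;  // left untouched (views of other tensors)
};

// `is_view_out_variant` would come from the registry of view ops with out
// variants introduced in D25992202.
void plan_output(const at::Tensor& out, bool is_view_out_variant, ManagedLists& lists) {
  if (is_view_out_variant) {
    // b = a.reshape(shape): b shares a's StorageImpl, so resetting it would
    // also wipe a and break the promise that graph inputs are const.
    lists.unmanaged.push_back(out);
  } else {
    lists.managed.push_back(out);
  }
}
```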
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249
- Add out variants for reshape and flatten. reshape and flatten only create tensor views when they can; when they can't, they do a copy. The out variant reuses the same TensorImpl in both cases; the difference is that the TensorImpl is a view in the first case and a normal TensorImpl in the second (a hedged sketch of the idea follows after this list).
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own their storage; however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.
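To illustrate, a hedged sketch of what such an out variant could look like; `reshape_out_sketch` is a made-up name, not the registered op, and the real implementation avoids the intermediate tensor and the extra copy:
```
#include <ATen/ATen.h>

// Assumes `out` was created up front as an empty tensor, as the static
// runtime does for outputs of out-variant ops.
at::Tensor& reshape_out_sketch(at::Tensor& out, const at::Tensor& self,
                               at::IntArrayRef shape) {
  // at::Tensor::reshape already decides view-vs-copy for us.
  at::Tensor result = self.reshape(shape);
  if (result.is_alias_of(self)) {
    // View case: point `out` at self's storage with the new sizes/strides.
    out.set_(result);
  } else {
    // Copy case: keep `out` as an ordinary tensor holding its own data.
    // (The real op is more careful when `out` switches between the view and
    // copy cases, and does not materialize the intermediate copy.)
    out.resize_(result.sizes());
    out.copy_(result);
  }
  return out;
}
```
Reusing the same `out` TensorImpl across iterations is what lets the MemoryPlanner keep tracking the output regardless of which case was taken.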
Reviewed By: ajyu
Differential Revision: D25992202
fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48718
This PR rewrites structured kernels to use the class-based mechanism (instead of defining a meta and an impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, `add` is made a structured kernel. Don't forget to check https://github.com/pytorch/rfcs/pull/9 for a mostly up-to-date high-level description of what's going on here.
High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes. First, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output: it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow is that someone calls the subclassed set_output, which allocates the output, and then calls the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator that are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator-based kernels.
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old-style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc. won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future as we would like to specialize this logic per device. (A rough, hand-written sketch of what the generated classes look like follows after this file list.)
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - `add` is ported to structured.
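For orientation, here is a rough hand-written sketch of the shape of the classes the codegen emits for `add`. All identifiers below are illustrative stand-ins, not the actual generated names, and broadcasting/type promotion via TensorIteratorBase is omitted:
```
#include <ATen/ATen.h>

namespace sketch {

// Corresponds to MetaBase: the meta function asks the framework for its
// output via set_output instead of returning a TensorMeta.
struct MetaBase {
  virtual void set_output(int64_t idx, at::IntArrayRef sizes,
                          const at::TensorOptions& options) = 0;
  virtual const at::Tensor& maybe_get_output(int64_t idx) = 0;
  virtual ~MetaBase() = default;
};

// "meta" class for add: computes the output shape/dtype and calls set_output.
struct structured_add_meta : MetaBase {
  void meta(const at::Tensor& self, const at::Tensor& other) {
    // The real add goes through TensorIteratorBase for broadcasting and type
    // promotion; here we just assume matching shapes for brevity.
    set_output(0, self.sizes(), self.options());
  }
};

// CPU "impl" class: the codegen'd subclass overrides set_output to actually
// allocate, then the kernel writes into the pre-allocated output.
struct structured_add_out_cpu final : structured_add_meta {
  at::Tensor output_;
  void set_output(int64_t /*idx*/, at::IntArrayRef sizes,
                  const at::TensorOptions& options) override {
    output_ = at::empty(sizes, options);
  }
  const at::Tensor& maybe_get_output(int64_t /*idx*/) override { return output_; }

  void impl(const at::Tensor& self, const at::Tensor& other) {
    at::add_out(output_, self, other);  // stand-in for the real CPU kernel body
  }
};

// The generated wrapper then amounts to: run meta (which allocates), run impl.
at::Tensor add_sketch(const at::Tensor& a, const at::Tensor& b) {
  structured_add_out_cpu op;
  op.meta(a, b);
  op.impl(a, b);
  return op.output_;
}

}  // namespace sketch
```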
TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs are no longer generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D25278031
Pulled By: ezyang
fbshipit-source-id: 57c43a6e5df21929b68964d485995fbbae4d1f7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46896
The idea of the memory model is quite similar to that of BlackBoxPredictor; however, it's more complicated in PyTorch due to 1) tensor views that share storage (with storage refcount bumps) but have different TensorImpls, 2) tensors that share the same TensorImpl and the same storage, with no refcount bump of the StorageImpl, 3) data types such as TensorList and Tuple that contain Tensors, and 4) the need to support a mix of out and non-out variants while we move the aten ops to out variants.
As a result, I have to make the following adjustments:
1) remove tensors in output Tuples from the internal blob list;
2) for memory allocation/deallocation, get candidate Tensors from the outputs of ops with out variants, extract StorageImpls from those Tensors, dedup, remove the StorageImpls of output tensors, and get the final list of blobs for memory planning (a sketch of this step follows after this list);
3) during the clean_up_memory pass, clean up memory held by the StorageImpls, as well as Tensors/Lists/Tuples in IValues that don't participate in memory planning, to reduce overall memory usage.
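As a rough illustration of step 2), here is a sketch with made-up helper names (the real logic lives inside the MemoryPlanner):
```
#include <ATen/ATen.h>
#include <unordered_set>
#include <vector>

std::vector<c10::StorageImpl*> collect_managed_storages(
    const std::vector<at::Tensor>& out_variant_outputs,
    const std::vector<at::Tensor>& graph_outputs) {
  // Graph outputs escape the runtime, so their storage must not be reused.
  std::unordered_set<c10::StorageImpl*> excluded;
  for (const auto& t : graph_outputs) {
    excluded.insert(t.storage().unsafeGetStorageImpl());
  }
  // Dedup: several tensors (e.g. views) can point at the same StorageImpl.
  std::unordered_set<c10::StorageImpl*> seen;
  std::vector<c10::StorageImpl*> managed;
  for (const auto& t : out_variant_outputs) {
    c10::StorageImpl* s = t.storage().unsafeGetStorageImpl();
    if (seen.insert(s).second && excluded.count(s) == 0) {
      managed.push_back(s);  // candidate for memory planning / reuse
    }
  }
  return managed;
}
```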
Risk:
PyTorch team is planning to deprecate the current resize_output api, which we do rely on. This is a pretty big risk.
https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/aten/src/ATen/native/Resize.cpp?commit=6457b329847607553d34e788a3a7092f41f38895&lines=9-23
Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Benchmarks:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=false
```
|pt_cleanup_activations |pt_enable_out_variant |old ms/iter |new ms/iter |
|--- |--- |--- |--- |
|0 |0 |0.31873 |0.30228 |
|0 |1 |0.30018 |0.29184 |
|1 |0 |0.35246 |0.31895 |
|1 |1 |0.35742 |0.30417 |
Reviewed By: bwasti, raziel
Differential Revision: D24471854
fbshipit-source-id: 4ac37dca7d2a0c362120a7f02fd3995460c9a55c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219
- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
explicit InferenceModule(const torch::jit::Module& m);
explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
torch::jit::Module module;
std::shared_ptr<torch::jit::Graph> graph;
std::unique_ptr<c10::FunctionSchema> schema;
std::unordered_map<Value*, size_t> value_to_reg;
std::vector<size_t> input_regs; // inputs to the graph
std::vector<size_t> output_regs; // outputs of the graph
std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor as well as in the static runtime, and is shared across threads. This is what's left inside the StaticRuntime:
```
mutable std::vector<IValue> reg_;
// The nodes we need to run
std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations, and its contents differ across threads during execution. `nodes_` holds the op nodes and their input/output registers, and is the same across threads for now. We could potentially put other stateful data structures in it, so I kept it inside the static runtime; it could easily be moved into the `InferenceModule` if we decide not to put anything else into `ProcessedNode`.
- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.
- Integration with PyTorchPredictor. Added a lockfree stack to the PyTorchPredictor to hold all the static runtime instances; benchmarks show that the `push`/`pop` combo takes about 80 ns, which is quite acceptable (a simplified sketch of this pattern follows below).
This diff focuses on threading model only. Benchmarks will be separate.
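A simplified sketch of the resulting threading model; all names here are stand-ins, and the real PyTorchPredictor uses a lockfree stack rather than the mutex shown below:
```
#include <memory>
#include <mutex>
#include <vector>

struct InferenceModule {};  // stand-in: graph, schema, register layout; shared, read-only
struct StaticRuntime {      // per-thread: reg_ (weights + activations), nodes_
  explicit StaticRuntime(std::shared_ptr<InferenceModule> m) : module_(std::move(m)) {}
  std::shared_ptr<InferenceModule> module_;
};

class RuntimePool {
 public:
  explicit RuntimePool(std::shared_ptr<InferenceModule> m) : module_(std::move(m)) {}

  // Called per request: reuse an idle runtime or lazily create one.
  std::unique_ptr<StaticRuntime> acquire() {
    std::lock_guard<std::mutex> g(mu_);
    if (!idle_.empty()) {
      auto rt = std::move(idle_.back());
      idle_.pop_back();
      return rt;
    }
    return std::make_unique<StaticRuntime>(module_);
  }

  // Called after the request finishes; the pop/push pair is the ~80 ns cost
  // mentioned above in the real lockfree version.
  void release(std::unique_ptr<StaticRuntime> rt) {
    std::lock_guard<std::mutex> g(mu_);
    idle_.push_back(std::move(rt));
  }

 private:
  std::shared_ptr<InferenceModule> module_;
  std::vector<std::unique_ptr<StaticRuntime>> idle_;
  std::mutex mu_;
};
```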
Reviewed By: bwasti
Differential Revision: D24237078
fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308
This PR adds a hand-optimized version of the DeepAndWide model with the goal of estimating the overheads of the static runtime. While the static runtime is currently much faster than the existing JIT interpreter, it would be useful to understand how close we are to an absolutely zero-overhead system. Currently, this "ideal" implementation is 2x faster than the static runtime at batch size 1.
Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 4096K (x24)
L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1 59518 ns 59500 ns 10909
BM_deep_wide_base/8 74635 ns 74632 ns 9317
BM_deep_wide_base/20 82186 ns 82147 ns 9119
BM_deep_wide_fast/1 13851 ns 13851 ns 49825 << new
BM_deep_wide_fast/8 22497 ns 22497 ns 32089 << new
BM_deep_wide_fast/20 23868 ns 23841 ns 31184 << new
BM_deep_wide_jit_graph_executor/1 62786 ns 62786 ns 10835
BM_deep_wide_jit_graph_executor/8 76730 ns 76718 ns 7529
BM_deep_wide_jit_graph_executor/20 78886 ns 78883 ns 8769
BM_deep_wide_jit_profiling_executor/1 69504 ns 69490 ns 10309
BM_deep_wide_jit_profiling_executor/8 75718 ns 75715 ns 9199
BM_deep_wide_jit_profiling_executor/20 75364 ns 75364 ns 9010
BM_deep_wide_static/1 40324 ns 40318 ns 17232
BM_deep_wide_static/8 50327 ns 50319 ns 13335
BM_deep_wide_static/20 53075 ns 53071 ns 12855
BM_deep_wide_static_threaded/threads:8 6258 ns 49873 ns 14008
```
PS: The implementation could probably be optimized even more.
Differential Revision: D24300702
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Pulled By: ZolotukhinM
fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647
Nothing fancy, just a basic implementation of the graph executor that doesn't use the stack machine.
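A minimal sketch of what "no stack machine" means in practice, using made-up types rather than the actual ProcessedNode/StaticRuntime code: each node is pre-resolved to a function plus fixed register indices, and execution is a flat loop over the nodes.
```
#include <ATen/core/ivalue.h>
#include <cstddef>
#include <functional>
#include <vector>

struct ProcessedNodeSketch {
  std::function<void(std::vector<c10::IValue>&)> run;  // reads/writes registers
};

std::vector<c10::IValue> run_graph(
    std::vector<ProcessedNodeSketch>& nodes,
    std::vector<c10::IValue>& registers,          // reg_: weights + activations
    const std::vector<size_t>& input_regs,
    const std::vector<size_t>& output_regs,
    const std::vector<c10::IValue>& inputs) {
  for (size_t i = 0; i < inputs.size(); ++i) {
    registers[input_regs[i]] = inputs[i];          // bind graph inputs
  }
  for (auto& n : nodes) {
    n.run(registers);                              // no value stack, no bytecode dispatch
  }
  std::vector<c10::IValue> outputs;
  for (size_t r : output_regs) {
    outputs.push_back(registers[r]);               // gather graph outputs
  }
  return outputs;
}
```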
Reviewed By: bwasti
Differential Revision: D23208413
fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356