Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966
It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this:
```
def forward(self, cond: bool, a, b):
lst = []
if cond:
res = a + b # res should not be managed!!!
lst.append(res)
return lst
```
The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`.
The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`.
This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy.
ghstack-source-id: 153288188
Test Plan: New unit tests cover the problematic case
Reviewed By: navahgar
Differential Revision: D35257087
fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7
(cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927
The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated.
Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue.
ghstack-source-id: 153168602
Test Plan: New unit test/CI
Reviewed By: navahgar
Differential Revision: D35228467
fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3
(cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255
This change fixes a bug that `aten::full_like` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full_like` are dynamically changed.
Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path.
Reviewed By: mikeiovine
Differential Revision: D34863639
fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608
(cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)
Summary:
The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter.
At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear if that semantics is being followed in this code. So, we are handling this in SR's out-variant implementation, until the owner fixes that issue.
This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model.
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16
```
Reviewed By: mikeiovine
Differential Revision: D35366309
fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9
(cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74987
Add specializations to `prim::If` operator at runtime to save resources when some of subblocks are empty
Test Plan:
`buck build //caffe2:torch-cpp-cpu`
`buck test //caffe2/benchmarks/static_runtime/...`
Add unit test:
`buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.EmptyIfBlock`
Reviewed By: mikeiovine
Differential Revision: D35262952
fbshipit-source-id: 324f88471f33f035f4d8a9b212716530d8e59df2
(cherry picked from commit 2db1b1a6833b1376fa376f54791effc8e12fb77f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74585
Native static runtime for `aten::reshape_as`
ghstack-source-id: 152340038
Test Plan: New unit test
Reviewed By: hlu1
Differential Revision: D35060895
fbshipit-source-id: c4e6f8a04c7df3821c7e654bfaf584e5a72ea701
(cherry picked from commit 6fa596cd866a024b6653239e0e30ddad42de242f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74562
Add a native implementation for `aten::IntImplicit`, which is similar to `aten::Int` except for a few extra checks it must do
ghstack-source-id: 152340039
Test Plan: New unit tests
Reviewed By: hlu1
Differential Revision: D35052997
fbshipit-source-id: cb2f0faf7c62382e3f13750d8e1280c49c6b9e42
(cherry picked from commit 359c7493f8deaeccebc27e1b6e6e9777850010c1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481
This diff fixes an interesting performance issue related to `permute_copy`.
We see this pattern frequently:
```
y = torch.permute(x, (0, 2, 1))
z = torch.sum(y, dim=-1)
```
With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589
But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597
But the permute is actually unnecessary, we can just statically turn the graph into this to ensure that the fast kernel is hit with copy variants on:
```
z = torch.sum(x, dim=1)
```
ghstack-source-id: 152003888
Reviewed By: navahgar
Differential Revision: D34992319
fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f
(cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606
The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116
Test Plan: Existing unit tests
Reviewed By: hlu1
Differential Revision: D34562131
fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990
This change fixes a bug that `aten::full` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full` are dynamically changed.
This fix is applied to multiple other out variant wrappers added to Static Runtime, and their fixes are following.
Test Plan: - Added a unittest.
Reviewed By: mikeiovine
Differential Revision: D34768718
fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9
(cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945
This change adds add out variant wrapper for aten::ones_like.
Test Plan:
- Added a unittest.
- Checked that the op execution got switched to its added out variant (P485330978).
Reviewed By: hlu1
Differential Revision: D34727057
fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5
(cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73946
This change adds an out variant wrapper for aten::zeros.
Test Plan:
- Added a unittest.
- Confirmed that the added out variant gets executed by the unittest (P485324923).
Reviewed By: mikeiovine
Differential Revision: D34725843
fbshipit-source-id: 3ac02ba1914c4a51969381e610d4243df65071ed
(cherry picked from commit 368836d51709b7f96c79114984a95606b29766b1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73704
Empty inputs are invalid for these ops. But while looking for optimizations, I noticed that these ops just segfault when that happens, which is not helpful for users. Added a check/error message.
ghstack-source-id: 150812721
Test Plan: New unit tests
Reviewed By: hlu1
Differential Revision: D34596954
fbshipit-source-id: 6b22a3a255273920210dcd41f54a9d238bbbcc14
(cherry picked from commit 9e950bfffef36c320638662bdb72f19eb805a228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73851
This change adds an out variant wrapper for aten::ones
Test Plan: Added a unittest
Reviewed By: mikeiovine
Differential Revision: D34557095
fbshipit-source-id: 0d2ac8d0ad6f73067e28c2cebd3b4a018a9b17ae
(cherry picked from commit cc1dda957b8c3acd71de3aa6054c11a9aab5cfa6)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681
Static runtime is rejecting legal calls made with the kwargs API when there are parameters with default values.
ghstack-source-id: 150433627
Test Plan: Added unit test to cover this case
Reviewed By: navahgar, d1jang
Differential Revision: D34588804
fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755
(cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536
Currently `StaticNodeInfo` class assumes 2 distinct roles that are not too obvious:
1) "template" that contains metadata of an actual executable node by runtime. owned by `StaticModule`
2) fully instanced ones that are owned by `StaticRuntime`.
We currently merge these two usecases into one class, that can be error-prone in case illegal copying happens uncontrollably. Currently, we only copy objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created.
To address ths issue, this change introduces `StaticNodeInfo`, a separate class, to distinguishes the aforementioned two usecases in the code more clearly. With this `StaticNodeInfo` is for (1) and `ProcessedNode` is now for (2).
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D33985600
fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c
(cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73574
T113070663 identified a test breakage in `StaticRuntime.autogen__convert_indices_from_csr_to_coo` using `mode/opt`. This change fixes it by using a correct test value.
By generating the out variants for Static Runtime, this change also includes an out variant wrapper for `mse_loss`.
**Generating out variants for Static Runtime**
```
[djang@devbig024.ftw2 ~/fbsource/fbcode/caffe2] buck run //caffe2/torch/fb/jit:gen_static_runtime_ops
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
Changed value //fbcode.sanitizer='address-undefined-dev' (was 'thread')
... and 13 more. See logs for all changes
Parsing buck files: finished in 0.8 sec
Downloaded 14/25 artifacts, 159.45 Kbytes, 30.0% cache miss (for updated rules)
Building: finished in 1.6 sec (100%) 52/52 jobs, 35/52 updated
Total time: 2.5 sec
BUILD SUCCEEDED
total grouped native ops: 1501
structured grouped native ops: 540
generated grouped native ops: 137
```
Test Plan:
Ran the broken test in `mode/opt` and confirmed that it passes now.
```
[djang@devbig024.ftw2 ~/fbsource/fbcode/caffe2] buck test mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --exact 'caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.autogen__convert_indices_from_csr_to_coo' --run-disabled
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
Changed value //project.buck_out='buck-out/opt' (was 'buck-out/dev')
... and 307 more. See logs for all changes
DEBUG: /data/users/djang/fbsource/tools/build_defs/fbcode_macros/build_defs/lib/cpp_common.bzl:287:14: Using disallowed linker flag 'arvr/third-party/toolchains/platform009/build/mesa/lib/libGL.so' in library rule 'fbsource//third-party/toolchains:opengl'
DEBUG: /data/users/djang/fbsource/tools/build_defs/fbcode_macros/build_defs/lib/cpp_common.bzl:287:14: Using disallowed linker flag 'arvr/third-party/freeglut/3.0.0/libs/x64-linux/libglut.a' in library rule 'fbsource//third-party/toolchains:GLUT'
I0301 08:28:08.884272 2239319 configeratorc.cpp:70] Attempting to get config buck/detectors/bypass_dirty_builds, timeout=10000
I0301 08:30:14.751745 2261718 configeratorc.cpp:70] Attempting to get config buck/detectors/bypass_dirty_builds, timeout=10000
Parsing buck files: finished in 10.1 sec
Creating action graph: finished in 6.1 sec
[RE] Metadata: Session ID=[https://fburl.com/b/reSessionID-fa0ba93b-33a1-4e6f-88f8-9f508d2c27c3]
[RE] Waiting on 0 remote actions. Completed 247 actions remotely, action cache hit rate: 0.00%.
Downloaded 13000/17457 artifacts, 463.99 Mbytes, 2.6% cache miss (for updated rules)
Building: finished in 04:16.6 min (100%) 28628/28628 jobs, 28628/28628 updated
Total time: 04:32.9 min
More details at https://www.internalfb.com/intern/buck/build/c774ff43-5311-49ce-a677-30e3f6afdad1
BUILD SUCCEEDED
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 16d9b24c-4a63-4671-84b5-690fac0ee086
Trace available for this run at /tmp/tpx-20220301-083049.472831-16d9b24c-4a63-4671-84b5-690fac0ee086/trace.log
RemoteExecution session id: reSessionID-16d9b24c-4a63-4671-84b5-690fac0ee086-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4503599719295685
✓ ListingSuccess: caffe2/benchmarks/static_runtime:static_runtime_cpptest : 285 tests discovered (0.425)
✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.autogen__convert_indices_from_csr_to_coo (0.105)
Summary
Pass: 1
ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4503599719295685
```
Reviewed By: mikeiovine
Differential Revision: D34552645
fbshipit-source-id: 36f15b0f29edcb7deb71ba8a6f66ce2532bf7c82
(cherry picked from commit 2329afd8bfc89671cfbd864414e528241e7045fc)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73256
This adds a test that executes multiple Static Runtime instances in parallel
when each instances includes a fusion.
ghstack-source-id: 149787403
Test Plan:
```
buck run mode/dev-asan //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=CpuFusion.ParallelRuntimes
```
The above test results in an error: P482317015 (when parts of the fix in D34287960 (6d33852685) are backed out)
Reviewed By: mikeiovine
Differential Revision: D34404127
fbshipit-source-id: 95a267e27d74584df90841fe496f909171136981
(cherry picked from commit 57d3ad9a46a24559f6d4f4097bd1b8e0b1f6b077)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72603
This change adds out variant dispatchers generated by the previous diff.
The number of the out variant dispatchers generated by this diff is 133, which increases the out variant coverage by 309% (current: 43, this diff: 133 + 43 = 176). This number is expected to increase a lot as we develop this script further to cover more ops.
Test Plan:
**Unittest**
Confirmed
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
is passing.
Reviewed By: swolchok
Differential Revision: D33373928
fbshipit-source-id: 4d94d788282f3f313bb36f2f9452edecd9862246
(cherry picked from commit e4ce8b386d1fcc47b86cb9c9016a70e7a31b452c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587
This pattern frequently appears in a few graphs:
```
%result = prim::If(%condition)
block0():
-> (%a)
block1():
-> (%b)
```
This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.
This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:
```
%result = prim::IfThenElse(%condition, %a, %b)
```
Test Plan: New unit tests
Reviewed By: eellison
Differential Revision: D34091789
fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72598
Apply all optimizations to sub-blocks by replacing loops over `graph->nodes()` with loops over nodes in `DepthFirstGraphNodeIterator`
ghstack-source-id: 149155700
Test Plan: Existing unit tests
Reviewed By: d1jang
Differential Revision: D34111430
fbshipit-source-id: 015076030368bb67df24ed5892475534b8f8f272
(cherry picked from commit a4314520de)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72544
Now that static runtime supports control flow, there's no need to fall back to the JIT. We get better performance with the native control flow since we avoid heap allocation/ref count bumps during stack construction.
I've left the old `prim::TensorExprDynamicGroup` around in case we need to support it in the future. I've also added native support for a few scalar ops that are used inside the control flow sub-blocks.
ghstack-source-id: 148825816
Test Plan: New unit tests
Reviewed By: d1jang
Differential Revision: D34083080
fbshipit-source-id: a7ffc0fda39ab3df3ba47e44a03d857131dc1e50
(cherry picked from commit 2ef39e0e54)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72530
This bug was revealed from a failed attempt to run a feed/story model.
Test Plan:
- This fix was tested to successfully run the failed model: P479037453
- Added a unittest
Reviewed By: mikeiovine
Differential Revision: D34055801
fbshipit-source-id: 4a3e06bbb3b9fa78b0514c9c67aa4a0b79f46a8d
(cherry picked from commit bfa2bfb81c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71474
The PyTorch edge team is working on promoting some prim ops to interpreter instructions (see D33398092). Since the JIT fallback ops will be unavailable soon, we need to implement these ops in static runtime.
Ops not included in this diff:
* `aten::__is__` and `aten::__isnot__`: disabled in static runtime for unrelated reasons
* `prim::NumToTensor` and `aten::__get__.Dict` already exist
ghstack-source-id: 148641179
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D33657816
fbshipit-source-id: 6d15244ae1024a56d3b25e51a433fa104ce8ee5e
(cherry picked from commit 33f8f861ff)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807
There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in
```
def forward(x):
y = x.detach()
# Should be false, but we get True
# after our EliminateNoOps pass
return x is y
```
Test Plan: New unit test covers this case
Reviewed By: d1jang
Differential Revision: D33783668
fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4
(cherry picked from commit 019588f4ca)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838
Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483
Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33049595
fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837
Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475
Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`
Accuracy test at top of stack
Reviewed By: d1jang
Differential Revision: D33045908
fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc08)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836
It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned.
Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`.
ghstack-source-id: 148186453
Test Plan: Sub-blocks with no return values tested at top of stack
Reviewed By: d1jang
Differential Revision: D33050420
fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149
(cherry picked from commit c97fd07bf5)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835
`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.
Note that this also applies to returned values in sub-blocks that are from outer scopes.
ghstack-source-id: 148186452
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
Added a new unit test with a graph that simply returns a constant.
Tests with sub-blocks at top of stack.
Reviewed By: d1jang
Differential Revision: D33047519
fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834
* Modify the `StaticModule` constructor to handle index initialization for sub-blocks.
* Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes!
ghstack-source-id: 148086245
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: d1jang
Differential Revision: D33028039
fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6
(cherry picked from commit 3a9feffd92)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501
This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it.
ghstack-source-id: 147385569
Test Plan: CI, rerun broken test from task
Reviewed By: mikeiovine
Differential Revision: D33669290
fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a
(cherry picked from commit 5e4fefb109)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71113
This diff adds a variety of missing ~~out variants~~/native ops. Most of these are trivial, so I included them all in one diff.
Native ops
* `aten::mul` (list variant)
* `aten::sub` (int variant)
* `aten::add` (list variant)
* `aten::Int`
Out variants
* ~~`aten::gt`~~ (codegen will handle)
* ~~`aten::eq`~~ (codegen will handle)
ghstack-source-id: 146927552
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D33510756
fbshipit-source-id: df385958b9561955b2e866dab2e4c050abd26766
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122
See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.
If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
ghstack-source-id: 146482212
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
--do_profile output for local_ro (updated Dec 10):
```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```
for local: (also Dec 10):
```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)
==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```
Reviewed By: hlu1
Differential Revision: D32318674
fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223
ghstack-source-id: 146482215
Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)
Reviewed By: hlu1
Differential Revision: D31776259
fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67394
ghstack-source-id: 146464294
Test Plan:
Added new test, which failed but now passes.
Checked perf on ctr_mobile_feed local net (still not on recordio inputs yet), looks neutral
```
Stable, local
========================================
I1027 13:40:23.411118 2156917 PyTorchPredictorBenchLib.cpp:131] PyTorch predictor: number of prediction threads 1
I1027 13:40:48.708222 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16975. Iters per second: 162.081
I1027 13:41:13.915948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.1487. Iters per second: 162.636
I1027 13:41:38.984462 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11408. Iters per second: 163.557
I1027 13:42:04.138948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.13566. Iters per second: 162.982
I1027 13:42:29.342630 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.14269. Iters per second: 162.795
I1027 13:42:29.342669 2156917 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14218, standard deviation: 0.0202164
0
FixToDtypeChanges, local
========================================
I1027 13:44:59.632668 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11023. Iters per second: 163.66
I1027 13:45:24.894635 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16308. Iters per second: 162.257
I1027 13:45:50.275280 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.17868. Iters per second: 161.847
I1027 13:46:15.637431 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.18688. Iters per second: 161.632
I1027 13:46:40.670816 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.10549. Iters per second: 163.787
I1027 13:46:40.670863 2176333 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14887, standard deviation: 0.03843706
```
Reviewed By: hlu1
Differential Revision: D31972722
fbshipit-source-id: 7a445b325a29020b31dd2bd61e4171ecc2793b15
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D33336948
Pulled By: albanD
fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70210
Add a fast-path for `VarStack` nodes for when the inputs are scalars.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarStack`
Reviewed By: hlu1
Differential Revision: D33177498
fbshipit-source-id: 922ab76a6808fbfdb8eb6091163a380344e38de6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69473
This diff refactors StaticModule and its uses to pass in sample inputs. These inputs need to be passed into the constructor because they are need to perform TensorExpr fusion before other optimizations are performed on the input graph.
ghstack-source-id: 146059041
Test Plan: buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
Reviewed By: donaldong
Differential Revision: D32320084
fbshipit-source-id: b8bd46d442be4cc90ca60f521e0416fdb88eea60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69999
This adds support for the split_with_sizes operator in static runtime by adding native operators. Those operators will have less overhead comparing to their JIT fallbacks (no dispatching, no stack constructing in runtime).
split_with_sizes can be called directly from cpp API, or in `torch.split` when `split_sizes` is a list. This diff adds support for both use cases.
Test Plan:
- Added unit tests. Made sure the operators are used
- Benchmark
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/305797439_0.predictor.precompute.remote_request_only \
--method_name=user.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=500 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/305797439_0_user.inputs.recordio \
--recordio_use_ivalue_format=1 --do_profile=1 --do_benchmark=1
```
#### Before
```
Static runtime ms per iter: 3.62073. Iters per second: 276.187
0.0471904 ms. 1.31501%. aten::split_with_sizes (5 nodes)
```
#### After
```
Static runtime ms per iter: 3.44374. Iters per second: 290.382
0.0432057 ms. 1.34276%. aten::split_with_sizes (5 nodes, native)
```
Reviewed By: swolchok
Differential Revision: D33141006
fbshipit-source-id: feae34c4c873fc22d48a8ff3bf4d71c0e00bb365
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68750
There was some room for optimization in static runtime's `prim::VarStack`:
* Avoid refcount bumps - constructing the `std::vector<at::Tensor>` can be avoided by writing a custom version of `stack_out` that takes a `std::vector<at::Tensor*>`
* Skip the memory overlap check
* Avoid device dispatcher overhead in a few places (e.g. `tensor.unsqueeze -> at::native::unsqueeze`)
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`
Reviewed By: swolchok
Differential Revision: D32596934
fbshipit-source-id: e8f0ccea37c48924cb4fccbfdac4e1e11da95ee0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691
TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.
This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.
Test Plan: Imported from OSS
Reviewed By: jbschlosser, malfet
Differential Revision: D32596264
Pulled By: albanD
fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69935
Didn't realize that `AT_DISPATCH_ALL_TYPES` should really be called `AT_DISPATCH_MOST_TYPES`.
ghstack-source-id: 145661358
Test Plan:
Added test for dtype bool.
Ran CMF local_ro net:
before:
```
I1215 12:33:49.300174 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.966491. Iters per second: 1034.67
I1215 12:33:49.825570 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.94867. Iters per second: 1054.11
I1215 12:33:50.349246 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947926. Iters per second: 1054.93
I1215 12:33:50.870433 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.943779. Iters per second: 1059.57
I1215 12:33:51.393702 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947185. Iters per second: 1055.76
I1215 12:33:51.915666 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.945672. Iters per second: 1057.45
I1215 12:33:52.438475 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948407. Iters per second: 1054.4
I1215 12:33:52.965337 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95472. Iters per second: 1047.43
I1215 12:33:53.494563 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.967083. Iters per second: 1034.04
I1215 12:33:54.017879 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948945. Iters per second: 1053.8
I1215 12:33:54.017930 1606538 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.951888, standard deviation: 0.0083367
```
after:
```
I1215 12:32:35.820874 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.999845. Iters per second: 1000.15
I1215 12:32:36.343147 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944363. Iters per second: 1058.91
I1215 12:32:36.863806 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.942542. Iters per second: 1060.96
I1215 12:32:37.385459 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944677. Iters per second: 1058.56
I1215 12:32:37.905436 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941135. Iters per second: 1062.55
I1215 12:32:38.424907 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.939748. Iters per second: 1064.11
I1215 12:32:38.944643 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941764. Iters per second: 1061.84
I1215 12:32:39.463791 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.938946. Iters per second: 1065.02
I1215 12:32:39.987567 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95437. Iters per second: 1047.81
I1215 12:32:40.511204 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.959139. Iters per second: 1042.6
I1215 12:32:40.511242 1594955 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.950653, standard deviation: 0.0184761
```
Reviewed By: hlu1
Differential Revision: D33106675
fbshipit-source-id: 5bb581f8d0ed22ef08df1936dc8d67045e44e862
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69087
This diff includes a variety of improvements to `set_inputs` to unify behavior with `torch::jit::Module`:
1. Eliminate code duplication between rvalue/lvalue overloads
2. Add type checks
3. Make input length check a `TORCH_CHECK` instead of a debug check - we have to fail when the wrong number of inputs are passed.
4. `schema` now always includes `self`, even if we release `module_`. This is consistent with `torch::jit::Module`.|
ghstack-source-id: 145599837
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32711705
fbshipit-source-id: fe97c10b4f03801ba59868b452e7d02b26b3106b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69819
We should skip ReplaceWithCopy if the inputs to the operator can be updated during inference. For a set of tensors that share data, ReplaceWithCopy should not happen to any of them if there exists updates to any of them.
Currently, the check in place has missed some cases (suppose there exists updates, and uses <= 1). This diff addresses the missing cases by querying AliasDB.
Test Plan:
- Added test cases, including a one that is problematic before this diff
- CI
Reviewed By: mikeiovine
Differential Revision: D33052562
fbshipit-source-id: 61f87e471805f41d071a28212f2f457e8c6785e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69568
Non-empty vectors should never be passed to `assignStorageToManagedTensors` and `assignStorageToManagedOutputTensors`. Presumably, this out-variant convention was adopted to avoid move-assigning the corresponding attribtues in `MemoryPlanner`. But the cost of a vector move-assign is not high, and this function type signature is safer.
Test Plan: `buck test caffe2/bechmarks/static_runtime:static_runtime_cpptest`
Reviewed By: donaldong
Differential Revision: D32729289
fbshipit-source-id: 88f19de8eb89d8a4f1dd8bbd4d9e7f686e41888b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595
This changes encapsulates `function` object in `ProcessedFunction` objects instead of exposing it unnecessarily just for executing it.
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D32908341
fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795
This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor.
Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions potentially escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, 2) even if it wasn't, `std::terminate` will be called if an exception escapes and the stack is already unwinding. To get around this, we wrap the deallocation stuff in a try/catch. If deallocation throws, then we simply reset all of the memory planner stuff and carry on.
There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here.
Test Plan:
**New unit tests**
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32609915
fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302
Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis.
Test Plan:
## **Re-use metrics**
`inline_cvr` (294738512_58)
**Before**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4601984 bytes
Total number of reused tensors: 1183
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29696 bytes
Total number of reused tensors: 959
```
**After**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4520000 bytes
Total number of reused tensors: 1198
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29120 bytes
Total number of reused tensors: 963
```
Reviewed By: hlu1
Differential Revision: D32370424
fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69219
This change fixes a bug that `aten::embedding_bag` implementation does not adjust the size of a managed output tensor according to a given input after memory planning starts.
Test Plan: Enhanced `StaticRuntime.EmbeddingBag` to trigger the existing bug that's fixed by this change.
Reviewed By: mikeiovine
Differential Revision: D32544399
fbshipit-source-id: 0a9f1d453e96f0cfa8443c8d0b28bbc520e38b29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67219
I found that these specific test cases were causing different failures when developing D31776259. I also found that it was difficult to debug testStaticRuntime failures, so I added more verbose logs gated behind -v 2.
ghstack-source-id: 144507287
Test Plan: Used during development of D31776259
Reviewed By: hlu1
Differential Revision: D31847566
fbshipit-source-id: ea9147fb246c345d18bbc8d7f3bfba48d3a0fab3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639
Fix all problems related to `ProcessedNode:: verify_no_memory_overlap()`
- Only enable this check for native and fallback ops that are not inplace or view ops
- Enable ProcessedNode:: verify_no_memory_overlap() in debug mode and enforce it
- Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas
fb::expand_dims's schema was not correct after this check is re-enabled. It's fixed in D32556204 (39ab417107)
Reviewed By: mikeiovine
Differential Revision: D32553708
fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284
Add a new class `ManagedTensorRanges` that determines when manage tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`.
Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: swolchok
Differential Revision: D32397207
fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368
Currently, each instance of `StaticRuntime` has its own copy of `std::function` object wrapped in `ProcessedNode::Function` object, in order to invoke actual operation implementation.
However, all instances of `StaticRuntime` derived from same `StaticModule` objects invoke exactly same op implementation, and this is avoidable.
This change adds `StaticModule::functions_` member variable to keep a list of unique instance of `ProcessedFunction` objects. A newly constructed `StaticRuntime` takes `ProcessedFunction`'s pointers instead of the whole function object. This can save a substantial amount of memory per `StaticRuntime` instance.
This comes with a sacrifice in execution time. Now that a `ProcessedNode` instance keeps the function object's pointer, executing a node now involves an extra pointer dereference. However, this cost was proved to be negligible from local performance tests.
Thanks to hlu1 for proposing this non-intrusive improvement idea :D
Test Plan:
This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) (patched D32181666 to print the memory turnover from instantiating a StaticRuntime instance) for CMF/local ( & 8% for CMF/local_ro). No noticeable latency regression was observed.
==AFTER
* CMF/local
memory turnover: 393608
latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087
* CMF/local_ro
memory turnover:387288
latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101
==BEFORE
* CMF/local
memory turnover: 459888
latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18
* CMF/local_ro
memory turnover: 420832
latenfcy: PyTorch run finished. Milliseconds per iter: 7.43756. Iters per second: 134.453
==Confirmation that ptvsc2_predictor_bench reports the same memrmoy management stats for inline_cvr:
==AFTER
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
==BEFORE
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
Reviewed By: swolchok
Differential Revision: D32337548
fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935
Rationale should be documented in code comments. In short, we
can avoid heap-allocating arrays of input indexes for operators with 5
or fewer inputs, at the cost of a tag bit check on access.
ghstack-source-id: 143429112
Test Plan:
Patched d1jang's D32181666, which prints static runtime memory usage.
Previous diff, local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
This diff, local:
```
I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064
```
4.5% savings (16144 bytes)
Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff.
Reviewed By: hlu1
Differential Revision: D32216573
fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934
This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode.
ghstack-source-id: 143429113
Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.
Previous diff, CMF local:
```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```
This diff, CMF local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
72912 bytes (17%) savings
```
Perf looks neutral; see next diff (D32216573) test plan for details.
Reviewed By: hlu1
Differential Revision: D32190751
fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939
With `manage_output_tensor` enabled, a client of `StaticRuntime` requires to call it via `PyTorchPredictor::predict_managed_result`. If the client uses `PyTorchPredictor::operator()` the client will experience a crash (intended behavior not to leak memory of managed output tensors). This mistake can cause a catastrophic failure in production if that happens (by gatekeeper, config changes, etc).
Considering the complexity in how `PyTorchPredictor` is used in different settings, the chances that this bug can hit production is non-zero.
This change introduces `StaticRuntime::disableManageOutputTensor` to disable `manage_output_tensor` feature when a client mistakenly uses `PyTorchPredictor::operator()` instead of crashing. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can get non-managed output tensors to pass to the client safely.
A slight perf degradation is expected by forcefully disabling `manage_output_tensors`, but its robustness value outweighs a catastrophic failure of crashes at a high rate.
Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code.
Reviewed By: swolchok
Differential Revision: D32219731
fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825
The comment explains how it works.
Test Plan:
A small regression to local and local_ro if we only enable it for fallback ops.
```
## local_ro
# before
I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247
I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319
I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759
I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619
# after
# # only enable for fall back ops: 0.7%
I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481
I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262
I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51
I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424
# enable for all (no clone): 4.7%
I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005
I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14
I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. Iters per second: 781.506
I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765
# local
# before
I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233
I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641
I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735
I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206
# after
# # only enable for fall back ops: 0.1%
I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349
I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601
I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659
I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658
# enable for all (no clone): 0.9%
I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783
I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611
I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. Iters per second: 50.3037
I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498
```
Reviewed By: d1jang
Differential Revision: D32124812
fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31994040
fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67941
I just found out that due to the round up of the Tensor storage sizes to multiples of 64 bytes, resizing is not actually triggered for a lot of our unit tests (23 OSS, 16 internal). Now they should be all fixed. Also moved a bunch of tests to `test_static_module.cc` so that `test_static_runtime.cc` now only contains operator tests.
From now on, by default if `args2` is passed to `test_static_runtime`, at the end of the second iteration, it would check that the managed buffer's size is bigger than the previous size and enforce that. You can bypass the check for ops with constant output sizes, such as `aten::sum` without `dim` passed in.
Test Plan:
Facebook
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```
Reviewed By: swolchok
Differential Revision: D32196204
fbshipit-source-id: 8425d9efe6b9a1c1e3807e576b1143efd7561c71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67856
Returns a tensor constructed from scalar input
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Ran
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=*NumToTensorScalar* --v=1
```
and the output contains `Switch to out variant for node: %2 : Tensor = prim::NumToTensor(%0)`.
Reviewed By: mikeiovine
Differential Revision: D32014194
fbshipit-source-id: e7df65ea1bf05d59c1fc99b721aee420e484f542
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437
Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this.
Test Plan: `buck test caffe2/test:jit -- test_freezing`
Reviewed By: hlu1
Differential Revision: D31980843
fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31992093
fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67550
`aten::__is__` and `aten::__isnot__` are extremely problematic for a large number of SR graph optimizations.
Some examples:
- Removing ops that are no-ops in the forward pass like `aten::detach`. This would normally be trivial, but `is` introduces corner cases like this:
```
def forward(x):
y = x.detach()
return x is y
```
We get `False` before optimizations. But after optimizations, the test becomes `x is x`, and we get `True`.
- `ReplaceWithCopy`: the pass that replaces ops like `aten::to` with an out variant that copies its input. The following graph returns `True` before optimizations, but `False` afterwards
```
def forward(x):
y = x.to(x.dtype)
return x is y
```
- And many more, `FuseListUnpack` can break too
Since the ops are not used by 99.99% of users, rejecting them so we don't have to think about this is not a big deal.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D32022584
fbshipit-source-id: d135938edb2299c9b8f9511afac2bf568578879e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31965159
fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31962589
fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67255
Add an out variant for `aten::where`.
Since this op can be implemented quite trivially in NNC with `ifThenElse`, I added an NNC kernel as well.
Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: navahgar
Differential Revision: D31923886
fbshipit-source-id: b4379ee3aaf31a000e626b4caeafd3e3f3d60837
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001
The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.
This new overload will probably take a forwarding reference, so to avoid problems with overloading on forwarding reference and simplify the interface, it's best to remove this unused one.
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
`buck test caffe2/test:static_runtime`
Reviewed By: hlu1
Differential Revision: D31821990
fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648
Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:
1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```
The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.
Reviewed By: hlu1
Differential Revision: D31659973
fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66940
`aten::index`'s schema is as follows:
```
"aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
```
The current implementation assumes `indices`' elements are all tensors by doing `elem.toTensor`, which is incorrectly. This change creates an empty optional value if an element from `indices` is not a tensor.
Test Plan: Fixed `StaticRuntime, IndividualOps_Index` to correctly test `aten::index` with `indices` that contains `None`.
Reviewed By: hlu1
Differential Revision: D31712145
fbshipit-source-id: be1c29674bcd55b67b0dcc2a988bc37fd43745f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429
The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
ghstack-source-id: 140532722
Test Plan:
CI
I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.
Reviewed By: hlu1
Differential Revision: D31096042
fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66557
The test was previously using `at::empty_strided` to initialize one of its inputs. The contents of the tensor returned by this function are random, uninitialized memory. If we happened to get a NaN, this test would fail since `use_equalnan` was not set.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31611961
fbshipit-source-id: 79a9476d0d6ce7a9f1412eefcef19bc2618c54b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515
This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:
- At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensors objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple).
- The 2nd+ iterations preallocates a slab memory and all identified output tensors during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, and completes using them, and is responsible to call `StaticRuntime::deallocateOutputTensors()` to deallocate them. This mandates that SR cannot be reentered until `deallocateOutputTensors` is called by the client.
- In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camlcase for function names, and so all newly introduced functions use camlcase despite inconsistencies with snakecase. We can gradually fix the inconsistencies.
This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed.
Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.
- Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing testcases (e.g., StaticRuntime.EmbeddingBag`).
Reviewed By: hlu1
Differential Revision: D31049221
fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517
This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:
- input_aliases: values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that dont't show up in input_aliases and output_aliases are internally created consumed within the graph.
`output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance.
Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.
Reviewed By: hlu1
Differential Revision: D30940955
fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66161
`aten::add` is not guaranteed to be bit exact with the JIT interpreter. This was causing non-deterministic test failures on master.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31406764
fbshipit-source-id: d968cb1bdb8f33934682ef3712a1341a3aacf18e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849
Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31282901
fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65516
This change fixes a bug that Static Runtime's `aten::embedding_bag` out variant implementation creates aliases in its managed output tensors.
Managed output tensors should never be an alias with each other since writing to them can illegally overwrite others' contents unintentionally, and this exact problem was causing the bug at T97393697, causing SR to return wrong return values.
This bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (3fb33b38b9)), but wasn't found so far since our testing didn't include running the model in the debug mode. Fortunately this bug is not hitting production since the aliases outputs are not used in production.
This change fixes the root cause from `_embedding_bag_cpu_impl_out` by replacing alias creation with copying.
Note that this change also includes a fundamental change in Static Runtime's unit testing: `testStaticRuntime` exercises the given graph 3 times:
1. profile run
2. run using the profile to allocate managed tensors
3. reuse the managed tensors -- newly added
Adding 3 reveals this bug with a new unittest `EmbeddingBagWithManagedOutput`.
Test Plan:
- Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248).
- Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future.
Reviewed By: hlu1
Differential Revision: D31104345
fbshipit-source-id: 7bddf9cd82b400d18d8ce1bf15e29b815ef9ba8f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499
When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc.
ghstack-source-id: 139549027
Test Plan:
Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options).
Before:
I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252
I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871
I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916
I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093
I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971
mean: 7.144172, stddev: 0.1283
After:
I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122
I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579
I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033
I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833
I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85
mean: 6.800632, stddev: 0.013227
Looks like about a 5.3% improvement.
Reviewed By: hlu1
Differential Revision: D31125492
fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65741
This op previously assumed `axis == 1`, causing graphs that would otherwise be valid to return incorrect results after fusing.
Reviewed By: hlu1
Differential Revision: D31234944
fbshipit-source-id: 89885a3b119357698ebd9fd429b009813260a2f4
Summary:
This PR attempts to port `baddbmm` and `bmm` to structured kernels. The reason it's in the same PR: because a lot of it is common for both the ops, including the checks and implementation.
Issue tracker: https://github.com/pytorch/pytorch/issues/55070
cc: ysiraichi ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805
Reviewed By: gchanan
Differential Revision: D31134454
Pulled By: ezyang
fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387
Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.
Also, added a SR microbenchmark for this kernel which shows the performance improvement.
Without fusion:
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16 1953 ns 1953 ns 358746
BM_signed_log1p/64 2049 ns 2049 ns 342145
BM_signed_log1p/512 3291 ns 3291 ns 214342
BM_signed_log1p/4096 15559 ns 15559 ns 44420
BM_signed_log1p/32768 101936 ns 101935 ns 6843
BM_signed_log1p/65536 194792 ns 194789 ns 3615
```
With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16 369 ns 369 ns 1896179
BM_signed_log1p/64 497 ns 497 ns 1406995
BM_signed_log1p/512 1618 ns 1618 ns 430209
BM_signed_log1p/4096 11327 ns 11326 ns 61463
BM_signed_log1p/32768 84099 ns 84086 ns 8325
BM_signed_log1p/65536 166531 ns 166510 ns 4186
```
This clearly shows >15% improvement in performance of this kernel with NNC fusion.
On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
with NNC fusion: `0.55%`
Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`
Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)
```
get 57220 prediction values
get 57220 prediction values
max_error: 0 total: 0
```
Reviewed By: hlu1
Differential Revision: D30609492
fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd