Commit Graph

119 Commits

Author SHA1 Message Date
Mike Iovine
a0495b3cdb [SR] Remove unused operator() overload (#67001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001

The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.

This new overload will probably take a forwarding reference, so to avoid problems with overloading on forwarding reference and simplify the interface, it's best to remove this unused one.

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

`buck test caffe2/test:static_runtime`

Reviewed By: hlu1

Differential Revision: D31821990

fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
2021-10-25 08:18:58 -07:00
Mike Iovine
f2582a59d0 [SR] Add rvalue overload for operator() (#66648)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648

Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:

1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```

The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
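
A minimal, standard-library-only sketch of this forwarding pattern follows. `ToyRuntime`, `Value`, `run`, and the two helper functions are hypothetical stand-ins (a `std::shared_ptr<int>` plays the role of a refcounted `IValue`); this is not the actual Static Runtime code.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a refcounted IValue.
using Value = std::shared_ptr<int>;

class ToyRuntime {
 public:
  // Lvalue overload: copies each element, bumping its refcount.
  void set_inputs(const std::vector<Value>& args) { inputs_ = args; }

  // Rvalue overload: takes over the elements, so no refcount changes.
  void set_inputs(std::vector<Value>&& args) { inputs_ = std::move(args); }

  // Forwarding reference: selects the overload matching the caller's value category.
  template <class ValueList>
  long run(ValueList&& args) {
    set_inputs(std::forward<ValueList>(args));
    assert(!inputs_.empty());
    return inputs_.front().use_count();
  }

 private:
  std::vector<Value> inputs_;
};

// Refcount observed inside the runtime when the inputs are copied in.
inline long refcount_after_copy() {
  std::vector<Value> args{std::make_shared<int>(1)};
  ToyRuntime rt;
  return rt.run(args);  // args is still alive, so there are two owners
}

// Refcount observed inside the runtime when the inputs are moved in.
inline long refcount_after_move() {
  std::vector<Value> args{std::make_shared<int>(1)};
  ToyRuntime rt;
  return rt.run(std::move(args));  // the runtime is the only owner
}
```

In this sketch, passing an lvalue leaves two owners of the value, while passing an rvalue leaves the runtime as the sole owner, which is the refcount bump the change aims to avoid.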

Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.

Reviewed By: hlu1

Differential Revision: D31659973

fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
2021-10-22 10:51:47 -07:00
Aditya Pillai
40a8a50913 Add static_runtime::fused_equally_split (#2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch-canary/pull/2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66881

Adds `static_runtime::fused_equally_split` operator and removes `is_fused` logic from original operator. Modifies `FuseUnpackListV2` to map `fb::equally_split` to this new operator.

Test Plan:
```
adityapillai@5960 /data/sandcastle/boxes/fbsource/fbcode 1m 13s
❯ buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```
and sandcastle

Reviewed By: mikeiovine

Differential Revision: D31742293

fbshipit-source-id: 60b35589c8817719b005d49811f575b6590d1c39
2021-10-22 10:26:49 -07:00
Don Jang
18bbc4c2b7 [Static Runtime] Fix a bug in aten::index (#66940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66940

`aten::index`'s schema is as follows:

```
aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
```

The current implementation assumes that all elements of `indices` are tensors by calling `elem.toTensor`, which is incorrect. This change creates an empty optional value if an element of `indices` is not a tensor.
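
The conversion described above can be sketched roughly as follows. `ToyValue` and `to_optional_indices` are hypothetical names, and an `int` id stands in for a tensor; the real code works on `IValue` and `at::Tensor`.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for an IValue that is either a tensor or None.
struct ToyValue {
  bool is_tensor = false;
  int tensor_id = 0;  // meaningful only when is_tensor is true
};

// Converts a `Tensor?[]`-style list without assuming every element is a tensor:
// a None element becomes an empty optional instead of being unwrapped.
inline std::vector<std::optional<int>> to_optional_indices(
    const std::vector<ToyValue>& indices) {
  std::vector<std::optional<int>> result;
  result.reserve(indices.size());
  for (const ToyValue& elem : indices) {
    if (elem.is_tensor) {
      result.emplace_back(elem.tensor_id);
    } else {
      result.emplace_back(std::nullopt);
    }
  }
  return result;
}
```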

Test Plan: Fixed `StaticRuntime, IndividualOps_Index` to correctly test `aten::index` with `indices` that contains `None`.

Reviewed By: hlu1

Differential Revision: D31712145

fbshipit-source-id: be1c29674bcd55b67b0dcc2a988bc37fd43745f3
2021-10-20 15:51:21 -07:00
Hao Lu
6634570aef [SR] Fix bug in ValueGroup (#66470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66470

Reviewed By: d1jang

Differential Revision: D31566348

fbshipit-source-id: e0f634af77d893bbc8d66f214b2b8bdd6ab58cc3
2021-10-13 19:26:38 -07:00
Scott Wolchok
d30397d42a [PyTorch][Static Runtime] Don't use vector in ProcessedNode (#65429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429

The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
ghstack-source-id: 140532722

Test Plan:
CI

I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.

Reviewed By: hlu1

Differential Revision: D31096042

fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
2021-10-13 19:13:20 -07:00
Mike Iovine
37db650c9c [Static Runtime] Clone test does not use uninitialized memory (#66557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66557

The test was previously using `at::empty_strided` to initialize one of its inputs. The contents of the tensor returned by this function are random, uninitialized memory. If we happened to get a NaN, this test would fail since `use_equalnan` was not set.
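
To illustrate why a stray NaN breaks the comparison, here is a small sketch of an element-wise closeness check with an `equal_nan` switch. `toy_allclose` and its tolerance are assumptions for illustration, not the helper the test actually uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Element-wise closeness check; NaNs compare equal only when equal_nan is true.
inline bool toy_allclose(const std::vector<float>& a,
                         const std::vector<float>& b,
                         bool equal_nan,
                         float atol = 1e-6f) {
  if (a.size() != b.size()) {
    return false;
  }
  for (std::size_t i = 0; i < a.size(); ++i) {
    const bool both_nan = std::isnan(a[i]) && std::isnan(b[i]);
    if (both_nan) {
      if (!equal_nan) {
        return false;
      }
      continue;
    }
    // Any comparison involving a single NaN is false, so that case is rejected too.
    if (!(std::fabs(a[i] - b[i]) <= atol)) {
      return false;
    }
  }
  return true;
}
```

With `equal_nan` off, even comparing a buffer against itself fails as soon as it contains a NaN, which matches the flaky behavior described above.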

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31611961

fbshipit-source-id: 79a9476d0d6ce7a9f1412eefcef19bc2618c54b8
2021-10-13 14:02:34 -07:00
Don Jang
736fa09a9a [Static Runtime] Manage output tensors (#65515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515

This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:

- At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensor objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple).
- The 2nd and later iterations preallocate a slab of memory and all output tensors identified during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them. This means that SR cannot be reentered until `deallocateOutputTensors` is called by the client.
- If a buggy client misses the call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camelCase for function names, so all newly introduced functions use camelCase despite the inconsistency with the existing snake_case names. We can gradually fix the inconsistencies.
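
The client-side protocol in the list above can be sketched with a toy class. `ToyManagedRuntime`, its members, and `reentry_throws` are illustrative assumptions; only the method name `deallocateOutputTensors` comes from the description.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Toy model of the ownership protocol; not the real StaticRuntime.
class ToyManagedRuntime {
 public:
  // Runs one iteration and hands out a pointer to runtime-owned output storage.
  const std::vector<float>* run() {
    if (outputs_outstanding_) {
      // Reentry before deallocation is a client bug: fail loudly instead of leaking.
      throw std::runtime_error("outputs from the previous run were not deallocated");
    }
    output_.assign(4, 1.0f);
    outputs_outstanding_ = true;
    return &output_;
  }

  // Called by the client once it has finished reading the outputs.
  void deallocateOutputTensors() {
    output_.clear();
    outputs_outstanding_ = false;
  }

 private:
  std::vector<float> output_;
  bool outputs_outstanding_ = false;
};

// Returns true if a second run() without deallocation throws.
inline bool reentry_throws() {
  ToyManagedRuntime rt;
  rt.run();
  try {
    rt.run();
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```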

This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed.

Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.

- Enhanced `testStaticRuntime` to exercise each unit test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing test cases (e.g., `StaticRuntime.EmbeddingBag`).

Reviewed By: hlu1

Differential Revision: D31049221

fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
2021-10-11 09:50:54 -07:00
Don Jang
416f593080 [Static Runtime] Group graph nodes into input aliases & output aliases (#65517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517

This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:

- input_aliases: values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are created and consumed internally within the graph.

`output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance.
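
The grouping rule can be sketched as below, assuming the alias information has already been computed per value. `ToyValue`, `ToyValueGroup`, and `group_values` are hypothetical; the real `ValueGroup` derives aliasing from the graph itself.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-value summary; alias analysis is assumed to be done elsewhere.
struct ToyValue {
  std::string name;
  bool aliases_input_or_constant;  // is an input/constant, or may alias one
  bool aliases_output;             // is an output, or may alias one
};

struct ToyValueGroup {
  std::set<std::string> input_aliases;
  std::set<std::string> output_aliases;
  std::set<std::string> internal;  // created and consumed within the graph
};

inline ToyValueGroup group_values(const std::vector<ToyValue>& values) {
  ToyValueGroup group;
  for (const ToyValue& v : values) {
    if (v.aliases_input_or_constant) {
      // input_aliases takes precedence over output_aliases.
      group.input_aliases.insert(v.name);
    } else if (v.aliases_output) {
      group.output_aliases.insert(v.name);
    } else {
      group.internal.insert(v.name);
    }
  }
  return group;
}
```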

Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.

Reviewed By: hlu1

Differential Revision: D30940955

fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
2021-10-07 14:35:12 -07:00
Mike Iovine
d5f64afc38 [Static Runtime] Support aten::to.prim_dtype overload (#64928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64928

Added support for this overload of `aten::to`:
```
aten::to.prim_dtype(Tensor(a) self, int? dtype, bool non_blocking=False, bool copy=False) -> Tensor(a|b)
```

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_to`

Reviewed By: hlu1

Differential Revision: D30901398

fbshipit-source-id: 38ce807c30185e92dd472b404b362f22ac7e4efb
2021-10-07 10:22:44 -07:00
Mike Iovine
6d7fab5929 [Static Runtime][easy] Clone scripts do not use aten::add (#66161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66161

`aten::add` is not guaranteed to be bit exact with the JIT interpreter. This was causing non-deterministic test failures on master.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31406764

fbshipit-source-id: d968cb1bdb8f33934682ef3712a1341a3aacf18e
2021-10-06 12:37:39 -07:00
Mike Iovine
ed50fa2513 [Static Runtime] Test isOptimizableContainerType and getAlwaysAliveValues (#65849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849

Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31282901

fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9
2021-10-04 20:46:07 -07:00
Don Jang
89ed9bdaee [Static Runtime] Fix bug of creating output aliases in aten::embedding_bag (#65516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65516

This change fixes a bug in which Static Runtime's `aten::embedding_bag` out variant implementation creates aliases among its managed output tensors.

Managed output tensors should never alias each other, since writing to one can unintentionally overwrite another's contents. This exact problem caused the bug at T97393697, which made SR return wrong values.

This bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (3fb33b38b9)), but it had not been found until now because our testing didn't include running the model in debug mode. Fortunately, this bug does not affect production since the aliased outputs are not used there.

This change fixes the root cause in `_embedding_bag_cpu_impl_out` by replacing alias creation with copying.

Note that this change also includes a fundamental change in Static Runtime's unit testing: `testStaticRuntime` exercises the given graph 3 times:
 1. profile run
 2. run using the profile to allocate managed tensors
 3. reuse the managed tensors -- newly added

Adding step 3 reveals this bug with a new unit test, `EmbeddingBagWithManagedOutput`.

Test Plan:
- Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248).

- Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future.

Reviewed By: hlu1

Differential Revision: D31104345

fbshipit-source-id: 7bddf9cd82b400d18d8ce1bf15e29b815ef9ba8f
2021-10-03 15:10:58 -07:00
Scott Wolchok
ffede499b2 [PyTorch][Static Runtime] Fast path for contiguous to_copy (#65499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499

When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc.
ghstack-source-id: 139549027

Test Plan:
Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options).

Before:

I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252
I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871
I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916
I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093
I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971

mean: 7.144172, stddev: 0.1283

After:

I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122
I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579
I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033
I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833
I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85

mean: 6.780632, stddev: 0.013227

Looks like about a 5.3% improvement.
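
For reference, the summary statistics can be recomputed with a few lines of standard C++. The function names are made up for this sketch, and the population form of the standard deviation (divide by N) is an assumption that happens to be consistent with the "Before" figure. Feeding in the five "Before" timings listed above reproduces the reported mean of 7.144172 ms.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of a non-empty sample.
inline double mean_of(const std::vector<double>& xs) {
  assert(!xs.empty());
  double sum = 0.0;
  for (double x : xs) {
    sum += x;
  }
  return sum / static_cast<double>(xs.size());
}

// Population standard deviation (divides by N, not N - 1).
inline double population_stddev(const std::vector<double>& xs) {
  const double mu = mean_of(xs);
  double squared = 0.0;
  for (double x : xs) {
    squared += (x - mu) * (x - mu);
  }
  return std::sqrt(squared / static_cast<double>(xs.size()));
}

// Throughput gain implied by two mean latencies, as a fraction (0.05 means 5%).
inline double speedup_fraction(double before_ms, double after_ms) {
  return before_ms / after_ms - 1.0;
}
```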

Reviewed By: hlu1

Differential Revision: D31125492

fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823
2021-10-01 12:13:33 -07:00
Mike Iovine
5f7ab7be6f [Static Runtime] concat_add_mul_replacenan_clip retains axis arg (#65741)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65741

This op previously assumed `axis == 1`, causing graphs that would otherwise be valid to return incorrect results after fusing.

Reviewed By: hlu1

Differential Revision: D31234944

fbshipit-source-id: 89885a3b119357698ebd9fd429b009813260a2f4
2021-09-29 08:04:20 -07:00
Kushashwa Ravi Shrimali
4752453d27 [Structured Kernels] Port for baddbmm and bmm (#64805)
Summary:
This PR attempts to port `baddbmm` and `bmm` to structured kernels. Both ops are ported in the same PR because they share much of their code, including the checks and the implementation.

Issue tracker: https://github.com/pytorch/pytorch/issues/55070

cc: ysiraichi ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805

Reviewed By: gchanan

Differential Revision: D31134454

Pulled By: ezyang

fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261
2021-09-28 11:07:31 -07:00
Mike Iovine
ef9e560796 [Static Runtime] Add aten::remainder out variant (#64967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64967

Out variant implementation for `aten::remainder`. Added both scalar and tensor overloads.
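
As a rough illustration of what an out variant with scalar and tensor overloads looks like, here is a toy version over `std::vector<float>`. All names are hypothetical; the sketch assumes the result takes the sign of the divisor (as in Python's `%`) and writes into a caller-provided buffer rather than allocating a new one.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Remainder whose sign follows the divisor.
inline float toy_remainder(float a, float b) {
  float r = std::fmod(a, b);
  if (r != 0.0f && ((r < 0.0f) != (b < 0.0f))) {
    r += b;
  }
  return r;
}

// Tensor overload: element-wise divisor, result written into `out`.
inline void remainder_out(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::vector<float>& out) {
  assert(a.size() == b.size());
  out.resize(a.size());  // reuses existing capacity when it is large enough
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = toy_remainder(a[i], b[i]);
  }
}

// Scalar overload: the same divisor is applied to every element.
inline void remainder_out(const std::vector<float>& a,
                          float b,
                          std::vector<float>& out) {
  out.resize(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = toy_remainder(a[i], b);
  }
}
```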

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Remainder`

Reviewed By: d1jang

Differential Revision: D30915469

fbshipit-source-id: 9f27f18c86d66b11eac0aa4659c7062cb785b7e9
2021-09-24 07:51:39 -07:00
Raghavan Raman
31584d065e [Static Runtime] Added NNC implementation for signed log1p kernel. (#65387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387

Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.

Also, added a SR microbenchmark for this kernel which shows the performance improvement.

Without fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                             1953 ns       1953 ns     358746
BM_signed_log1p/64                             2049 ns       2049 ns     342145
BM_signed_log1p/512                            3291 ns       3291 ns     214342
BM_signed_log1p/4096                          15559 ns      15559 ns      44420
BM_signed_log1p/32768                        101936 ns     101935 ns       6843
BM_signed_log1p/65536                        194792 ns     194789 ns       3615
```

With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                              369 ns        369 ns    1896179
BM_signed_log1p/64                              497 ns        497 ns    1406995
BM_signed_log1p/512                            1618 ns       1618 ns     430209
BM_signed_log1p/4096                          11327 ns      11326 ns      61463
BM_signed_log1p/32768                         84099 ns      84086 ns       8325
BM_signed_log1p/65536                        166531 ns     166510 ns       4186
```

This shows that NNC fusion speeds up the kernel at every size, from about 14.5% at size 65536 to about 81% at size 16.

On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
  without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
  with NNC fusion: `0.55%`

Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)

```
get 57220 prediction values
get 57220 prediction values
max_error:  0  total:  0
```

Reviewed By: hlu1

Differential Revision: D30609492

fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
2021-09-22 15:53:33 -07:00
Hao Lu
ce101fed02 [PyPer] copy-free freeze_module (#65118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65118

Cloning the module can increase memory use. By freezing the module directly without cloning it first, we can avoid this memory usage increase.

Reviewed By: eellison, movefast1990

Differential Revision: D30955053

fbshipit-source-id: 2feb738eddcf66aa68c92bf695cc05b57bd990f0
2021-09-20 17:25:10 -07:00
Mike Iovine
99e4ab5d44 [Static Runtime] Implement and enable variadic tuple unpack (#64934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934

Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from:
```
%0, %1 = prim::TupleUnpack(%a)
%2, %3 = prim::TupleUnpack(%b)
```
into:
```
%0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b)
```

The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about.
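
The "contiguous blocks only" rule can be sketched on a flat node list. `ToyNode` and `fuse_tuple_unpacks` are hypothetical, and the choice to fuse only runs of two or more nodes is an assumption of this sketch rather than something stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat representation of a graph node.
struct ToyNode {
  std::string kind;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
};

// Merges each run of two or more adjacent TupleUnpack nodes into one
// VarTupleUnpack node; all other nodes are copied through unchanged.
inline std::vector<ToyNode> fuse_tuple_unpacks(const std::vector<ToyNode>& nodes) {
  std::vector<ToyNode> result;
  std::size_t i = 0;
  while (i < nodes.size()) {
    // Find the end of the run of TupleUnpack nodes starting at i.
    std::size_t j = i;
    while (j < nodes.size() && nodes[j].kind == "prim::TupleUnpack") {
      ++j;
    }
    if (j - i >= 2) {
      ToyNode fused{"static_runtime::VarTupleUnpack", {}, {}};
      for (std::size_t k = i; k < j; ++k) {
        fused.inputs.insert(fused.inputs.end(),
                            nodes[k].inputs.begin(), nodes[k].inputs.end());
        fused.outputs.insert(fused.outputs.end(),
                             nodes[k].outputs.begin(), nodes[k].outputs.end());
      }
      result.push_back(fused);
      i = j;
    } else {
      result.push_back(nodes[i]);
      ++i;
    }
  }
  return result;
}
```

Because only adjacent nodes are merged, no node ever moves past an unrelated node, so the relative order of side effects is preserved.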

Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack`

Reviewed By: d1jang

Differential Revision: D30872109

fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9
2021-09-20 10:36:11 -07:00
Don Jang
ae00075ac7 [Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123

This change re-reverts D30883290 (0e11454d19). D30883290 (0e11454d19) broke the OSS build because it implicitly removed the default move constructor of `StaticRuntime`.

```
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57   return torch::jit::StaticRuntime(*smod);
Sep 15 15:39:57          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57   std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57                                  ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57       unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57       ^
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57    auto sr = getStaticRuntime();
Sep 15 15:39:57         ^    ~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57   std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57                                  ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57       unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57       ^
Sep 15 15:39:57 2 errors generated.
```

This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine).
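
One common way to end up in this situation is sketched below; whether it is the exact cause in this diff is an assumption. A user-declared destructor suppresses the implicit move constructor, and a `std::unique_ptr` member deletes the implicit copy constructor, so returning the object by value falls back to a deleted copy. `ToyPlanner`, `BrokenRuntime`, `FixedRuntime`, and `make_runtime` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

struct ToyPlanner {
  int managed_bytes = 0;
};

// User-declared destructor: no implicit move constructor is generated, and the
// unique_ptr member makes the implicit copy constructor deleted.
class BrokenRuntime {
 public:
  BrokenRuntime() : planner_(std::make_unique<ToyPlanner>()) {}
  ~BrokenRuntime() {}
  bool has_planner() const { return planner_ != nullptr; }

 private:
  std::unique_ptr<ToyPlanner> planner_;
};

// Explicitly defaulting the move constructor makes return-by-value work again.
class FixedRuntime {
 public:
  FixedRuntime() : planner_(std::make_unique<ToyPlanner>()) {}
  FixedRuntime(FixedRuntime&&) = default;
  ~FixedRuntime() {}
  bool has_planner() const { return planner_ != nullptr; }

 private:
  std::unique_ptr<ToyPlanner> planner_;
};

// Returning a named local requires an accessible move (or copy) constructor.
inline FixedRuntime make_runtime() {
  FixedRuntime rt;
  return rt;
}

static_assert(!std::is_move_constructible<BrokenRuntime>::value,
              "BrokenRuntime can be neither moved nor copied");
static_assert(std::is_move_constructible<FixedRuntime>::value,
              "FixedRuntime can be moved");
```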

Original Summary:

This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.

`MemoryPlanner` performs an independent sub-task: static analysis of a graph, creating a memory plan, and allocating/deallocating managed Tensors.

This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.

Test Plan: - Confirm that OSS build went well (See External Tests section).

Reviewed By: mikeiovine

Differential Revision: D30983292

fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116
2021-09-17 13:32:01 -07:00
Don Jang
8241193d76 [Static Runtime] Introduce static_runtime::dict_unpack (#64771)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64771

Test Plan:
- Added `StaticRuntime.RemoveImmutableInputDictLookupsWithImmutableInputDict`
- Added `StaticRuntime.RemoveImmutableInputDictLookupsWithMutableInputDict`
- TBD: Perf impact measurement

Reviewed By: mikeiovine

Differential Revision: D30685083

fbshipit-source-id: 050a92ef3b3ed0fdc0ab7a13a4b5dbfede9342a9
2021-09-16 23:25:13 -07:00
Don Jang
3fb33b38b9 [Static Runtime] Check if outputs of a node do not overlap with each other (#63013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013

This change enhances the current memory overlap check to include outputs: it enforces the constraint that the outputs of a node must NOT overlap with each other, since the node is supposed to update all of them at the same time.

This check will detect a problem like T97393697 immediately in debug mode.
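
A pairwise version of such a check could look like the following. `ToyBuffer` and both functions are hypothetical; buffers are modeled as half-open byte ranges rather than real tensor storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open byte range [begin, begin + size) occupied by one output buffer.
struct ToyBuffer {
  std::size_t begin;
  std::size_t size;
};

inline bool overlaps(const ToyBuffer& a, const ToyBuffer& b) {
  if (a.size == 0 || b.size == 0) {
    return false;  // an empty buffer shares no bytes with anything
  }
  return a.begin < b.begin + b.size && b.begin < a.begin + a.size;
}

// Returns true when no two outputs of a node share any byte.
inline bool outputs_do_not_overlap(const std::vector<ToyBuffer>& outputs) {
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    for (std::size_t j = i + 1; j < outputs.size(); ++j) {
      if (overlaps(outputs[i], outputs[j])) {
        return false;
      }
    }
  }
  return true;
}
```

The check is quadratic in the number of outputs, which is acceptable for a debug-only assertion on nodes with a handful of outputs.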

Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`

- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.

Reviewed By: hlu1

Differential Revision: D30211705

fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
2021-09-15 08:38:05 -07:00
Mike Iovine
616fd9219d [Static Runtime] Add sign/abs/log1p/mul fusion pass (#64209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209

Add a new fusion pass that transforms the following pattern:
```
graph(%input):
    %0 : Tensor = aten::sign(%input)
    %1 : Tensor = aten::abs(%input)
    %2 : Tensor = aten::log1p(%1)
    %res : Tensor = aten::mul(%0, %2)
    return (%res)
```
Into a single op:
```
graph(%input):
    %res : Tensor = static_runtime::signed_log1p(%input)
    return (%res)
```

The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff **does not** enable this pass.

Followup: navahgar will add an NNC kernel that is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes.
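
For illustration, the math computed by the fused op, sign(x) * log1p(|x|), can be written as a single pass over a buffer. This scalar loop over `std::vector<float>` is only a sketch of the idea; it is not the fused kernel from this diff and makes no claim about its vectorization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Computes sign(x) * log1p(|x|) element-wise in one pass over the input.
inline std::vector<float> signed_log1p(const std::vector<float>& input) {
  std::vector<float> out(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    const float x = input[i];
    // sign is -1, 0, or +1 depending on the sign of x.
    const float sign = static_cast<float>((x > 0.0f) - (x < 0.0f));
    out[i] = sign * std::log1p(std::fabs(x));
  }
  return out;
}
```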

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Test passed with new graph pass disabled and enabled.

Reviewed By: hlu1

Differential Revision: D30559929

fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a
2021-09-02 08:31:40 -07:00
Ray Peng
09e610e36d [Static Runtime] Out version for softmax (#64243)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64243

Test Plan:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
...
V0830 16:35:22.524479 613839 impl.cpp:1410] Switch to out variant for node: %5 : Tensor = aten::softmax(%a.1, %dim.1, %dtype.1)
...
[       OK ] StaticRuntime.IndividualOps_Softmax (803 ms)
```

Reviewed By: hlu1

Differential Revision: D30656149

fbshipit-source-id: 115b7b4a75448fd6a5c526808080ca9a4251302c
2021-08-31 18:33:26 -07:00
Harut Movsisyan
3c15822f5f [Static Runtime] Implement aten::nonzero out variant (#64126)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64126

Test Plan:
Confirm out variant is called:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: mikeiovine

Differential Revision: D30617729

fbshipit-source-id: 752749638c8f467815efa57021cb3de5c728ab1b
2021-08-31 00:51:15 -07:00
Harut Movsisyan
1f16c22dc8 [Static Runtime] Implement aten::cumsum out variant (#64159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159

Test Plan:
Confirm out variant is called for both versions:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: mikeiovine

Differential Revision: D30622819

fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736
2021-08-30 16:18:22 -07:00
Harut Movsisyan
e24c3644d8 [Static Runtime] aten::cat out version when it is not being replaced by prim::VarConcat (#64157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157

The UseVariadicCat optimization is not applied to aten::cat if the list input to the op cannot be moved to the position before the op (https://fburl.com/diffusion/l6kweimu). For these cases, we need an out variant for SR.

Test Plan:
Confirm out variant is called:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: d1jang

Differential Revision: D30598574

fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a
2021-08-30 09:42:38 -07:00
Harut Movsisyan
8af1407eab [Static Runtime] Out version for torch.linalg.norm (#64070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070

Test Plan:
Confirm out variant is called for both versions:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: d1jang

Differential Revision: D30595816

fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563
2021-08-29 21:00:11 -07:00
Don Jang
9f1f22b9bc [Static Runtime] Add out variant of quantized::embedding_bag_byte_prepack (#64081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64081

This change adds an out variant of `quantized::embedding_bag_byte_prepack`.

Test Plan:
- Added `ShapeInferenceTest.QEmbeddingBagByteUnpack`.

- Observed

```
V0824 13:38:49.723708 1322143 impl.cpp:1394] Switch to out variant for node: %2 : Tensor = quantized::embedding_bag_byte_prepack(%input)
```

Reviewed By: hlu1

Differential Revision: D30504216

fbshipit-source-id: 1d9d428e77a15bcc7da373d65e7ffabaf9c6caf2
2021-08-27 10:53:23 -07:00
Harut Movsisyan
f2c47cf4db [Static Runtime] Out version for fmod (#64046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64046

Test Plan:
Confirm out variant is used:
```
> //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1

V0826 23:31:30.321382 193428 impl.cpp:1395] Switch to out variant for node: %4 : Tensor = aten::fmod(%a.1, %b.1)
```

Reviewed By: mikeiovine

Differential Revision: D30581228

fbshipit-source-id: dfab9a16ff8afd40b29338037769f938f154bf74
2021-08-27 03:05:06 -07:00
Don Jang
c90b3cb1da [Static Runtime] Manage temporary Tensors for aten::layer_norm (#64078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078

This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the static runtime can manage the `tmp1` and `tmp2` Tensors.

Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
    at::Tensor mean = create_empty_from({M}, *X);
    at::Tensor rstd = create_empty_from({M}, *X);
```
that the static runtime misses an opportunity to manage.

This change puts them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime currently chooses to manage only output tensors.

Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.

- Confirmed that the new op gets activated during testing:

```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)

```

Reviewed By: hlu1

Differential Revision: D30486475

fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
2021-08-27 02:44:43 -07:00
Don Jang
cbfec02007 [Static Runtime] Add native op for aten::expand_as (#64024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64024

`aten::expand_as` creates a view of the input tensor. This change adds its native op implementation for the static runtime.

Test Plan: - Added `StaticRuntime.IndividualOps_ExpandAs`

Reviewed By: hlu1

Differential Revision: D30546851

fbshipit-source-id: e53483048af890bc41b6192a1ab0c5ba0ee2bdc0
2021-08-26 13:05:53 -07:00
Hao Lu
6fa646ad54 [StaticRuntime] Fix bug in HasInplaceOp (#63842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63842

Reviewed By: mikeiovine

Differential Revision: D30506914

fbshipit-source-id: b2e358cfb991dacdb295b61bbc37beb36b73b852
2021-08-24 17:07:45 -07:00
Mike Iovine
7774a4e95b [Static Runtime] Implement prim::VarStack out variant (#63579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579

Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`

Reviewed By: navahgar

Differential Revision: D30410525

fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
2021-08-24 09:44:29 -07:00
Don Jang
84890aae35 [Static Runtime] Add an out variant op for aten::abs (#63675)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63675

This change adds an out variant implementation for `aten::abs`.

Test Plan:
- Observed `V0820 14:14:08.880342 101788 impl.cpp:1394] Switch to out variant for node: %3 : Tensor = aten::abs(%a.1)`

- Perf impact: TBD

Reviewed By: hlu1

Differential Revision: D30461317

fbshipit-source-id: 0c0230bd40afe463ae1ccb222c2a1207ebcf4191
2021-08-23 16:25:10 -07:00
Hao Lu
b2a601ffe5 [Static Runtime] Implement out variant for fb::quantized_linear (#63635)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63635

Reviewed By: ajyu

Differential Revision: D30446234

fbshipit-source-id: 1ef014186ff725930a97d0159626f9233ee74030
2021-08-20 21:42:22 -07:00
Don Jang
913c1f83f4 [Static Runtime] Add native op for aten::detach (#63625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63625

This change adds a native op implementation of `aten::detach` to the static runtime.

See the standard `aten::detach` implementation (https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp.html#_ZN2at6native6detachERKNS_6TensorE) for comparison.

Test Plan:
- Added `StaticRuntime.IndividualOps_Detach`.

- Observed

```
V0819 18:55:33.181188 3092034 impl.cpp:1398] Switch to native impl for node: %a.1 : Tensor = aten::detach(%input.1)
```

Reviewed By: hlu1

Differential Revision: D30443187

fbshipit-source-id: d6e0eadb1b817e0a126c4fc97526abc276ee8a17
2021-08-20 00:46:27 -07:00
Mike Iovine
47a9e8ff32 [Static Runtime] Support __getitem__ for lists (#63398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63398

This change provides a native `__getitem__` implementation for lists to avoid overhead associated with falling back to the JIT interpreter.

Test Plan: Unit tests: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D30368464

fbshipit-source-id: e0e0971508cd5d9bcf6025606993dc24ecbf6764
2021-08-19 06:38:51 -07:00
Mike Iovine
9d9e7a8d72 [Static Runtime] Implement aten::append (#63350)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63350

Add a native implementation for `aten::append`, the list append op.

Test Plan: New unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Append`

Reviewed By: hlu1

Differential Revision: D30326461

fbshipit-source-id: 0dbdf6cc82e78c7c36db39583256f6b87385e3d3
2021-08-17 13:40:18 -07:00
Mike Iovine
078b8004a6 [Static Runtime] Implement prim::TupleUnpack (#63243)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63243

Add `prim::TupleUnpack` native op to static runtime.

Test Plan: Unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D30306955

fbshipit-source-id: 21923d6cbd5545c144ac051b3d48b37ec6e610cf
2021-08-16 14:56:30 -07:00
Mike Iovine
3dcd785cac [Static Runtime] Add tests for all aten ops (#62347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62347

This diff includes tests for all `aten` ops that did not already have test coverage.

Test Plan: `buck test //caffe2/benchmarks/static_runtime/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D29968280

fbshipit-source-id: 768655ca535f9e37422711673168dce193de45d2
2021-08-09 12:09:59 -07:00
Rong Rong (AI Infra)
7f1b672b7a Revert D29952381: [Static Runtime] Ensure that unittests only use out variants or native ops
Test Plan: revert-hammer

Differential Revision:
D29952381 (8737e17af2)

Original commit changeset: e60e70b80ccf

fbshipit-source-id: 59dc2f920b7ceaf94ba8f5f36024e7cc710f6645
2021-08-04 14:25:11 -07:00
Don Jang
8737e17af2 [Static Runtime] Ensure that unittests only use out variants or native ops (#62335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62335

This change ensures that unittests only use out variants or native ops.

- Our unittests currently assume that the static runtime replaces each interpreter op in a graph with its corresponding out variant / native op, but the unittests do not check this. This change adds that check.

- We relied on manual inspection of log messages to see whether an out variant is used for a specific workload, even for unittesting. This change frees us from doing that.

- `aten::add` is excluded from this check since it's only enabled for an internal workload. Some unittests are also excluded by setting `expect_interpreter_op = true`, since they are written to use interpreter ops by design.
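
The check described in the list above can be sketched roughly as follows. This is a plain Python illustration under stated assumptions, not the code from the diff: `Node`, `interpreter_fallbacks`, and `check_graph` are hypothetical names, and only the `expect_interpreter_op` flag and the `aten::add` exclusion come from the summary.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str  # op name, e.g. "aten::abs"
    impl: str  # "out_variant", "native", or "interpreter"


def interpreter_fallbacks(nodes, excluded=frozenset({"aten::add"})):
    """Return the kinds of nodes that fell back to the interpreter."""
    return [n.kind for n in nodes
            if n.impl == "interpreter" and n.kind not in excluded]


def check_graph(nodes, expect_interpreter_op=False):
    """Raise if a fallback appears in a test that does not expect one."""
    fallbacks = interpreter_fallbacks(nodes)
    if fallbacks and not expect_interpreter_op:
        raise AssertionError(f"interpreter fallback for: {fallbacks}")
    return fallbacks


graph = [
    Node("aten::abs", "out_variant"),
    Node("aten::detach", "native"),
    Node("aten::add", "interpreter"),  # excluded from the check
]
ok = check_graph(graph)
```

A test that uses interpreter ops by design would pass `expect_interpreter_op=True` to skip the failure.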

Test Plan: Ran `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest` successfully.

Reviewed By: mikeiovine, hlu1

Differential Revision: D29952381

fbshipit-source-id: e60e70b80ccf45e91c6654b4ad53f92ffd5ab702
2021-08-04 11:37:15 -07:00
Mike Iovine
34f50c6e35 [Static Runtime] testStaticRuntime verifies that # of nodes is at least 2 (#62622)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62622

This allows us to catch cases where an out variant is being tested but the test author forgot to call `.clone()` in the test script. Having 2 or more ops does not guarantee that the memory planner is being exercised, but having fewer than 2 guarantees that it is not being used.
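
A minimal sketch of such a guard, assuming the graph is represented as a list of op names (`check_node_count` is a hypothetical helper, not the function in `testStaticRuntime`):

```python
def check_node_count(node_kinds, minimum=2):
    """Reject graphs too small to exercise the memory planner."""
    if len(node_kinds) < minimum:
        raise ValueError(
            f"graph has {len(node_kinds)} node(s); expected at least {minimum}"
        )
    return len(node_kinds)


without_clone = ["aten::abs"]               # test script forgot .clone()
with_clone = ["aten::abs", "aten::clone"]   # test script calls .clone()
```

Here `check_node_count(with_clone)` passes, while `check_node_count(without_clone)` raises and flags the missing `.clone()`.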

Reviewed By: hlu1

Differential Revision: D30058050

fbshipit-source-id: 5bc053736f1cc6fd1ffcf8254bf38874ac18c34b
2021-08-03 15:55:57 -07:00
Raghavan Raman
b91a917616 [Static Runtime] Fixed another build failure in OSS due to test_utils.h (#62338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62338

Test Plan: Imported from OSS

Reviewed By: d1jang

Differential Revision: D29965744

Pulled By: navahgar

fbshipit-source-id: cf3e54ac13432ea8afc4b718fac6c9768743d01b
2021-07-28 11:41:33 -07:00
Don Jang
68efa186cc [static runtime] Implement aten::full (#62227)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62227

Test Plan: Added `StaticRuntime.IndividualOps_Full` to cover the newly added code path.

Reviewed By: hlu1

Differential Revision: D29923649

fbshipit-source-id: 722950137c35ae325590a670b97f03b395e8eac3
2021-07-28 09:50:27 -07:00
Mike Iovine
e1bee3eb30 [Static Runtime] Add missing unit tests for static runtime ops (#62238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62238

Added tests for the following ops:

* `aten::mul`
* `aten::nan_to_num`
* `aten::stack`
* `aten::relu`
* `aten::tanh`

Reviewed By: hlu1

Differential Revision: D29914217

fbshipit-source-id: 6a6c39629310e7131127e24fdce7253ccdf80340
2021-07-27 14:12:21 -07:00
Raghavan Raman
60070982d2 [Static Runtime] Fixed build failure in OSS due to test_utils (#62216)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62216

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D29917514

Pulled By: navahgar

fbshipit-source-id: 379863e6cd0b157de3bfa1482f5519b26654b3d2
2021-07-26 16:10:10 -07:00
Mike Iovine
6007ad3529 [Static Runtime] Refactor fb op tests to use testStaticRuntime (#62064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62064

`testStaticRuntime` was previously only available in `test_static_runtime.cc`. It has been moved to a common library, `test_utils`, to facilitate code reuse. This also lets us test dynamic shapes in `test_fb_operators`.

Reviewed By: hlu1

Differential Revision: D29858928

fbshipit-source-id: 68a94760166ddb745972b0f1fc24bed594937d1c
2021-07-26 08:25:10 -07:00