Commit Graph

211 Commits

Author SHA1 Message Date
mikeiovine
56c23f5633 [SR] Out variant for embedding_bag_byte_unpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77661

Add an out variant and wrapper in static runtime.

I just added the declaration with the others in `qembeddingbag.h` for now (rather than properly adding the out variant to the torch library). This can be fixed in a followup.

Differential Revision: [D36449840](https://our.internmc.facebook.com/intern/diff/D36449840/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36449840/)!

Approved by: https://github.com/tenpercent
2022-05-25 23:24:11 +00:00
Natalia Gimelshein
225b037df8 port clamp.Tensor to structured (#77149)
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77149
Approved by: https://github.com/ezyang
2022-05-11 21:00:02 +00:00
Max Podkorytov
a41d4f27d7 [static-runtime] refactor out variant for aten::embedding_bag (#76207)
Differential Revision: D35767504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76207
Approved by: https://github.com/mikeiovine
2022-05-11 18:29:18 +00:00
mikeiovine
02713221e3 [SR] Fuse clamp/nan_to_num
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094

Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor.

Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)!

Approved by: https://github.com/navahgar
2022-05-10 23:33:59 +00:00
Mike Iovine
9e32cdeda6 [SR] Use at::DimVector in reshape_copy_out (#76473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76473

Avoid some extra heap allocations by using DimVector
ghstack-source-id: 155569314

Test Plan: Existing unit tests

Reviewed By: navahgar, huiguoo

Differential Revision: D35972439

fbshipit-source-id: 971998d6bcaaf9bb598772f1e2ca6b13f29f92a4
(cherry picked from commit f2b70c38fffe6355cd8b2f0eb36f299c0d50e5d8)
2022-05-05 17:31:54 +00:00
Natalia Gimelshein
122798916f Port clamp_min and clamp_max to structured
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76874
Approved by: https://github.com/bdhirsh
2022-05-05 15:52:20 +00:00
Mike Iovine
cac2733af1 [SR] Codegen for aten::clamp (#76340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76340

NNC kernel for `clamp` scalar case
ghstack-source-id: 155466507

Reviewed By: navahgar, huiguoo

Differential Revision: D35904019

fbshipit-source-id: e4115757f7e2cbdf364b88be3f599dfc3028750f
(cherry picked from commit bdc4b918bc5a14490f46c79793f764b28c18388f)
2022-05-04 23:08:49 +00:00
Ansha Yu
ee636e2fd1 [sr] remove max_indices argument of embedding_bag when unncessary (#75993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993

Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
{F723683014}

More details in https://fb.quip.com/MKumAjz1YD4 (1f47a80e88)a#temp:C:FPD3 (ecd5567980)e5a0871ae5d481286b511ef7

The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.

Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`

Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`

Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
        0.11156 ms.    99.0457%. aten::embedding_bag (52 nodes, out variant)

After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
      0.0789221 ms.    98.7096%. static_runtime::embedding_bag (52 nodes, out variant)

* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/

Reviewed By: mikeiovine

Differential Revision: D35726594

fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
2022-04-22 15:36:35 +00:00
Yukio Siraichi
22a10ce513 Port cat kernel to structured kernels.
Tracking issue: #55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68640

Approved by: https://github.com/ezyang
2022-04-14 17:49:43 +00:00
Don Jang
85e163c56b [Static Runtime] Fix a bug that aten::full_like reuses a tensor that does not match arguments (#74255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255

This change fixes a bug that `aten::full_like` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full_like` are dynamically changed.

Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path.

Reviewed By: mikeiovine

Differential Revision: D34863639

fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608
(cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)
2022-04-05 22:30:41 +00:00
Raghavan Raman
60bda4d06b [Static Runtime] Fix handling relu in quantized linear relu dynamic op
Summary:
The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter.

At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear if that semantics is being followed in this code. So, we are handling this in SR's out-variant implementation, until the owner fixes that issue.

This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model.

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16
```

Reviewed By: mikeiovine

Differential Revision: D35366309

fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9
(cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)
2022-04-04 22:16:22 +00:00
Mike Iovine
688039859f [PyTorch][Static Runtime] out variant for where.self (#73438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73438

Add out variant for `where.self`; requires PyTorch core changes as no out variant existed previously
ghstack-source-id: 151505601

Test Plan:
* Existing `where` tests in static runtime pass
* CI for core `where` tests

Reviewed By: hlu1

Differential Revision: D34469785

fbshipit-source-id: 8a4ebbf38b2364534fbf43812bfcfdf69ea174b3
(cherry picked from commit d3faf61f408a385d67b5b821dfaf084a8e713f30)
2022-03-17 00:14:11 +00:00
Don Jang
6294a2eb7f [Static Runtime] Add out variant wrapper for aten::index_select (#74321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74321

This change adds out variant wrapper for aten::index_select.

Test Plan: Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34928012

fbshipit-source-id: d808363d740d79fa25abee4dd33920fbb6ec7283
(cherry picked from commit ba9b3c0cd4ba240c4a2174f3376580a1880b2b4a)
2022-03-16 23:43:21 +00:00
Mike Iovine
f14a0be302 [SR] Avoid allocating rstd/mean in layer_norm (#73606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606

The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116

Test Plan: Existing unit tests

Reviewed By: hlu1

Differential Revision: D34562131

fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
2022-03-15 22:07:11 +00:00
Don Jang
381c0c080f [Static Runtime] Fix a bug that aten::full reuses a tensor that does not match requested one (#73990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990

This change fixes a bug that `aten::full` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full` are dynamically changed.

This fix is applied to multiple other out variant wrappers added to Static Runtime, and their fixes are following.

Test Plan: - Added a unittest.

Reviewed By: mikeiovine

Differential Revision: D34768718

fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9
(cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)
2022-03-15 16:13:52 +00:00
Don Jang
1b80f609b0 [Static Runtime] Add out variant wrapper for aten::ones_like (#73945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945

This change adds add out variant wrapper for aten::ones_like.

Test Plan:
- Added a unittest.
- Checked that the op execution got switched to its added out variant (P485330978).

Reviewed By: hlu1

Differential Revision: D34727057

fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5
(cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)
2022-03-14 20:29:58 +00:00
Don Jang
60f22a40ef [Static Runtime] Add out variant wrapper for aten::zeros (#73946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73946

This change adds an out variant wrapper for aten::zeros.

Test Plan:
- Added a unittest.

- Confirmed that the added out variant gets executed by the unittest (P485324923).

Reviewed By: mikeiovine

Differential Revision: D34725843

fbshipit-source-id: 3ac02ba1914c4a51969381e610d4243df65071ed
(cherry picked from commit 368836d51709b7f96c79114984a95606b29766b1)
2022-03-11 00:52:30 +00:00
Mike Iovine
97b20b9b50 [SR][easy] Stack/concat out variants do not segfault on empty inputs (#73704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73704

Empty inputs are invalid for these ops. But while looking for optimizations, I noticed that these ops just segfault when that happens, which is not helpful for users. Added a check/error message.
ghstack-source-id: 150812721

Test Plan: New unit tests

Reviewed By: hlu1

Differential Revision: D34596954

fbshipit-source-id: 6b22a3a255273920210dcd41f54a9d238bbbcc14
(cherry picked from commit 9e950bfffef36c320638662bdb72f19eb805a228)
2022-03-09 00:55:57 +00:00
Don Jang
71961d37bb [Static Runtime] Add out variant wrapper for aten::ones (#73851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73851

This change adds an out variant wrapper for aten::ones

Test Plan: Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34557095

fbshipit-source-id: 0d2ac8d0ad6f73067e28c2cebd3b4a018a9b17ae
(cherry picked from commit cc1dda957b8c3acd71de3aa6054c11a9aab5cfa6)
2022-03-07 20:33:22 +00:00
Don Jang
c62de0ac15 [Static Runtime] [Code Cleanup] Use SROperator for operators' function type (#73450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450

This change uses `SROperator` for operators' function type

Test Plan: N/A

Reviewed By: mikeiovine

Differential Revision: D34483246

fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f
(cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)
2022-03-01 02:30:48 +00:00
Mike Iovine
d398d4d32c [SR] Disable aten::where out variant (#73367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73367

The op is currently bugged w.r.t. a `condition` input that is not the same shape as the others:

```
def forward(self, cond_1d, x, y):
    shape = [-1] + [1] * (x.dim() - 1)
    cond = cond_1d.view(shape)
    return torch.where(cond, x, y).clone()

Condition:
01
00
[ CPUBoolType{2} ]

A:
06 -9
08 -8
[ CPULongType{2,2} ]

B:
-4 05
-5 -2
[ CPULongType{2,2} ]

Actual:
06 05
-5 -2
[ CPULongType{2,2} ]

Expected:
06 -9
-5 -2
[ CPULongType{2,2} ]
```
ghstack-source-id: 149963254

Test Plan: Unit tests exercise broadcasting

Reviewed By: d1jang

Differential Revision: D34454770

fbshipit-source-id: 6ad4c4ca6893d2b87852a17d437437d99ca94ab4
(cherry picked from commit 7135bc40e9fd930c08f5291b7d6b4902ec30005b)
2022-02-26 01:08:45 +00:00
Don Jang
5772b1afbc [Static Runtime] Avoid checks during op execution for TupleConstruct & ListConstruct (#69029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69029

This change optimizes the execution of `prim::TupleConstruct` & `prim::ListConstruct` by performing case analysis at op loading time, not op execution time.

Test Plan:
- Existing unittests

- Ran inline_cvr nets via ptvsc2_predictor_bench with compare_result=1

Reviewed By: swolchok

Differential Revision: D32518670

fbshipit-source-id: 575b29b06eadf77ba9f1be306119fa194d4f21bf
(cherry picked from commit 88cc2253b927267cad33063284e9cc66e0d31e2f)
2022-02-24 16:38:55 +00:00
Raghavan Raman
a7f9e610af [Static Runtime] Adding out-variant support for quantized::linear_relu_dynamic_fp16 (#73238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73238

ghstack-source-id: 149706142

Test Plan:
Tested on video model.
Profile before:
```
4.36922 ms.    17.1047%. quantized::linear_relu_dynamic_fp16 (14 nodes)
```
Profile after:
```
3.80852 ms.    17.1074%. quantized::linear_relu_dynamic_fp16 (14 nodes, out variant)
```

Reviewed By: mikeiovine

Differential Revision: D34287961

fbshipit-source-id: b88e2f3432215eac14fd36f945a4810d29ba1051
(cherry picked from commit 076a766ab471c362af2f2ee3b55fe75829f5e955)
2022-02-23 18:33:46 +00:00
Mike Iovine
6f84c5f0b9 [SR] Generalize VarStackNodeWrapper (#71573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71573

Many ops (`gather_ranges_to_dense`, `sigrid_transforms`, etc) are implemented like this:

```
void op_out_(std::vector<Tensor>& output) {
  // actual op implementation
}

std::vector<Tensor> op() {
  std::vector<Tensor> outputs;
  // populate outputs with empty tensors
  op_out_(outputs)
  return outputs;
}
```

This pattern is not ideal for ops that are fused with `ListUnpack` - it would be better if we wrote to the outputs directly.

This diff extends the ideas from `VarStackNodeWrapper` to allow for this. The changes are:

* `s/VarStackNodeWrapper/ProcessedNodeInputWrapper`. The old name was bad because the class is more general than the `VarStack` use case. Also moved the class to `processed_node_wrapper.h`
* Add a `ProcessedNodeOutputWrapper`; it's essentially the same idea as `ProcessedNodeInputWrapper`, but it allows non-const access to the underlying tensors.
* These classes are very similar, so CRTP is used to facilitate code re-use
ghstack-source-id: 148825800

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`

Reviewed By: swolchok

Differential Revision: D33687965

fbshipit-source-id: 5fa0107211116867bb2b63968c126550d32fbea6
(cherry picked from commit 75c263d960)
2022-02-10 19:43:47 +00:00
Scott Wolchok
958f9cf5ff [PyTorch][Static Runtime] Fix extra refcount bumps in layer_norm (#71237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71237

Noticed these on inspection.
ghstack-source-id: 147171799

Test Plan: CI

Reviewed By: mikeiovine

Differential Revision: D33519799

fbshipit-source-id: 167c63323b345a5822303cecdbbbbb959f66f6e4
(cherry picked from commit 57e8da2d35)
2022-01-20 00:16:17 +00:00
Scott Wolchok
bf82d2012e [PyTorch] Add IValue::toDimVector & mostly replace toIntVector with it (#71247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247

Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
ghstack-source-id: 146933314

Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good?

Reviewed By: mikeiovine

Differential Revision: D33556198

fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
2022-01-14 14:32:40 -08:00
Elias Ellison
c8332256ee [JIT] Refactor SR invocation of fusion (#70508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70508

We can create the code object at compile time instead or runtime to speed it up. This also makes unnecessary the compilation cache. TODO: figure out if theres a way to cache InterpreterState object

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33458648

Pulled By: eellison

fbshipit-source-id: 710389741e7c6210528f2f96ab496fcd533d942a
2022-01-10 12:16:35 -08:00
Mike Iovine
9ad21091dd [SR] Give VarStackNodeWrapper an iterator (#69922)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69922

D32596934 (65f54bc000) made the serial stack implementation a bit brittle. It introduced a new container type: `VarStackNodeWrapper`. This type was used as a template parameter in the serial stack implementation.

The other type used in the serial stack implementation is `at::ArrayRef<at::Tensor>`. Ideally, the interface of `VarStackNodeWrapper` should be as close as possible to this other type. However, because the new container type did not have an iterator, expressions like this would fail to compile:
```
for (const auto& tensor : tensors) {
  // do something
}
```
Introducing this iterator will make the code easier to maintain going forward.

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`

I consider this a `VarStack` implementation detail, so I'd prefer not to test it directly. We can test it implicitly by adding some code to the serial stack implementation that uses the iterator.

Reviewed By: swolchok

Differential Revision: D33101489

fbshipit-source-id: 7cf44c072d230c41bd9113cf2393bc6a6645a5b5
2022-01-07 07:24:47 -08:00
Scott Wolchok
4d8fc8693c [PyTorch][Static Runtime] Support memory planning for torch.to() w/o requiring copying (#67223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223

ghstack-source-id: 146482215

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

Reviewed By: hlu1

Differential Revision: D31776259

fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
2022-01-04 22:36:10 -08:00
Scott Wolchok
99a10c371f [PyTorch][Static Runtime] Fix dtype changing between iterations for to() (#67394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67394

ghstack-source-id: 146464294

Test Plan:
Added new test, which failed but now passes.

Checked perf on ctr_mobile_feed local net (still not on recordio inputs yet), looks neutral

```
Stable, local
========================================

I1027 13:40:23.411118 2156917 PyTorchPredictorBenchLib.cpp:131] PyTorch predictor: number of prediction threads 1
I1027 13:40:48.708222 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16975. Iters per second: 162.081
I1027 13:41:13.915948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.1487. Iters per second: 162.636
I1027 13:41:38.984462 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11408. Iters per second: 163.557
I1027 13:42:04.138948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.13566. Iters per second: 162.982
I1027 13:42:29.342630 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.14269. Iters per second: 162.795
I1027 13:42:29.342669 2156917 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14218, standard deviation: 0.0202164
0

FixToDtypeChanges, local
========================================
I1027 13:44:59.632668 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11023. Iters per second: 163.66
I1027 13:45:24.894635 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16308. Iters per second: 162.257
I1027 13:45:50.275280 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.17868. Iters per second: 161.847
I1027 13:46:15.637431 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.18688. Iters per second: 161.632
I1027 13:46:40.670816 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.10549. Iters per second: 163.787
I1027 13:46:40.670863 2176333 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14887, standard deviation: 0.03843706
```

Reviewed By: hlu1

Differential Revision: D31972722

fbshipit-source-id: 7a445b325a29020b31dd2bd61e4171ecc2793b15
2022-01-04 22:34:49 -08:00
Mike Iovine
6a84449290 [SR] Fast path for VarStack on scalars (#70210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70210

Add a fast-path for `VarStack` nodes for when the inputs are scalars.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarStack`

Reviewed By: hlu1

Differential Revision: D33177498

fbshipit-source-id: 922ab76a6808fbfdb8eb6091163a380344e38de6
2021-12-23 10:31:17 -08:00
Raghavan Raman
633f770c3c [StaticRuntime] Add out-variant support for TensorExprDynamicGroup op (#69479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69479

This diff adds support for out-variant optimization for `TensorExprDynamicGroup` op, which will be used for TensorExpr based fusion in Static Runtime.
ghstack-source-id: 146107008

Test Plan:
```
buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```

Completed accuracy test on inline_cvr model 294738512 v0. Results:
```
get 1012 prediction values
get 1012 prediction values
pyper_inference_e2e_local_replayer_test.out.132ea03c2 pyper_inference_e2e_local_replayer_test.out.1858bbeb0
max_error:  0 % total:  0
```

Reviewed By: d1jang, mikeiovine

Differential Revision: D32768463

fbshipit-source-id: a3e6c1ea9ff5f3b57eb89095aa79a6d426fbb52a
2021-12-22 00:30:22 -08:00
Mike Iovine
65f54bc000 [SR] Optimize VarStack (#68750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68750

There was some room for optimization in static runtime's `prim::VarStack`:

* Avoid refcount bumps - constructing the `std::vector<at::Tensor>` can be avoided by writing a custom version of `stack_out` that takes a `std::vector<at::Tensor*>`

* Skip the memory overlap check

* Avoid device dispatcher overhead in a few places (e.g. `tensor.unsqueeze -> at::native::unsqueeze`)

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`

Reviewed By: swolchok

Differential Revision: D32596934

fbshipit-source-id: e8f0ccea37c48924cb4fccbfdac4e1e11da95ee0
2021-12-20 11:46:11 -08:00
Scott Wolchok
66406ee0f7 [PyTorch][Static Runtime] Fix to() w/dtype bool (#69935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69935

Didn't realize that `AT_DISPATCH_ALL_TYPES` should really be called `AT_DISPATCH_MOST_TYPES`.
ghstack-source-id: 145661358

Test Plan:
Added test for dtype bool.

Ran CMF local_ro net:

before:

```
I1215 12:33:49.300174 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.966491. Iters per second: 1034.67
I1215 12:33:49.825570 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.94867. Iters per second: 1054.11
I1215 12:33:50.349246 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947926. Iters per second: 1054.93
I1215 12:33:50.870433 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.943779. Iters per second: 1059.57
I1215 12:33:51.393702 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947185. Iters per second: 1055.76
I1215 12:33:51.915666 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.945672. Iters per second: 1057.45
I1215 12:33:52.438475 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948407. Iters per second: 1054.4
I1215 12:33:52.965337 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95472. Iters per second: 1047.43
I1215 12:33:53.494563 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.967083. Iters per second: 1034.04
I1215 12:33:54.017879 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948945. Iters per second: 1053.8
I1215 12:33:54.017930 1606538 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.951888, standard deviation: 0.0083367
```

after:
```
I1215 12:32:35.820874 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.999845. Iters per second: 1000.15
I1215 12:32:36.343147 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944363. Iters per second: 1058.91
I1215 12:32:36.863806 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.942542. Iters per second: 1060.96
I1215 12:32:37.385459 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944677. Iters per second: 1058.56
I1215 12:32:37.905436 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941135. Iters per second: 1062.55
I1215 12:32:38.424907 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.939748. Iters per second: 1064.11
I1215 12:32:38.944643 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941764. Iters per second: 1061.84
I1215 12:32:39.463791 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.938946. Iters per second: 1065.02
I1215 12:32:39.987567 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95437. Iters per second: 1047.81
I1215 12:32:40.511204 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.959139. Iters per second: 1042.6
I1215 12:32:40.511242 1594955 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.950653, standard deviation: 0.0184761
```

Reviewed By: hlu1

Differential Revision: D33106675

fbshipit-source-id: 5bb581f8d0ed22ef08df1936dc8d67045e44e862
2021-12-15 15:26:56 -08:00
Mike Iovine
102684b252 [SR] Fix stack/concat bug (#68777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68777

Fixed some cases where negative dimensions were not handled correctly

* `_stack_cpu` calls `maybe_wrap_dim`, but `_stack_cpu_out` does not. This is only problematic when `_stack_cpu_out` forwards to the serial kernel: [ref](https://www.internalfb.com/code/fbsource/[1b5af978b48f2e5d308d42b588bde3275869a57b]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=1541-1547).
* concat also needs to wrap its dim

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Added new tests to cover this case

Reviewed By: hlu1

Differential Revision: D32604623

fbshipit-source-id: 00aaa42817cd2d3e7606ce75ab5a9744645118cf
2021-12-14 16:26:27 -08:00
Don Jang
c97dc9286d Revert D32780415: [Static Runtime] Move implementation details from impl.h into internal.h
Test Plan: revert-hammer

Differential Revision:
D32780415 (999e93e6a8)

Original commit changeset: 119b7aedbf56

fbshipit-source-id: 1aa777e8c1854ab27e86bc625188f7170097fac8
2021-12-04 19:44:07 -08:00
Don Jang
999e93e6a8 [Static Runtime] Move implementation details from impl.h into internal.h (#69274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69274

`impl.h` is the main header file that defines the interface of Static Runtime to its clients.

However, it is currently filled with implementation details that should not be leaked to our clients. 1) this can unnecessarily leak our internals to our clients which can make it hard to change them later 2) cause unnecessary merge conflicts when multiple people are touching this enormous impl.cpp file.

To alleviate the situation, this change moves the implementation details from impl.h into a new file, internal.h, that's internally kept without leaking the details to our clients.

This change will be followed by another change to rename `impl.h` into `runtime.h` or anything better since `impl.h` is currently not about implementation but SR's interface.

Note that this change is NOT complete since the remaining declarations in impl.h still contain a lot of implementation details. Therefore, we should keep working on minimizing the interface to prevent our API from being bloated unnecessarily. Also we need to work on modularizing our implementations into separate pieces organized by separate files in the near future.

Test Plan: Existing unittests

Reviewed By: donaldong

Differential Revision: D32780415

fbshipit-source-id: 119b7aedbf563b195641c5674572a9348732145f
2021-12-04 14:48:28 -08:00
Scott Wolchok
14ed4df305 [PyTorch][Static Runtime][easy] give to_copy_functor a name (#67701)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67701

I split this out to ease rebasing and review.
ghstack-source-id: 144507288

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D32112523

fbshipit-source-id: dba14e6ada33df02dbcd7025b090a8a18cf438ae
2021-12-02 10:36:26 -08:00
Don Jang
cd3e37cbe4 [Static Runtime] [Code Cleanup] Reduce indentation depth in ops.cpp (#69028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69028

This change converts

```
if (..) {
 ...
} else {
 ...
}
# end of function
```

into

```
if(...) {
  ...
  return;
}
...
```
in ops.cpp to remove the else branch to reduce the indentation depth by 1 for better readability.

Test Plan: N/A

Reviewed By: hlu1

Differential Revision: D32506235

fbshipit-source-id: a4fd5188bd680dba5dcad2b6e873735a54497664
2021-11-30 09:41:46 -08:00
Ansha Yu
7342b654a1 [static runtime] dequantize out variant (#68664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68664

Reland D32187063 (f120335643), fixing lint
Add out variant for aten::dequantize

Test Plan:
Test on inline_cvr model
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/294738512/294738512_0.predictor.disagg.local --recordio_inputs=/data/users/ansha/tmp/adfinder/294738512/294738512_0_local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=1 --iters=5 --warmup_iters=5 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Before:
0.047472 ms.   0.409729%. aten::dequantize (9 nodes)

After
0.0307179 ms.   0.267204%. static_runtime::dequantize_copy (9 nodes, out variant)

Test on ctr_mbl_feed model 307210374 on 696 inputs

Before:
0.0569016 ms.   0.296647%. aten::dequantize (10 nodes)

After:
0.0423128 ms.   0.220481%. static_runtime::dequantize_copy (10 nodes, out variant)

Reviewed By: mikeiovine

Differential Revision: D32566429

fbshipit-source-id: b95dfc4c5e4115e083794093bc1571c7b1d72f5b
2021-11-30 09:03:26 -08:00
Alban Desmaison
748d9d2494 Revert D32187063: [static runtime] dequantize out variant
Test Plan: revert-hammer

Differential Revision:
D32187063 (f120335643)

Original commit changeset: 1fec6b74c7d3

fbshipit-source-id: 9770f8379e9ddba9e537fef0e66cc93c2caaf860
2021-11-18 10:12:31 -08:00
Ansha Yu
f120335643 [static runtime] dequantize out variant (#67873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67873

Add out variant for aten::dequantize

Test Plan:
Test on inline_cvr model
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/294738512/294738512_0.predictor.disagg.local --recordio_inputs=/data/users/ansha/tmp/adfinder/294738512/294738512_0_local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=1 --iters=5 --warmup_iters=5 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Before:
0.047472 ms.   0.409729%. aten::dequantize (9 nodes)

After
0.0307179 ms.   0.267204%. static_runtime::dequantize_copy (9 nodes, out variant)

Reviewed By: hlu1

Differential Revision: D32187063

fbshipit-source-id: 1fec6b74c7d3f25d0f445775c4558d30c55dcece
2021-11-18 09:31:27 -08:00
Scott Wolchok
6acde23bec [PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934

This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode.
ghstack-source-id: 143429113

Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.

Previous diff, CMF local:

```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```

This diff, CMF local:

```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
72912 bytes (17%) savings
```

Perf looks neutral; see next diff (D32216573) test plan for details.

Reviewed By: hlu1

Differential Revision: D32190751

fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
2021-11-16 10:19:50 -08:00
David Berard
b546cdf401 [SR] Out variant for prim::NumToTensor (#67856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67856

Returns a tensor constructed from scalar input

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Ran
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=*NumToTensorScalar* --v=1
```
and the output contains `Switch to out variant for node: %2 : Tensor = prim::NumToTensor(%0)`.

Reviewed By: mikeiovine

Differential Revision: D32014194

fbshipit-source-id: e7df65ea1bf05d59c1fc99b721aee420e484f542
2021-11-08 09:02:58 -08:00
Bin Wen
1baed45c6b [fbcode][static runtime] out-variant for quantized::linear_dynamic_fp16 (#67663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67663

mostly follow the example of quantized::linear (D28428734 (4d7abdbdad)) to enable out-variant for quantized::linear_dynamic_fp16.

Reason being from MP tab ctr pytorch model migration, we observe quantized::linear_dynamic_fp16 operator has highest cost but not enable out-variant yet https://fburl.com/phabricator/b5juus2d

Test Plan:
buck build mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench

  sudo watch -n 20 /usr/local/fbprojects/dynamoserver/bin/turboDriver disable

  MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/home/bwen/models/991103061_4/991103061_4.predictor --pt_inputs=/home/bwen/models/991103061_4/pt_inputs --method_name=forward --pt_cleanup_activations=1 --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=1000 --num_threads=1 --repetitions=3 --do_profile=1 --do_benchmark=1 --set_compatibility=1 --compare_results=1 --pt_enable_static_runtime 2>&1 | pastry

before: P465201159

  0.929067 ms.     31.808%. quantized::linear_dynamic_fp16 (16 nodes)
  0.921679 ms.    31.7324%. quantized::linear_dynamic_fp16 (16 nodes)
  0.919127 ms.    31.7404%. quantized::linear_dynamic_fp16 (16 nodes)

after: P465203015

  0.90898 ms.    31.0205%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
  0.9127 ms.      30.62%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
  0.879148 ms.    31.0161%. quantized::linear_dynamic_fp16 (16 nodes, out variant)

unit test logic refers https://fburl.com/code/vv0rry13

  buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D32001168

fbshipit-source-id: 873d9f77434b9c4bafb298c871173f9a560dd2a3
2021-11-03 22:39:04 -07:00
Scott Wolchok
7cd62621fb [PyTorch] Adopt faster Tuple::create (#65381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65381

The previous diff adds a way to make Tuples of size 3 or less
more efficiently. This diff makes it easier to hit that path and
updates a bunch of callsites to hit it.
ghstack-source-id: 142065832

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D31069538

fbshipit-source-id: d04da3709594ed68ab1c0a1471f8cffd8d001628
2021-11-02 10:10:31 -07:00
Mike Iovine
7da9c4ed2e [SR] NNC out variant for aten::where (#67255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67255

Add an out variant for `aten::where`.

Since this op can be implemented quite trivially in NNC with `ifThenElse`, I added an NNC kernel as well.

Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: navahgar

Differential Revision: D31923886

fbshipit-source-id: b4379ee3aaf31a000e626b4caeafd3e3f3d60837
2021-10-28 06:48:22 -07:00
Hao Lu
9ebc6357b3 [SR] Vectorize int version of fmod (#67313)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67313

Reviewed By: swolchok

Differential Revision: D31889868

fbshipit-source-id: a0af399431a0d672fa56cf2f2ba6d548c47bcedd
2021-10-27 17:02:53 -07:00
Mike Iovine
d5f64afc38 [Static Runtime] Support aten::to.prim_dtype overload (#64928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64928

Added support this overload of `aten::to`:
```
aten::to.prim_dtype(Tensor(a) self, int? dtype, bool non_blocking=False, bool copy=False) -> Tensor(a|b)
```

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_to`

Reviewed By: hlu1

Differential Revision: D30901398

fbshipit-source-id: 38ce807c30185e92dd472b404b362f22ac7e4efb
2021-10-07 10:22:44 -07:00
Scott Wolchok
ffede499b2 [PyTorch][Static Runtime] Fast path for contiguous to_copy (#65499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499

When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc.
ghstack-source-id: 139549027

Test Plan:
Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options).

Before:

I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252
I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871
I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916
I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093
I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971

mean: 7.144172, stddev: 0.1283

After:

I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122
I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579
I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033
I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833
I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85

mean: 6.800632, stddev: 0.013227

Looks like about a 5.3% improvement.

Reviewed By: hlu1

Differential Revision: D31125492

fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823
2021-10-01 12:13:33 -07:00