Commit Graph

163 Commits

Mike Iovine
d5f64afc38 [Static Runtime] Support aten::to.prim_dtype overload (#64928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64928

Added support for this overload of `aten::to`:
```
aten::to.prim_dtype(Tensor(a) self, int? dtype, bool non_blocking=False, bool copy=False) -> Tensor(a|b)
```
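
A hedged sketch of how an SR op functor might service this overload, using the `ProcessedNode` conventions seen elsewhere in this log (argument indices follow the schema above; this is an illustration, not the PR's literal code):
```
return [](ProcessedNode* p_node) {
  const auto& self = p_node->Input(0).toTensor();
  const auto dtype = p_node->Input(1).toOptional<at::ScalarType>();
  const auto non_blocking = p_node->Input(2).toBool();
  const auto copy = p_node->Input(3).toBool();
  // With no dtype given, `to` only does work when copy=True.
  p_node->Output(0) = dtype.has_value()
      ? self.to(*dtype, non_blocking, copy)
      : (copy ? self.clone() : self);
};
```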

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_to`

Reviewed By: hlu1

Differential Revision: D30901398

fbshipit-source-id: 38ce807c30185e92dd472b404b362f22ac7e4efb
2021-10-07 10:22:44 -07:00
Scott Wolchok
ffede499b2 [PyTorch][Static Runtime] Fast path for contiguous to_copy (#65499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499

When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc.
ghstack-source-id: 139549027
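
The core idea, as a minimal sketch (assuming matching dtype and element count; the actual PR has more checks): a copy between two contiguous tensors degenerates to a flat memcpy, so dispatch and TensorIterator setup can be skipped entirely.
```
#include <ATen/ATen.h>
#include <cstring>

void fast_copy_contiguous(at::Tensor& dst, const at::Tensor& src) {
  TORCH_CHECK(src.is_contiguous() && dst.is_contiguous());
  TORCH_CHECK(src.dtype() == dst.dtype() && src.numel() == dst.numel());
  std::memcpy(dst.data_ptr(), src.data_ptr(), src.nbytes());
}
```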

Test Plan:
Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options).

Before:

I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252
I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871
I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916
I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093
I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971

mean: 7.144172, stddev: 0.1283

After:

I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122
I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579
I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033
I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833
I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85

mean: 6.800632, stddev: 0.013227

Looks like about a 5.3% improvement.

Reviewed By: hlu1

Differential Revision: D31125492

fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823
2021-10-01 12:13:33 -07:00
Mike Iovine
c975ca4337 [Static Runtime] Simplify out variant overload implementations (#65384)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65384

The following pattern appears frequently in `ops.cpp`:

```
if (!n->matches(schema_1) && !n->matches(schema_2) && ... && !n->matches(schema_n)) {
    LogAndDumpSchema(n);
    return nullptr;
}

return [](ProcessedNode* p_node) {
    if (p_node->Output(0).isNone()) {
        if (p_node->Input(i).isSomeType()) {
            // special logic for schema 1
        } else if (p_node->Input(i).isSomeOtherType()) {
            // special logic for schema 2
        } else if (...) {
            // special logic for schema3
        }
        // and so on
    } else {
        // another complicated type checking chain
    }
};
```

A much cleaner way to implement operator overloads is like this:
```
if (n->matches(schema_1)) {
    return schema_1_impl;
} else if (n->matches(schema_2)) {
    return schema_2_impl;
}
// and so on
```

This has a few advantages:
* Significantly reduces complexity of the out variant implementations, especially for ops with more than 2 overloads. One implementation corresponds to one schema. This makes the implementation more readable/maintainable.
* Adhering to this convention makes it easier to add a new overload. Just add a new `n->matches(...)` case instead of working the schema into existing complicated logic.
* Ops are marginally faster since we don't have to check types at runtime.

Note: there are a few cases where this actually made the code less concise (`aten::div`), so I left those ops untouched.
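
As a concrete illustration of the convention, a hedged sketch of a two-overload op in the `REGISTER_OPERATOR_FUNCTOR` style (the schema strings and bodies are approximations, not this diff's literal code):
```
REGISTER_OPERATOR_FUNCTOR(aten::remainder, aten_remainder, [](Node* n) -> SROperator {
  if (n->matches(torch::schema(
          "aten::remainder.Tensor(Tensor self, Tensor other) -> Tensor"))) {
    return [](ProcessedNode* p_node) {
      const auto& self = p_node->Input(0).toTensor();
      const auto& other = p_node->Input(1).toTensor();
      if (p_node->Output(0).isNone()) {
        p_node->Output(0) = at::remainder(self, other);
        return;
      }
      at::remainder_out(p_node->Output(0).toTensor(), self, other);
    };
  }
  if (n->matches(torch::schema(
          "aten::remainder.Scalar(Tensor self, Scalar other) -> Tensor"))) {
    return [](ProcessedNode* p_node) {
      const auto& self = p_node->Input(0).toTensor();
      const auto other = p_node->Input(1).toScalar();
      if (p_node->Output(0).isNone()) {
        p_node->Output(0) = at::remainder(self, other);
        return;
      }
      at::remainder_out(p_node->Output(0).toTensor(), self, other);
    };
  }
  // One lambda per schema; anything unrecognized falls through.
  LogAndDumpSchema(n);
  return nullptr;
});
```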

Thanks to d1jang for pointing this out in another diff.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31072328

fbshipit-source-id: c40a4f7e6a79881e94c9ec49e9008ed75cfc8688
2021-09-29 12:02:11 -07:00
Kushashwa Ravi Shrimali
4752453d27 [Structured Kernels] Port for baddbmm and bmm (#64805)
Summary:
This PR attempts to port `baddbmm` and `bmm` to structured kernels. Both ops are in the same PR because they share a lot of code, including the checks and the implementation.

Issue tracker: https://github.com/pytorch/pytorch/issues/55070

cc: ysiraichi ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805

Reviewed By: gchanan

Differential Revision: D31134454

Pulled By: ezyang

fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261
2021-09-28 11:07:31 -07:00
Mike Iovine
ef9e560796 [Static Runtime] Add aten::remainder out variant (#64967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64967

Out variant implementation for `aten::remainder`. Added both scalar and tensor overloads.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Remainder`

Reviewed By: d1jang

Differential Revision: D30915469

fbshipit-source-id: 9f27f18c86d66b11eac0aa4659c7062cb785b7e9
2021-09-24 07:51:39 -07:00
Raghavan Raman
31584d065e [Static Runtime] Added NNC implementation for signed log1p kernel. (#65387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387

Added a customized NNC implementation for the signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.
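
For reference, the fused op computes `sign(x) * log1p(|x|)` in a single pass over the tensor; a minimal eager-mode sketch of the same math (not the NNC kernel itself):
```
#include <ATen/ATen.h>

at::Tensor signed_log1p_ref(const at::Tensor& x) {
  return at::sign(x) * at::log1p(at::abs(x));
}
```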

Also added an SR microbenchmark for this kernel, which shows the performance improvement.

Without fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                             1953 ns       1953 ns     358746
BM_signed_log1p/64                             2049 ns       2049 ns     342145
BM_signed_log1p/512                            3291 ns       3291 ns     214342
BM_signed_log1p/4096                          15559 ns      15559 ns      44420
BM_signed_log1p/32768                        101936 ns     101935 ns       6843
BM_signed_log1p/65536                        194792 ns     194789 ns       3615
```

With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                              369 ns        369 ns    1896179
BM_signed_log1p/64                              497 ns        497 ns    1406995
BM_signed_log1p/512                            1618 ns       1618 ns     430209
BM_signed_log1p/4096                          11327 ns      11326 ns      61463
BM_signed_log1p/32768                         84099 ns      84086 ns       8325
BM_signed_log1p/65536                        166531 ns     166510 ns       4186
```

This clearly shows >15% improvement in performance of this kernel with NNC fusion.

On the inline_cvr local model, there is a small improvement in the profiled time spent on these ops:
  without fusion: `0.9%` (computed by adding up the % spent on all 4 ops involved)
  with NNC fusion: `0.55%`

Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)

```
get 57220 prediction values
get 57220 prediction values
max_error:  0  total:  0
```

Reviewed By: hlu1

Differential Revision: D30609492

fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
2021-09-22 15:53:33 -07:00
Harut Movsisyan
9ad75281f6 [Static Runtime] Fix resize_output_check warning coming from prim::VarConcat (#64765)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64765

Test Plan: Tested the fix with BR v1 model predictor-replayer setup.

Reviewed By: ajyu

Differential Revision: D30846506

fbshipit-source-id: 3ef3c93f11285c7cd1e2b188ca298a7ab4fba579
2021-09-09 14:38:50 -07:00
Mike Iovine
616fd9219d [Static Runtime] Add sign/abs/lop1p/mul fusion pass (#64209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209

Add a new fusion pass that transforms the following pattern:
```
graph(%input):
    %0 : Tensor = aten::sign(%input)
    %1 : Tensor = aten::abs(%input)
    %2 : Tensor = aten::log1p(%1)
    %res : Tensor = aten::mul(%0, %2)
    return (%res)
```
Into a single op:
```
graph(%input):
    %res : Tensor = static_runtime::signed_log1p(%input)
    return (%res)
```

The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff **does not** enable this pass.

Followup: navahgar will add an NNC kernel which is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes.

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Test passed with new graph pass disabled and enabled.

Reviewed By: hlu1

Differential Revision: D30559929

fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a
2021-09-02 08:31:40 -07:00
Ray Peng
09e610e36d [Static Runtime] Out version for softmax (#64243)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64243

Test Plan:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
...
V0830 16:35:22.524479 613839 impl.cpp:1410] Switch to out variant for node: %5 : Tensor = aten::softmax(%a.1, %dim.1, %dtype.1)
...
[       OK ] StaticRuntime.IndividualOps_Softmax (803 ms)
```

Reviewed By: hlu1

Differential Revision: D30656149

fbshipit-source-id: 115b7b4a75448fd6a5c526808080ca9a4251302c
2021-08-31 18:33:26 -07:00
Harut Movsisyan
3c15822f5f [Static Runtime] Implement aten::nonzero out variant (#64126)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64126

Test Plan:
Confirm out variant is called:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: mikeiovine

Differential Revision: D30617729

fbshipit-source-id: 752749638c8f467815efa57021cb3de5c728ab1b
2021-08-31 00:51:15 -07:00
Harut Movsisyan
1f16c22dc8 [Static Runtime] Implement aten::cumsum out variant (#64159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159

Test Plan:
Confirm out variant is called for both versions:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: mikeiovine

Differential Revision: D30622819

fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736
2021-08-30 16:18:22 -07:00
Harut Movsisyan
e24c3644d8 [Static Runtime] aten::cat out version when it is not being replaced by prim::VarConcat (#64157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157

The UseVariadicCat optimization is not applied to aten::cat if the list input to the op cannot be moved to a position before the op (https://fburl.com/diffusion/l6kweimu). For these cases we need an out version for SR.
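
At the ATen level, the out version boils down to `at::cat_out`; a small sketch of reusing a preallocated output across iterations (the SR glue around it is omitted):
```
#include <ATen/ATen.h>
#include <vector>

// `out` is resized and written in place on every call, which is what
// lets the SR memory planner reuse its buffer.
at::Tensor& cat_into(at::Tensor& out, const std::vector<at::Tensor>& inputs) {
  return at::cat_out(out, inputs, /*dim=*/0);
}
```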

Test Plan:
Confirm out variant is called:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: d1jang

Differential Revision: D30598574

fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a
2021-08-30 09:42:38 -07:00
Harut Movsisyan
8af1407eab [Static Runtime] Out version for torch.linalg.norm (#64070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070

Test Plan:
Confirm out variant is called for both versions:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: d1jang

Differential Revision: D30595816

fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563
2021-08-29 21:00:11 -07:00
Mike Iovine
07c5cb8c48 [Static Runtime] Optimize memory planner initialization (#64101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64101

Checking `getOutOfPlaceOperation(n)` is a very expensive operation, especially in multithreaded environments, due to a lock acquisition when the NNC cache is queried. This slows down the memory planner initialization time, and by extension, the latency for the first static runtime inference.

There are two optimizations in this diff:
* Cache the result of `p_node->has_out_variant()` to avoid the call to `getOutOfPlaceOperation`. This speeds up calls to `canReuseInputOutputs`, which in turn speeds up `isOptimizableContainerType` (see the sketch below)
* Precompute all `isOptimizableContainerType` during static runtime initialization to avoid a pass over all of each node's inputs.
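
A hedged sketch of the caching idea from the first bullet (names are approximate; the real change lives in the `ProcessedNode` machinery):
```
#include <functional>
#include <torch/csrc/jit/ir/ir.h>

struct ProcessedNodeSketch;
using SROperatorSketch = std::function<void(ProcessedNodeSketch*)>;

// Assumed external: the expensive, lock-guarded lookup described above.
SROperatorSketch getOutOfPlaceOperation(torch::jit::Node* n);

struct ProcessedNodeSketch {
  explicit ProcessedNodeSketch(torch::jit::Node* n)
      : fn_(getOutOfPlaceOperation(n)),
        has_out_variant_(static_cast<bool>(fn_)) {}
  // A plain field read; no lock, no NNC cache query.
  bool has_out_variant() const { return has_out_variant_; }

  SROperatorSketch fn_;
  bool has_out_variant_;
};
```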

Test Plan: All unit tests pass: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: movefast1990

Differential Revision: D30595579

fbshipit-source-id: 70aaa7af9589c739c672788bf662f711731864f2
2021-08-27 17:40:43 -07:00
Don Jang
9f1f22b9bc [Static Runtime] Add out variant of quantized::embedding_bag_byte_prepack (#64081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64081

This change adds an out variant of `quantized::embedding_bag_byte_prepack`.

Test Plan:
- Added `ShapeInferenceTest.QEmbeddingBagByteUnpack`.

- Observed

```
V0824 13:38:49.723708 1322143 impl.cpp:1394] Switch to out variant for node: %2 : Tensor = quantized::embedding_bag_byte_prepack(%input)
```

Reviewed By: hlu1

Differential Revision: D30504216

fbshipit-source-id: 1d9d428e77a15bcc7da373d65e7ffabaf9c6caf2
2021-08-27 10:53:23 -07:00
Harut Movsisyan
f2c47cf4db [Static Runtime] Out version for fmod (#64046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64046

Test Plan:
Confirm out variant is used:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1

V0826 23:31:30.321382 193428 impl.cpp:1395] Switch to out variant for node: %4 : Tensor = aten::fmod(%a.1, %b.1)
```

Reviewed By: mikeiovine

Differential Revision: D30581228

fbshipit-source-id: dfab9a16ff8afd40b29338037769f938f154bf74
2021-08-27 03:05:06 -07:00
Don Jang
c90b3cb1da [Static Runtime] Manage temporary Tensors for aten::layer_norm (#64078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078

This change converts `aten::layer_norm -> output Tensor` into `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the static runtime manages the `tmp1` and `tmp2` Tensors.

Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
    at::Tensor mean = create_empty_from({M}, *X);
    at::Tensor rstd = create_empty_from({M}, *X);
```
that the static runtime misses an opportunity to manage.

This change puts them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime currently chooses to manage only output tensors.

Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.

- Confirmed that the new op gets activated during testing:

```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)

```

Reviewed By: hlu1

Differential Revision: D30486475

fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
2021-08-27 02:44:43 -07:00
Hao Lu
3c3bba4169 [Static Runtime] Use F14FastMap/F14FastSet (#63999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63999

Use folly::F14FastMap/F14FastSet instead of std::unordered_map/unordered_set in the Static Runtime code base. folly::F14FastMap/F14FastSet implement the same APIs as std::unordered_map/unordered_set but are faster. For details see https://github.com/facebook/folly/blob/master/folly/container/F14.md
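
Usage is a drop-in swap; a minimal sketch:
```
#include <folly/container/F14Map.h>
#include <folly/container/F14Set.h>
#include <string>

int main() {
  folly::F14FastMap<std::string, int> counts; // was std::unordered_map
  folly::F14FastSet<std::string> seen;        // was std::unordered_set
  counts["aten::add"] += 1;
  seen.insert("aten::add");
  return (counts.count("aten::add") && seen.count("aten::add")) ? 0 : 1;
}
```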

Reviewed By: d1jang

Differential Revision: D30566149

fbshipit-source-id: 20a7fa2519e4dde96fb3fc61ef6c92bf6d759383
2021-08-27 01:40:41 -07:00
Ansha Yu
3f1c809470 [static runtime] port c2 argmin kernel (#63632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63632

Local benchmarking with 1 input repeated for 10k iterations on the 290331537_4 local net. Reduces argmin runtime by about 80% and local net execution time by ~0.71-0.77 ms.

Before:
```
I0826 17:25:53.972786 1104614 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 7.37599. Iters per second: 135.57
```
```
Static runtime ms per iter: 8.22086. Iters per second: 121.642
Time per node type:
        4.13527 ms.    50.9157%. fb::sigrid_transforms_torch_bind (1 nodes, out variant)
       0.868506 ms.    10.6935%. aten::argmin (1 nodes, out variant)
...
```

After:
```
I0826 17:17:54.165174 1064079 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.66724. Iters per second: 149.987
```
```
Static runtime ms per iter: 7.68172. Iters per second: 130.179
Time per node type:
         4.1452 ms.    54.0612%. fb::sigrid_transforms_torch_bind (1 nodes, out variant)
       0.656778 ms.    8.56562%. fb::quantized_linear (8 nodes)
       0.488229 ms.    6.36741%. static_runtime::to_copy (827 nodes, out variant)
       0.372678 ms.    4.86042%. aten::argmin (1 nodes, out variant)
...

Time per node type:
        3.39387 ms.    53.5467%. fb::sigrid_transforms_torch_bind (1 nodes, out variant)
       0.636216 ms.    10.0379%. fb::quantized_linear (8 nodes, out variant)
       0.410535 ms.    6.47721%. fb::clip_ranges_to_gather_to_offsets (304 nodes, out variant)
       0.212721 ms.     3.3562%. fb::clip_ranges_gather_sigrid_hash_precompute_v3 (157 nodes, out variant)
       0.173736 ms.    2.74111%. aten::matmul (1 nodes, out variant)
       0.150514 ms.    2.37474%. aten::argmin (1 nodes, out variant)
```
P447422384

Test Plan:
Test with local replayer sending traffic to `ansha_perf_test_0819.test`, and compare outputs to jit interpreter.

Start compute tier:
```
RUN_UUID=ansha_perf_test_0819.test.storage JOB_EXPIRE_TIME=864000 MODEL_ID=290331537_4 PREDICTOR_TAG= PREDICTOR_VERSION=405 PREDICTOR_TYPE=CPU ADDITIONAL_FLAGS="--enable_disagg_file_split=true --enable_adx=false --load_remote_file_locally=true --pytorch_predictor_static_runtime_whitelist_by_id=290331537" GFLAGS_CONFIG_PATH=sigrid/predictor/gflags/predictor_gflags_ads_perf_cpu_pyper SMC_TIER_NAME=sigrid.predictor.perf.ansha_per_test_0819.test.storage CLUSTER=tsp_rva ENTITLEMENT_NAME=ads_ranking_infra_test_t6 PREDICTOR_LOCAL_DIRECTORY= ICET_CONFIG_PATH= NNPI_COMPILATION_CONFIG_FILE= NUM_TASKS=1 NNPI_NUM_WORKERS=0 tw job start /data/users/ansha/fbsource/fbcode/tupperware/config/admarket/sigrid/predictor/predictor_perf_canary.tw
```

Start nnpi tier:
```
RUN_UUID=ansha_perf_test_0819.test JOB_EXPIRE_TIME=247200 MODEL_ID=290331537_4 PREDICTOR_TAG= PREDICTOR_VERSION=343 PREDICTOR_TYPE=NNPI_TWSHARED ADDITIONAL_FLAGS="--torch_glow_min_fusion_group_size=30 --pytorch_storage_tier_replayer_sr_connection_options=overall_timeout:1000000,processing_timeout:1000000 --predictor_storage_smc_tier=sigrid.predictor.perf.ansha_perf_test_0819.test.storage --pytorch_predictor_static_runtime_whitelist_by_id=290331537" GFLAGS_CONFIG_PATH=sigrid/predictor/gflags/predictor_gflags_ads_perf_glow_nnpi_pyper_v1 SMC_TIER_NAME=sigrid.predictor.perf.ansha_perf_test_0819.test CLUSTER=tsp_rva ENTITLEMENT_NAME=ads_ranking_infra_test_t17 PREDICTOR_LOCAL_DIRECTORY= ICET_CONFIG_PATH= NNPI_COMPILATION_CONFIG_FILE= NUM_TASKS=1 NNPI_NUM_WORKERS=0 tw job start /data/users/ansha/fbsource/fbcode/tupperware/config/admarket/sigrid/predictor/predictor_perf_canary.tw
```

```buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.IndividualOps_Argmin --print-passing-details```

Compared outputs to jit interpreter to check for no differences greater than 1e-3 (with nnc on) https://www.internalfb.com/intern/diff/view-version/136824794/

Reviewed By: hlu1

Differential Revision: D30445635

fbshipit-source-id: 048de8867ac72f764132295d1ebfa843cde2fa27
2021-08-26 23:19:19 -07:00
Don Jang
fbe7133b58 [Static Runtime] Disable out variant of aten::clone (#63980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63980

The out variant implementation of `aten::clone` causes a crash, which needs further investigation. This change disables it until the problem gets fixed.

Note that `inline_cvr` doesn't use `aten::clone` as of now, so no perf implication: https://www.internalfb.com/phabricator/paste/view/P446858755?lines=121

Test Plan: N/A

Reviewed By: hlu1

Differential Revision: D30544149

fbshipit-source-id: facb334d67473f622b36862fbdb2633358556fdf
2021-08-26 08:10:13 -07:00
Raghavan Raman
dde07cad6f [Static Runtime] Added a variable for clamp in the NNC code for Logit. (#63839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63839

Replaced the use of a constant for clamp in the NNC code for Logit
with a variable. This makes it easier to enable caching for Logit.

There is no performance difference with this change, as shown in the micro-benchmarks below.

```
Logit NNC Benchmark      const-clamp (ns)   var-clamp (ns)
logit_nnc_sleef/64                    550             543
logit_nnc_sleef/512                  3514            3517
logit_nnc_sleef/8192                85537           82900
logit_nnc_sleef/32768              347635          337016
logit_nnc_fast/64                     173             167
logit_nnc_fast/512                    829             866
logit_nnc_fast/8192                 13286           13069
logit_nnc_fast/32768                51116           53429
logit_nnc_vml/64                      146             164
logit_nnc_vml/512                     773             783
logit_nnc_vml/8192                  11556           11563
logit_nnc_vml/32768                 44815           46720
```

Test Plan: SR unit tests and the inline_cvr model.

Reviewed By: bertmaher

Differential Revision: D30405466

fbshipit-source-id: adb891fdae5746439931ce5f43165291fec08f52
2021-08-25 11:19:55 -07:00
Raghavan Raman
a2399a76e1 [Static Runtime] Moved NNC operator definitions to separate files. (#63838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63838

Refactored NNC operator definitions code into separate files.

Made `TEWrapper` a class with a fixed set of methods and added separate definitions for them based on `TORCH_ENABLE_LLVM` to keep the same functionality as before.

Test Plan: Build and ran Static Runtime tests.

Reviewed By: hlu1

Differential Revision: D30405467

fbshipit-source-id: 606ef852bb820d5e23a0f8af1bf5dc122e90bceb
2021-08-25 11:18:32 -07:00
Mike Iovine
7774a4e95b [Static Runtime] Implement prim::VarStack out variant (#63579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579

Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`

Reviewed By: navahgar

Differential Revision: D30410525

fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
2021-08-24 09:44:29 -07:00
Mike Iovine
1385f9fb12 [JIT] Add variadic stack op (#63578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63578

Added a new op `prim::VarStack` and a pass that transforms instances of `aten::stack(list, dim)` into `prim::VarStack(list[0], ..., list[n], dim)`. Also provided a JIT interpreter implementation.

Most of the implementation/tests are the same as `prim::VarConcat`.

Test Plan: `buck test caffe2/test/cpp/jit:jit -- TestStackOpt`

Reviewed By: navahgar

Differential Revision: D30426232

fbshipit-source-id: 9829a7db6e0a5038c9b7528c43c25b0c221aa2ce
2021-08-24 08:20:54 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there is no classes using KernelArena for memory management we
can remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another commit in transition from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.

Differential Revision: D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Don Jang
84890aae35 [Static Runtime] Add an out variant op for aten::abs (#63675)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63675

This change adds an out variant implementation for `aten::abs`.

Test Plan:
- Observed `V0820 14:14:08.880342 101788 impl.cpp:1394] Switch to out variant for node: %3 : Tensor = aten::abs(%a.1)`

- Perf impact: TBD

Reviewed By: hlu1

Differential Revision: D30461317

fbshipit-source-id: 0c0230bd40afe463ae1ccb222c2a1207ebcf4191
2021-08-23 16:25:10 -07:00
Hao Lu
b2a601ffe5 [Static Runtime] Implement out variant for fb::quantized_linear (#63635)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63635

Reviewed By: ajyu

Differential Revision: D30446234

fbshipit-source-id: 1ef014186ff725930a97d0159626f9233ee74030
2021-08-20 21:42:22 -07:00
Mikhail Zolotukhin
1dc2b52764 [TensorExpr] Add a wrapper for all expr and stmt pointers. (#63195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195

This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.

The changes are mechanical and should not affect any functionality.

With this PR, we're changing the following (the wrapper mechanics are sketched below):
 * `Add*` --> `AddPtr`
 * `new Add(...)` --> `alloc<Add>(...)`
 * `dynamic_cast<Add*>` --> `to<Add>`
 * `static_cast<Add*>` --> `static_to<Add>`

Due to some complications with args forwarding, some places became more
verbose, e.g.:
 * `new Block({})` --> `new Block(std::vector<ExprPtr>())`
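
The wrapper's mechanics, as a stand-alone sketch (names are illustrative; the real definitions live in the tensorexpr headers):
```
#include <memory>
#include <utility>

// Stage 1: the alias is a raw pointer and alloc<> wraps `new`.
template <typename T>
using NodePtr = T*;

template <typename T, typename... Args>
NodePtr<T> alloc(Args&&... args) {
  return new T(std::forward<Args>(args)...);
}

// Stage 2 (later): flip both definitions without touching any call site.
// template <typename T> using NodePtr = std::shared_ptr<T>;
// template <typename T, typename... Args>
// NodePtr<T> alloc(Args&&... args) {
//   return std::make_shared<T>(std::forward<Args>(args)...);
// }
```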

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292779

Pulled By: ZolotukhinM

fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
2021-08-17 13:44:45 -07:00
Yukio Siraichi
32b6104f37 Port norm kernel to structured kernels. (#62711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62711

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D30109866

Pulled By: ezyang

fbshipit-source-id: 894c9496894d059c7690a174b75bbd4db7ed6016
2021-08-13 08:27:48 -07:00
Raghavan Raman
8b54b14f92 [Static Runtime] Added a cache for NNC generated code across different calls to the same ops (#62921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62921

Added a cache for NNC generated code across different calls to the same ops.

Before this diff:
```
ProcessedNode time 13402.9 ms
Static Module initialization took 30964.8 ms
```

After this diff:
```
ProcessedNode time 85.4195 ms
Static Module initialization took 4348.42 ms
```

There is one global cache for all the ops. It is guarded with a reader-writer lock. This is necessary because we could have multiple threads loading different models in parallel. Note that this locking does not guarantee that exactly one piece of code is generated for each op: more than one thread may generate code for the same op simultaneously, and all of them will update the cache in some order. But that should be a small number, bounded by the number of threads. Also, there is no correctness issue, since the generated code is always the same; the one generated by the last thread is retained in the cache and reused later while running the model.
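
A hedged sketch of this cache scheme (types and names are placeholders, not the actual SR code):
```
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

struct CompiledKernel {};  // stand-in for NNC-generated code
std::shared_ptr<CompiledKernel> compileKernel(const std::string& key);  // assumed

std::shared_mutex g_cache_mutex;
std::unordered_map<std::string, std::shared_ptr<CompiledKernel>> g_cache;

std::shared_ptr<CompiledKernel> lookupOrCompile(const std::string& key) {
  {
    std::shared_lock<std::shared_mutex> rlock(g_cache_mutex);
    auto it = g_cache.find(key);
    if (it != g_cache.end()) {
      return it->second;  // common case: read lock only
    }
  }
  // Several threads may compile the same op concurrently; benign, since
  // the generated code is identical. The last writer's entry is retained.
  auto kernel = compileKernel(key);
  std::unique_lock<std::shared_mutex> wlock(g_cache_mutex);
  g_cache[key] = kernel;
  return kernel;
}
```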

Test Plan: Tested inline_cvr model

Reviewed By: hlu1

Differential Revision: D30104017

fbshipit-source-id: 32e9af43d7e724ed54b661dfe58a73a14e443ff7
2021-08-09 09:30:07 -07:00
Yukio Siraichi
4c4c5b14e4 Port sum.dim_IntList kernel to structured kernels. (#61642)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61642

Tracking issue: #55070

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D29783865

Pulled By: ezyang

fbshipit-source-id: 375d4cd5f915812108367601a610a428762e606d
2021-08-09 08:46:16 -07:00
Hao Lu
a27a0b1ef5 [SR] Disable NNC temporarily (#62746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62746

Disable NNC temporarily until a code cache is implemented to reduce the compilation time.

Reviewed By: ajyu

Differential Revision: D30080326

fbshipit-source-id: ef8bb3ac3a6947614f4a03a3d52774b6933d3ea8
2021-08-04 17:33:07 -07:00
Raghavan Raman
7b6d569a2b [jit] Renamed prim::Concat as prim::VarConcat (#61983)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61983

Trial #2. The previous PR (https://github.com/pytorch/pytorch/pull/61498) was reverted because it caused a failure in `pytorch_linux_backward_compatibility_check_test`. Fixed now by adding an entry to the exception list in `check_backward_compatibility.py`.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D29828830

Pulled By: navahgar

fbshipit-source-id: 947a7b1622ff6e3e575c051b8f34a789e105bcee
2021-07-29 10:28:59 -07:00
Don Jang
68efa186cc [static runtime] Implement aten::full (#62227)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62227

Test Plan: Added `StaticRuntime.IndividualOps_Full` to cover the newly added code path.

Reviewed By: hlu1

Differential Revision: D29923649

fbshipit-source-id: 722950137c35ae325590a670b97f03b395e8eac3
2021-07-28 09:50:27 -07:00
Mike Iovine
e1bee3eb30 [Static Runtime] Add missing unit tests for static runtime ops (#62238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62238

Added tests for the following ops:

* `aten::mul`
* `aten::nan_to_num`
* `aten::stack`
* `aten::relu`
* `aten::tanh`

Reviewed By: hlu1

Differential Revision: D29914217

fbshipit-source-id: 6a6c39629310e7131127e24fdce7253ccdf80340
2021-07-27 14:12:21 -07:00
Mike Iovine
79eb8bb299 [Static Runtime] Enforce proper output dtype for many ops (re-land) (#62267)
Summary:
Re-land of D29935444.

We previously had lots of ops with implementations like this:
```
if (p_node->Output(0).isNone()) {
  p_node->Output(0) = create_empty_like(input_0);
}
...
auto& out = p_node->Output(0);
some_func_out(inputs, out);
```
This would make the output have the correct shape. But it would
also take the dtype of `input_0`, which is not always correct.

This change transforms these blocks to:
```
if (p_node->Output(0).isNone()) {
  p_node->Output(0) = some_func(inputs)
} else {
  ...
  auto& out = p_node->Output(0);
  some_func_out(inputs, out);
}
```
This gives the output the correct shape and dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62267

Reviewed By: ejguan

Differential Revision: D29937253

Pulled By: malfet

fbshipit-source-id: d91ca5d5703490d7d349a1de2ad3bb09b0c33967
2021-07-27 08:54:09 -07:00
Erjia Guan
a3be2ecc3a Revert D29887367: [Static Runtime] Enforce proper output dtype for many ops
Test Plan: revert-hammer

Differential Revision:
D29887367 (f4136c5efc)

Original commit changeset: cef04bfa52ec

fbshipit-source-id: 32e89f2b6381930559dd746b535904c3e90fd52b
2021-07-27 07:29:09 -07:00
Mike Iovine
f4136c5efc [Static Runtime] Enforce proper output dtype for many ops
Summary:
We previously had lots of ops with implementations like this:
```
if (p_node->Output(0).isNone()) {
  p_node->Output(0) = create_empty_like(input_0);
}
...
auto& out = p_node->Output(0);
some_func_out(inputs, out);
```
This would make the output have the correct shape. But it would
also take the dtype of `input_0`, which is not always correct.

This change transforms these blocks to:
```
if (p_node->Output(0).isNone()) {
  p_node->Output(0) = some_func(inputs)
} else {
  ...
  auto& out = p_node->Output(0);
  some_func_out(inputs, out);
}
```
This gives the output the correct shape and dtype.

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D29887367

fbshipit-source-id: cef04bfa52ec082ad3a9a32aa27c44e275c6b24c
2021-07-26 13:27:02 -07:00
Hao Lu
78f7d8ccfa [Static Runtime] Remove wrappers for aten::cat (#62067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62067

The wrapper for aten::cat is no longer needed after the variadic cat change in D29565344 (ae58a4c45d).
Also added a simple test for dynamic shapes, i.e., the input tensors in args2 are larger than those in args1.

Reviewed By: navahgar, mikeiovine

Differential Revision: D29864600

fbshipit-source-id: 44a712c2e776815c09e0bf5631412149b81274b2
2021-07-23 20:33:41 -07:00
Meghan Lele
1d2ea76afb clamp: port to structured kernel (#61361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61361

This PR ports the `clamp` kernel to the structured format. In addition, it introduces `OptionalScalarRef` as a replacement for `c10::optional<Scalar>&`. The latter, although it is a reference type, can still involve copying the contained `Scalar` (e.g., if the actual parameter is a `Scalar`, or if a `c10::optional<Scalar>` is constructed just to call a kernel). `OptionalScalarRef` contains only a `const Scalar&` and stores the flag for whether the instance contains something inside the `Scalar` itself, using a new tag.
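
A simplified stand-alone sketch of the idea (the real type goes further and avoids even the pointer by tagging inside `Scalar`):
```
#include <c10/core/Scalar.h>

// A non-owning optional reference to a Scalar: no Scalar copy is ever made.
class OptionalScalarRefSketch {
 public:
  OptionalScalarRefSketch() : ptr_(nullptr) {}
  /*implicit*/ OptionalScalarRefSketch(const c10::Scalar& s) : ptr_(&s) {}
  bool has_value() const { return ptr_ != nullptr; }
  const c10::Scalar& get() const { return *ptr_; }
 private:
  const c10::Scalar* ptr_;
};
```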

For more information, see #55070.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29821533

Pulled By: SplitInfinity

fbshipit-source-id: 88d55df5a4b2c14b68a57e4905d90eea1b088d99
2021-07-23 02:02:07 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As the GoogleTest `TEST` macro is non-compliant with it, as is `DEFINE_DISPATCH`.

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Raghavan Raman
ae58a4c45d [Static Runtime] Added a variadic cat operator (#61302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61302

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D29565344

Pulled By: navahgar

fbshipit-source-id: 96f5f4546ec0e61eb7f87e016e026e7b62576248
2021-07-21 15:58:20 -07:00
Mike Iovine
28150fd0c8 [static_runtime] Implement aten::linear (#61595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61595

Add out variant wrapper for `aten::linear` in the static runtime

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D29684236

fbshipit-source-id: 94df6d7267b3f269b2cadf065f207648777147df
2021-07-16 08:55:43 -07:00
Hao Lu
a07b08136f [Static Runtime] Check unsupported op when enabling static runtime (#61613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61613

Reviewed By: ajyu, movefast1990

Differential Revision: D29663466

fbshipit-source-id: d819903b7227f534c0a4fffa5eeea2b5c0c04750
2021-07-14 02:13:51 -07:00
Hao Lu
ccd0977060 [Static Runtime] Support prim::GetAttr/SetAttr (#61505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505

The handling of `self` in static runtime was previously incorrect. This diff fixes that issue, since `self` is essential to `prim::GetAttr`/`prim::SetAttr`. After all, most of the time we're getting and setting attributes on `self`, the TorchScript module.

Reviewed By: ajyu

Differential Revision: D29350173

fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
2021-07-10 14:06:06 -07:00
Don Jang
a74516d699 [static runtime] implement aten::log (#61393)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61393

Test Plan:
Added `StaticRuntime.IndividualOps_Log`

```
...
[ RUN      ] StaticRuntime.IndividualOps_Log
V0701 12:10:50.829100 3708165 impl.cpp:455] StaticModuleOptions: cleanup_activations 1, enable_out_variant 1, optimize_memory1, optimize_graph_output_memory0
V0701 12:10:50.888468 3708165 impl.cpp:1279] Switch to out variant for node: %3 : Tensor = aten::log(%inp.1)
V0701 12:10:50.889098 3708165 impl.cpp:1279] Switch to out variant for node: %a.1 : Tensor = aten::clone(%3, %2)
```

Reviewed By: hlu1

Differential Revision: D29511622

fbshipit-source-id: 819fd7d90c084609a060efeadb3015e35acac517
2021-07-08 18:25:35 -07:00
Don Jang
c2b0af2560 [static runtime] Implement aten::sign (#61154)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61154

Test Plan:
Added `StaticRuntime.IndividualOps_Sign`

```
[djang@devvm861.prn0 ~/local/fbsource/fbcode/caffe2] buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- -v 1
...
[ RUN      ] StaticRuntime.IndividualOps_Sign
V0701 12:05:31.836099 3679080 impl.cpp:455] StaticModuleOptions: cleanup_activations 1, enable_out_variant 1, optimize_memory1, optimize_graph_output_memory0
V0701 12:05:31.898192 3679080 impl.cpp:1279] Switch to out variant for node: %3 : Tensor = aten::sign(%input.1)
V0701 12:05:31.898849 3679080 impl.cpp:1279] Switch to out variant for node: %4 : Tensor = aten::clone(%3, %2)
```

Reviewed By: hlu1

Differential Revision: D29518603

fbshipit-source-id: e47b96d037fea639c41052f3849c82bbfa5f482a
2021-07-07 12:29:25 -07:00
Mike Guo
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
Hao Lu
46595a9623 [Static Runtime] Add gflag to disable nnc and caffe2 math library (#61090)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61090

Reviewed By: ajyu

Differential Revision: D29479860

fbshipit-source-id: 2b53405f41d319f074c75d8923d97fd6a45fee4b
2021-07-01 00:01:37 -07:00