Commit Graph

53 Commits

Author SHA1 Message Date
Yuanyuan Chen
36871622f1 [2/N] Mark unused parameters in C++ code (#165121)
This is a follow-up to #164912, marking unused C++ parameters to improve code readability.
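
A minimal sketch of the cleanup pattern (stand-in functions, not the PR's code):
```
#include <cstdio>

// Before: a named-but-unused parameter trips -Wunused-parameter.
// void on_event(int code, int context) { std::printf("%d\n", code); }

// After: the name is commented out ...
void on_event(int code, int /*context*/) { std::printf("%d\n", code); }

// ... or kept visible but marked with the C++17 attribute.
void on_event2(int code, [[maybe_unused]] int context) {
  std::printf("%d\n", code);
}

int main() {
  on_event(1, 0);
  on_event2(2, 0);
}
```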

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165121
Approved by: https://github.com/Skylion007
2025-10-15 03:04:39 +00:00
Jeddie Ji
20ec61a02f [BE] fix lint errors caused by const SROpFunctor fn (#154552)
Summary: Remove the const qualifier from the SROpFunctor parameter, as suggested by CLANGTIDY.
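
A minimal sketch of the lint class being fixed, using a stand-in signature rather than the real SROpFunctor:
```
// Stand-in for the real functor type; the point is the parameter declaration.
using OpFunctor = int (*)(int);

// Before (flagged by clang-tidy): top-level const on a by-value parameter
// is meaningless to callers.
// void register_op(const OpFunctor fn);

// After:
void register_op(OpFunctor fn) {
  if (fn) {
    fn(0);
  }
}

int main() {
  register_op(nullptr);
}
```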

Test Plan: arc lint -a -e extra --take CLANGTIDY caffe2/torch/fb/sparsenn/cpu_operators/to_dense_representation_cpu.cpp

Reviewed By: henryoier

Differential Revision: D75534056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154552
Approved by: https://github.com/Skylion007
2025-05-29 19:40:08 +00:00
cyy
45efa1aaa8 [3/N] Use internal linkage in C++ files (#151297)
Follows #151070.
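
The pattern in miniature (stand-in code, not from the PR):
```
// Helpers used by only one translation unit move into an anonymous
// namespace, gaining internal linkage and exporting no symbols.
namespace {
int double_it(int x) {
  return x * 2;
}
} // namespace

int visible_api(int x) {
  return double_it(x);
}

int main() {
  return visible_api(21) == 42 ? 0 : 1;
}
```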

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151297
Approved by: https://github.com/Skylion007
2025-05-05 17:48:39 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
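What the warning flags, in a stand-in example:
```
struct S {
  void f() {}    // fixed form: no semicolon after the body
  // void g() {}; // before: the stray ';' triggers -Wextra-semi
};

int main() {}
```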

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
PyTorch MergeBot
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
Richard Barnes
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.
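
For context, a sketch of why the codemod is a pure rename; the alias is shown approximately, not verbatim:
```
#include <optional>

// Roughly what c10/util/Optional.h provided at this point:
namespace c10 {
template <class T>
using optional = std::optional<T>;
} // namespace c10

int main() {
  c10::optional<int> a = 1; // old spelling
  std::optional<int> b = a; // new spelling; exactly the same type
  return (a == b) ? 0 : 1;
}
```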

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
Nikita Shulga
ad8aef0f98 [BE] [3/N] Use nested namespaces (#110314)
Mostly in torch/csrc/jit/runtime and in `ATen/cuda/`.
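
The rewrite in miniature (toy function, real namespace names):
```
// Before:
// namespace torch { namespace jit { namespace runtime {
// int answer() { return 42; }
// }}} // namespace torch::jit::runtime

// After: a single C++17 nested-namespace definition.
namespace torch::jit::runtime {
int answer() { return 42; }
} // namespace torch::jit::runtime

int main() {
  return torch::jit::runtime::answer() == 42 ? 0 : 1;
}
```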

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110314
Approved by: https://github.com/seemethere
2023-09-30 02:23:48 +00:00
Scott Wolchok
cc798f1a4f [PyTorch] add c10/util/FbcodeMaps.h (#96359)
Allows us to use folly maps in fbcode and std maps in OSS for compatibility, extending what static runtime is already doing.
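
A hedged sketch of such a switching header; the `FastMap` alias and the `FBCODE_CAFFE2` macro are assumptions, not verified against the PR:
```
#ifdef FBCODE_CAFFE2
#include <folly/container/F14Map.h>
namespace c10 {
template <class K, class V>
using FastMap = folly::F14FastMap<K, V>; // fast map inside fbcode
} // namespace c10
#else
#include <unordered_map>
namespace c10 {
template <class K, class V>
using FastMap = std::unordered_map<K, V>; // portable fallback in OSS
} // namespace c10
#endif

int main() {
  c10::FastMap<int, int> m;
  m[1] = 2;
  return m.at(1) == 2 ? 0 : 1;
}
```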

Differential Revision: [D43926670](https://our.internmc.facebook.com/intern/diff/D43926670/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96359
Approved by: https://github.com/ezyang
2023-03-10 02:18:16 +00:00
cyy
f27e09de04 Cleanup Windows warning suppression in CMake and fix some warnings in the source code (#94927)
This PR does two things:
1. It moves some Windows warning suppressions from various CMake files into the main CMakeLists.txt, following the conventions used for gcc and clang.
2. It fixes some Windows warnings in the source code. Most importantly, it fixes lots of DLL warnings by adjusting C10_API to TORCH_API or TORCH_PYTHON_API. Some DLL warnings remain because some TORCH_API functions are actually built as part of libtorch_python. A sketch of the fix pattern follows below.
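
A sketch of the DLL-linkage fix pattern; the macro definitions here are simplified stand-ins for the real ones in c10/macros/Export.h:
```
#if defined(_WIN32) && defined(BUILDING_LIBTORCH)
#define TORCH_API __declspec(dllexport)
#elif defined(_WIN32)
#define TORCH_API __declspec(dllimport)
#else
#define TORCH_API
#endif

// The export macro must match the library that defines the symbol, or MSVC
// warns about inconsistent dll linkage.
// Before: C10_API void defined_in_libtorch();
TORCH_API void defined_in_libtorch();

int main() {}
```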

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927
Approved by: https://github.com/malfet
2023-02-27 19:22:20 +00:00
Max Podkorytov
bf62ece536 [static-runtime] add schema checks to most of the ops where these checks are missing (#84163)
Test Plan: existing unit tests; some failing ones are also fixed along the way

Differential Revision: D39074902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84163
Approved by: https://github.com/mikeiovine
2022-09-01 17:21:22 +00:00
Mike Iovine
9e32cdeda6 [SR] Use at::DimVector in reshape_copy_out (#76473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76473

Avoid some extra heap allocations by using DimVector
ghstack-source-id: 155569314

Test Plan: Existing unit tests

Reviewed By: navahgar, huiguoo

Differential Revision: D35972439

fbshipit-source-id: 971998d6bcaaf9bb598772f1e2ca6b13f29f92a4
(cherry picked from commit f2b70c38fffe6355cd8b2f0eb36f299c0d50e5d8)
2022-05-05 17:31:54 +00:00
Don Jang
c62de0ac15 [Static Runtime] [Code Cleanup] Use SROperator for operators' function type (#73450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450

This change uses `SROperator` as the operators' function type.
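
For reference, the alias approximately as it looked in this era (reconstructed, not the PR diff):
```
#include <functional>

namespace torch::jit {
struct ProcessedNode; // static runtime's per-node wrapper, forward-declared

// An SR op receives the node wrapper and writes its outputs in place.
using SROperator = std::function<void(ProcessedNode*)>;
} // namespace torch::jit

int main() {}
```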

Test Plan: N/A

Reviewed By: mikeiovine

Differential Revision: D34483246

fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f
(cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)
2022-03-01 02:30:48 +00:00
Scott Wolchok
bf82d2012e [PyTorch] Add IValue::toDimVector & mostly replace toIntVector with it (#71247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247

Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
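
The idea behind DimVector, as a toy stand-in rather than the real `c10::SmallVector`:
```
#include <cstdint>
#include <cstdio>

// at::DimVector is a small vector with inline capacity of about five dims,
// so typical tensor shapes never touch the heap. Spill path omitted here.
constexpr unsigned kInlineDims = 5;

struct DimVectorSketch {
  int64_t inline_storage[kInlineDims]; // rank <= 5: no allocation
  unsigned size = 0;
  void push_back(int64_t d) { inline_storage[size++] = d; }
};

int main() {
  DimVectorSketch shape;
  shape.push_back(8);
  shape.push_back(3);
  shape.push_back(224);
  shape.push_back(224);
  std::printf("rank=%u\n", shape.size);
}
```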
ghstack-source-id: 146933314

Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good?

Reviewed By: mikeiovine

Differential Revision: D33556198

fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
2022-01-14 14:32:40 -08:00
Don Jang
c97dc9286d Revert D32780415: [Static Runtime] Move implementation details from impl.h into internal.h
Test Plan: revert-hammer

Differential Revision:
D32780415 (999e93e6a8)

Original commit changeset: 119b7aedbf56

fbshipit-source-id: 1aa777e8c1854ab27e86bc625188f7170097fac8
2021-12-04 19:44:07 -08:00
Don Jang
999e93e6a8 [Static Runtime] Move implementation details from impl.h into internal.h (#69274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69274

`impl.h` is the main header file that defines the interface of Static Runtime to its clients.

However, it is currently filled with implementation details that should not be leaked to our clients: 1) leaking our internals makes them hard to change later, and 2) it causes unnecessary merge conflicts when multiple people touch this enormous impl.cpp file.

To alleviate the situation, this change moves the implementation details from impl.h into a new file, internal.h, which is kept internal and does not leak the details to our clients.

This change will be followed by another change to rename `impl.h` to `runtime.h` (or something better), since `impl.h` is currently about SR's interface, not its implementation.

Note that this change is NOT complete since the remaining declarations in impl.h still contain a lot of implementation details. Therefore, we should keep working on minimizing the interface to prevent our API from being bloated unnecessarily. Also we need to work on modularizing our implementations into separate pieces organized by separate files in the near future.

Test Plan: Existing unittests

Reviewed By: donaldong

Differential Revision: D32780415

fbshipit-source-id: 119b7aedbf563b195641c5674572a9348732145f
2021-12-04 14:48:28 -08:00
Ansha Yu
7342b654a1 [static runtime] dequantize out variant (#68664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68664

Reland D32187063 (f120335643), fixing lint
Add out variant for aten::dequantize

Test Plan:
Test on inline_cvr model
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/294738512/294738512_0.predictor.disagg.local --recordio_inputs=/data/users/ansha/tmp/adfinder/294738512/294738512_0_local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=1 --iters=5 --warmup_iters=5 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Before:
0.047472 ms.   0.409729%. aten::dequantize (9 nodes)

After:
0.0307179 ms.   0.267204%. static_runtime::dequantize_copy (9 nodes, out variant)

Test on ctr_mbl_feed model 307210374 on 696 inputs

Before:
0.0569016 ms.   0.296647%. aten::dequantize (10 nodes)

After:
0.0423128 ms.   0.220481%. static_runtime::dequantize_copy (10 nodes, out variant)

Reviewed By: mikeiovine

Differential Revision: D32566429

fbshipit-source-id: b95dfc4c5e4115e083794093bc1571c7b1d72f5b
2021-11-30 09:03:26 -08:00
Alban Desmaison
748d9d2494 Revert D32187063: [static runtime] dequantize out variant
Test Plan: revert-hammer

Differential Revision:
D32187063 (f120335643)

Original commit changeset: 1fec6b74c7d3

fbshipit-source-id: 9770f8379e9ddba9e537fef0e66cc93c2caaf860
2021-11-18 10:12:31 -08:00
Ansha Yu
f120335643 [static runtime] dequantize out variant (#67873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67873

Add out variant for aten::dequantize

Test Plan:
Test on inline_cvr model
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/294738512/294738512_0.predictor.disagg.local --recordio_inputs=/data/users/ansha/tmp/adfinder/294738512/294738512_0_local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=1 --iters=5 --warmup_iters=5 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```

Before:
0.047472 ms.   0.409729%. aten::dequantize (9 nodes)

After:
0.0307179 ms.   0.267204%. static_runtime::dequantize_copy (9 nodes, out variant)

Reviewed By: hlu1

Differential Revision: D32187063

fbshipit-source-id: 1fec6b74c7d3f25d0f445775c4558d30c55dcece
2021-11-18 09:31:27 -08:00
Ansha Yu
01b30922dd [static runtime] fuse gather+to+lengths_to_offsets (#64075)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64075

Test Plan:
Before:
`I0826 17:17:54.165174 1064079 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.66724. Iters per second: 149.987`

After:
`I0826 17:13:07.464485 1040300 PyTorchPredictorBenchLib.cpp:313] PyTorch run finished. Milliseconds per iter: 6.46362. Iters per second: 154.712`

Profile after: P453143683

Accuracy tested by comparing against the jit interpreter; differences stay under 1e-3 (nnc ops turned on) https://www.internalfb.com/intern/diff/view-version/136824794/

======

With 100-request recordio inputs (211 inputs)

Before:
`I1101 12:43:13.558375 742187 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.7882. Iters per second: 84.8309`
After:
`I1101 13:50:41.087644 1126186 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 11.6763. Iters per second: 85.6438`

Profile after: P465977010
Constituent ops before (total is 0.5646):
```
       0.187392 ms.    1.61737%. fb::clip_ranges_gather (309 nodes, out variant)
       0.174101 ms.    1.50266%. fb::lengths_to_offsets (464 nodes, out variant)
       0.203126 ms.    1.75317%. static_runtime::to_copy (805 nodes, out variant)
```
Constituent ops after (total is 0.4985):
```
       0.376559 ms.    3.25614%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
      0.0614349 ms.   0.531235%. fb::lengths_to_offsets (159 nodes, out variant)
      0.0573315 ms.   0.495751%. static_runtime::to_copy (195 nodes, out variant)
     0.00325543 ms.  0.0281501%. fb::gather_ranges (4 nodes, out variant)
```

Compare with jit interpreter inside benchmark:
`I1101 13:55:53.013602 1149446 PtVsBlackBoxPredictorBenchLib.cpp:175] Finished comparing PT static runtime and jit interpreter results`

======

Casting on the fly:

a. Static runtime off
```
Static runtime ms per iter: 11.4658. Iters per second: 87.2159
0.220367 ms.    1.94726%. static_runtime::to_copy (805 nodes, out variant)
0.172585 ms.    1.52504%. fb::clip_ranges_gather (309 nodes, out variant)
0.157836 ms.    1.39471%. fb::lengths_to_offsets (464 nodes, out variant)
```

b. Casting on the fly, using explicit allocation+to_copy (which has the fast pass for certain cases, but we'll always call empty):
```
I1115 09:08:35.711972 1925508 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 11.6732. Iters per second: 85.6662

0.599439 ms.    5.25098%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.0552475 ms.   0.483958%. fb::lengths_to_offsets (159 nodes, out variant)
0.0576032 ms.   0.504593%. static_runtime::to_copy (195 nodes, out variant)
0.00299026 ms.  0.0261941%. fb::gather_ranges (4 nodes, out variant)
```

c. Casting on the fly with native::to (no explicit allocation, but no fast pass):
```
Static runtime ms per iter: 11.5627. Iters per second: 86.4849
0.454356 ms.     3.9652%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
0.06315 ms.   0.551115%. static_runtime::to_copy (195 nodes, out variant)
0.0590741 ms.   0.515544%. fb::lengths_to_offsets (159 nodes, out variant)
0.00359182 ms.   0.031346%. fb::clip_ranges_gather (4 nodes, out variant)
```

d. Removal of the to() call in question from the fusion pattern:
```
Static runtime ms per iter: 11.3658. Iters per second: 87.9836
 0.29591 ms.     2.6479%. fb::clip_ranges_to_gather_to_offsets (305 nodes, out variant)
 0.154612 ms.    1.38352%. static_runtime::to_copy (500 nodes, out variant)
0.0567151 ms.   0.507505%. fb::lengths_to_offsets (159 nodes, out variant)
0.0051115 ms.  0.0457394%. fb::clip_ranges_gather (4 nodes, out variant)
```

Reviewed By: hlu1

Differential Revision: D30515441

fbshipit-source-id: 53acee10619ac2be7dc8982e929e3210c4bb6d21
2021-11-17 00:49:31 -08:00
Mike Iovine
07c5cb8c48 [Static Runtime] Optimize memory planner initialization (#64101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64101

Checking `getOutOfPlaceOperation(n)` is a very expensive operation, especially in multithreaded environments, due to a lock acquisition when the NNC cache is queried. This slows down the memory planner initialization time, and by extension, the latency for the first static runtime inference.

There are two optimizations in this diff:
* Cache the result of `p_node->has_out_variant()` to avoid the call to `getOutOfPlaceOperation`; this speeds up calls to `canReuseInputOutputs`, which in turn speeds up `isOptimizableContainerType` (a sketch of the caching idea follows below).
* Precompute all `isOptimizableContainerType` results during static runtime initialization to avoid a pass over each node's inputs.
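
A minimal sketch of the caching idea, with stand-in types (not the PR's code):
```
#include <functional>

// The expensive query runs once at construction; later calls read a bool.
struct ProcessedNodeSketch {
  explicit ProcessedNodeSketch(const std::function<bool()>& expensive_query)
      : has_out_variant_(expensive_query()) {} // e.g. getOutOfPlaceOperation
  bool has_out_variant() const { return has_out_variant_; } // cheap thereafter
 private:
  bool has_out_variant_;
};

int main() {
  ProcessedNodeSketch n([] { return true; });
  return n.has_out_variant() ? 0 : 1;
}
```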

Test Plan: All unit tests pass: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: movefast1990

Differential Revision: D30595579

fbshipit-source-id: 70aaa7af9589c739c672788bf662f711731864f2
2021-08-27 17:40:43 -07:00
Rong Rong (AI Infra)
7f1b672b7a Revert D29952381: [Static Runtime] Ensure that unittests only use out variants or native ops
Test Plan: revert-hammer

Differential Revision:
D29952381 (8737e17af2)

Original commit changeset: e60e70b80ccf

fbshipit-source-id: 59dc2f920b7ceaf94ba8f5f36024e7cc710f6645
2021-08-04 14:25:11 -07:00
Don Jang
8737e17af2 [Static Runtime] Ensure that unittests only use out variants or native ops (#62335)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62335

This change ensures that unittests only use out variants or native ops.

- Our unittests currently assume that a graph fed to the static runtime correctly replaces an interpreter op for its corresponding out variant / native op, but it's not checked by the unittest. This change ensures that.

- We relied on manual inspection of log messages to see if an out variant is used for a specific workload even for unittesting. This change frees us from doing that.

- `aten::add` is excluded from this check since it's only enabled for an internal workload. Also, some unittests are excluded by using `expect_interpreter_op = true`, since they are written to use interpreter ops by design.

Test Plan: Ran `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest` successfully.

Reviewed By: mikeiovine, hlu1

Differential Revision: D29952381

fbshipit-source-id: e60e70b80ccf45e91c6654b4ad53f92ffd5ab702
2021-08-04 11:37:15 -07:00
Raghavan Raman
ae58a4c45d [Static Runtime] Added a variadic cat operator (#61302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61302

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D29565344

Pulled By: navahgar

fbshipit-source-id: 96f5f4546ec0e61eb7f87e016e026e7b62576248
2021-07-21 15:58:20 -07:00
Hao Lu
a07b08136f [Static Runtime] Check unsupported ops when enabling static runtime (#61613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61613

Reviewed By: ajyu, movefast1990

Differential Revision: D29663466

fbshipit-source-id: d819903b7227f534c0a4fffa5eeea2b5c0c04750
2021-07-14 02:13:51 -07:00
Hao Lu
ccd0977060 [Static Runtime] Support prim::GetAttr/SetAttr (#61505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505

The handling of `self` in static runtime was previously incorrect. This diff fixes that issue, since self is essential to prim::GetAttr/SetAttr; after all, most of the time we're getting and setting attributes on self, the TorchScript module.

Reviewed By: ajyu

Differential Revision: D29350173

fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
2021-07-10 14:06:06 -07:00
Hao Lu
2112074f25 [Static Runtime] Add schema check to several aten ops (#59603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59603

D28698997 (10345010f7) was reverted because I forgot to replace the
```
  VLOG(1) << "Found schema mismatch";
  n->schema().dump();
```
block in `aten::clamp_min` with `LogAndDumpSchema(n)`, and that led the bazel build to fail. I don't know why it breaks the bazel build, though.
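
For reference, an approximate reconstruction of the helper (not copied from the PR):
```
#include <c10/util/Logging.h>
#include <torch/csrc/jit/ir/ir.h>

// Centralizes the debug logging so call sites stay uniform.
void LogAndDumpSchema(const torch::jit::Node* n) {
  VLOG(1) << "Found schema mismatch for: " << n->schema();
}
```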

Test Plan: OSS CI.

Reviewed By: ajyu

Differential Revision: D28950177

fbshipit-source-id: 9bb1c6619e6b68415a3349f04933c2fcd24cc9a2
2021-06-10 23:39:00 -07:00
Rong Rong (AI Infra)
91eb831422 Revert D28698997: [Static Runtime] Add schema check to aten ops
Test Plan: revert-hammer

Differential Revision:
D28698997 (10345010f7)

Original commit changeset: 232fc60c0321

fbshipit-source-id: e351df62779fea85b7afe5160d3c40c4e7cee4ed
2021-06-05 07:48:49 -07:00
Hao Lu
10345010f7 [Static Runtime] Add schema check to aten ops (#59426)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59426

Reviewed By: ajyu

Differential Revision: D28698997

fbshipit-source-id: 232fc60c0321b8e68e4f1b6705233485260c281d
2021-06-04 21:38:45 -07:00
Hao Lu
c3d40fdf56 [ATen] Use expect_contiguous in layer_norm (#58067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58067

- Use expect_contiguous in layer_norm to avoid unnecessary refcount bumps when the tensors are contiguous
- Clean up some leftovers from the hacky wrappers removal cleanup: use c10::MaybeOwned<Tensor> for bias tensors
- Skip dispatcher for at::empty in the layer_norm impl in Static Runtime

Test Plan: CI

Reviewed By: swolchok

Differential Revision: D28214298

fbshipit-source-id: 73150fa62d5c18f41a2264f8e56bbe5e377ad045
2021-05-11 22:56:32 -07:00
Hao Lu
32acc96f78 [Static Runtime] Fix bug in aten::clone (#58100)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58100

aten::clone has a second arg, memory_format, which was not previously supported.

Reviewed By: ajyu

Differential Revision: D28347171

fbshipit-source-id: e083cc24c3228048429bba3497326415bc3d1f5a
2021-05-11 22:47:25 -07:00
Hao Lu
5439977352 [Static Runtime] Revamp op schema check (#57521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521

When an op is added to static runtime, we manually check the schema (not with the jit schema check, but with IValue.IsTensor()/IsInt() etc.) and make sure it's one we support. If the schema doesn't match, SR throws an exception with TORCH_CHECK, which makes the entire graph invalid for SR.

This diff makes ops with unsupported schemas use the fallback path and go through the dispatcher instead:

```
  if (node->kind() != prim::ListConstruct &&
      node->kind() != prim::TupleConstruct &&
      node->kind() != prim::DictConstruct && node->kind() != prim::ListUnpack) {
    const Operator& op = node->getOperator();
    TORCH_CHECK(op.hasOperation());
    op_ = op.getOperation(node);
    VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
  }
```

The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg forms are), can now run in static runtime in fallback mode.

(Note: this ignores all push blocking failures!)

Reviewed By: ajyu

Differential Revision: D27531447

fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
2021-05-04 02:48:04 -07:00
Edvard Ghazaryan
b3e1802439 Static runtime support for fb::expand_dims (#57282)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57282

Added support for fb::expand_dims for SR.

Test Plan:
buck test caffe2/torch/fb/sparsenn:gpu_test -- test_expand_dims

buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators

Reviewed By: hlu1

Differential Revision: D28043049

fbshipit-source-id: 01f59db7b507f027b220f044d6ff23602adbdb06
2021-04-29 22:40:56 -07:00
Hao Lu
33f206b865 [StaticRuntime] Replace StorageImpl with TensorImpl in MemoryPlanner (#56447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447

MemoryPlanner shouldn't manage StorageImpls; instead, it should manage the TensorImpls because the StorageImpl in Tensors can change.

Test Plan: CI

Reviewed By: ajyu

Differential Revision: D27840361

fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
2021-04-20 23:04:01 -07:00
Peng Wu
18662d4321 [Static runtime] refactor MemoryPlanner code to prepare for output tensor memory planning (#55809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55809

[Static runtime] refactor MemoryPlanner code to prepare for output tensor memory planning

Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'

Reviewed By: bwasti

Differential Revision: D27411416

fbshipit-source-id: 7dae7c2586ce3b4ebacf6169017140166c30e99c
2021-04-13 11:04:47 -07:00
Hao Lu
c3d0607ffa [Static Runtime] Make sure the copy version of the op exist in ReplaceWithCopy (#55337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55337

`static_runtime::permute_copy` is in fb-only folder. Because `caffe2/test/test_static_runtime.py` is in OSS, we can't load the fb-only operator library. The workaround is to check at runtime whether the op is registered or not.

Test Plan:
This fixed two of the broken tests:
```
    ✓ Pass: caffe2/test:static_runtime - test_multihead_attention_layer (test_static_runtime.TestStaticModule) (10.316)
    ✓ Pass: caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule) (16.134)
```

Reviewed By: ajyu

Differential Revision: D27577066

fbshipit-source-id: ac87dcde71f0d5140ccde448bb49aaebbbb5908a
2021-04-06 04:25:04 -07:00
Hao Lu
a8ecf306da [Static Runtime] Remove dead code (#53588)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53588

Remove `SRViewOperatorRegistry` and related code now that it's no longer needed.

Reviewed By: swolchok

Differential Revision: D26901367

fbshipit-source-id: fa73501cd785d4b89466cda81481aea892f8241f
2021-03-09 13:36:41 -08:00
Bram Wasti
56f8379802 [static runtime] Move all heavy constructor logic into InferenceModule (renamed to StaticModule) (#51564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51564

Constructor logic was spread throughout InferenceModule and StaticRuntime. This diff unifies the two. After a lot of discussion on D25961626, it became apparent that `clone` is uglier than a cheap StaticRuntime.

This means StaticRuntime is effectively StaticModule and the only code in the new StaticRuntime is the `run` functions.

```
graph, schema = PrepareForStaticModule(torchscript_module)
sm = StaticModule(graph, schema, options)
sm(inputs)
// or create many cheap runtimes with the module
sr = StaticRuntime(sm)
sr(inputs)
```

Changelist:
- Rename InferenceModule to StaticModule
- Move all logic for construction into StaticModule
- Create a new StaticRuntime that only has a unique memory planner (everything else is in StaticModule)
- Update comments with explanation
- Propagate all changes to predictor integration
- Propagate all changes to python integration
- Change semantics to be a bit more PyTorch-standard (no "run" calls, no "get_" getters).

Test Plan:
buck test //caffe2/test:static_runtime
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D25592967

fbshipit-source-id: 8233bed03137ce129137af2d44bce0095033ef0f
2021-03-05 10:15:26 -08:00
Hao Lu
63e0e88ccc [PyPer] More at::empty -> at::detail::empty_cpu (#53333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53333

- Add more variants to `create_empty_from` to take more args, like dtype/layout/device.
- Clean up stray at::empty uses, mostly in the out variants.

Reviewed By: ajyu

Differential Revision: D26799900

fbshipit-source-id: 6676d8043fead63208913ef3a28cabbae76e46bb
2021-03-05 00:16:51 -08:00
Hao Lu
248e8b42fa [Static Runtime] Use native version of at::empty (#53216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53216

- at::native::empty_cpu calls at::detail::empty_cpu without any changes to the arguments, so we can call at::detail::empty_cpu directly (sketched below).
- There is no need to create a TensorOptions object first, since we can get all the relevant information from the tensor directly.
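
The shortcut in a hedged sketch; the header and exact signature are approximate for this era of ATen:
```
#include <ATen/EmptyTensor.h>

// Go straight to the detail helper instead of dispatching through at::empty;
// no TensorOptions object is needed.
auto make_cpu_buffer(c10::IntArrayRef sizes, c10::ScalarType dtype) {
  return at::detail::empty_cpu(sizes, dtype);
}
```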

Reviewed By: bertmaher, swolchok

Differential Revision: D26792255

fbshipit-source-id: 7a4e368a19cea79e136e34dab854cb1d37dbeb58
2021-03-03 17:13:26 -08:00
Bram Wasti
d4e64dad15 [static runtime] Register both TupleConstruct and ListConstruct as out variants (#52684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52684

With alias analysis we get much more powerful registration and we can start removing "native" and fallback interpreted implementations.  `inputsOutOfPlace` is an artifact of the hardcoded "native" and lax fallback implementations.  Ideally every node will run out of place every time.  Afaik, there's never a reason to disable it and we may want to remove that functionality.

This diff does introduce a "leak" in the memory management: containers are not cleaned up. This only happens when out variants are enabled.

Test Plan: buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --run-disabled

Reviewed By: maratsubkhankulov, hlu1

Differential Revision: D26515801

fbshipit-source-id: 7391d66b9d36e15fc2955a5c34a04d027d18fe78
2021-03-02 09:55:25 -08:00
Hao Lu
11cda929fb [StaticRuntime] Fix bug in MemoryPlanner (#51342)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342

There is a subtle bug with the MemoryPlanner with regard to view ops with out variant.

```
  def forward(self, a: Tensor, shape: List[int]):
      b = a.reshape(shape)
      return b + b
```
In this case, if we replace reshape with the out variant, b is managed by the MemoryPlanner, and when opts.cleanup_activations is true its storage is set to nullptr right after inference. Because b is a view of a, the storage of a is also set to nullptr, which violates the API's promise that a is const.

To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.

Test Plan:
Add unit test to enforce the constness of inputs

```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: ajyu

Differential Revision: D26144203

fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
2021-01-29 21:16:02 -08:00
Hao Lu
d035d56bfb [StaticRuntime] Add out variant for reshape and flatten (#51249)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249

- Add out variants for reshape and flatten. reshape and flatten only create tensor views when they can; in cases where they can't, they copy. The out variant reuses the TensorImpl in both cases; the difference is that the TensorImpl is a view in the first case and a normal TensorImpl in the second.
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage, however, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.

Reviewed By: ajyu

Differential Revision: D25992202

fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
2021-01-27 22:44:11 -08:00
Bert Maher
aa3c28a29e [static runtime] Shortcut resize_({0})
Summary:
We do a lot of resize_({0}) to force `out` operators to properly
resize their results, and `resize_` does a fair bit of extraneous work
(e.g. trip through dispatch, checks for memory_format and named tensors, etc.).
If we strip it down to the bare minimum it's just setting the sizes to 0, so
let's do that directly.
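
A hedged sketch of the shortcut; the PR's actual change may differ:
```
#include <ATen/ATen.h>

// Bypass the dispatcher round trip of out.resize_({0}) and zero the sizes
// directly on the TensorImpl (API approximate for this era of ATen).
void fast_resize_to_zero(at::Tensor& out) {
  out.unsafeGetTensorImpl()->set_sizes_contiguous({0});
}
```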

Test Plan:
Perf results suggest maybe a 1% win:
```
batch 20: P163138256 (large win, 1.7%, mostly in fb_fc_out)
batch 1: P163139591 (smaller win, 0.88%, mostly in resize_)
```

Reviewed By: swolchok

Differential Revision: D25932595

fbshipit-source-id: d306a0a15c0e1be12fde4a7f149e3ed35665e3c0
2021-01-21 17:08:47 -08:00
Scott Wolchok
c6cb632c63 [PyTorch] Make SROpFunctor a raw function pointer (#50395)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50395

There's no need for these to be `std::function`.
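
The change in miniature, with stand-in types:
```
#include <functional>

struct Node;          // stand-ins for the JIT types
struct ProcessedNode;
using SROperator = std::function<void(ProcessedNode*)>;

// Before (illustrative): using SROpFunctor = std::function<SROperator(Node*)>;
// After: a raw function pointer, so no type erasure and no capture storage.
using SROpFunctor = SROperator (*)(Node*);

int main() {}
```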
ghstack-source-id: 119684828

Test Plan: CI

Reviewed By: hlu1

Differential Revision: D25874187

fbshipit-source-id: e9fa3fbc0dca1219ed13904ca704670ce24f7cc3
2021-01-13 15:51:14 -08:00
Bram Wasti
ace1680b68 [static runtime] Remove register concept by giving ownership to the nodes (#50050)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50050

Every node will now own its outputs.
I don't expect any big improvements perf-wise from this diff; the only eliminated code is from deallocate_registers.
Largely, this is to enable more optimizations going forward.

Test Plan:
buck test mode/dev //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/test:static_runtime

Reviewed By: hlu1

Differential Revision: D25571181

fbshipit-source-id: 91fcfbd5cd968af963ba89c45656997650ca6d18
2021-01-07 10:19:58 -08:00
Bram Wasti
274ce26fd8 [static runtime] Add Internal Ops to the registry (#48616)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48616

This adds a couple of _out variants and then registers them to the registry.

I also added the concept of "canReuse{Input,Output}" so that we can annotate tensors that are not optimizable (specifically, non-float tensors).

In the future we can change this (with D25062301).

After removing `RecordFunction`, we see these results:

```
BS=20
 ---
caffe2:           0.651617 ~ 0.666354
static runtime:   0.753481
pytorch:          0.866658

BS=1
 ---
caffe2:           0.0858684 ~ 0.08633
static runtime:   0.209897
pytorch:          0.232694
```

Test Plan: standard internal test of ads model against caffe2 reference (see the scripts in this quip: https://fb.quip.com/ztERAYjuzdlr)

Reviewed By: hlu1

Differential Revision: D25066823

fbshipit-source-id: 25ca181c62209a4c4304f7fe73832b13e314df80
2020-12-08 09:32:38 -08:00
Bram Wasti
286cdf3cda [static runtime] add static registry (#48258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48258

This will enable closed-source contributions.

Test Plan: buck test mode/no-gpu //caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D25031586

fbshipit-source-id: def859fa2fb4f01910b040242662a51b85804f01
2020-11-20 17:05:24 -08:00
Katy Voor
fe7d1d7d0e Add LeakyReLU operator to static runtime (#47798)
Summary:
- Add LeakyReLU operator to static runtime
- Add LeakyReLU benchmark
- Add LeakyReLU correctness test case

Static Runtime
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              4092 ns       4092 ns     172331
BM_leaky_relu/8                              4425 ns       4425 ns     158434
BM_leaky_relu/20                             4830 ns       4830 ns     145335
BM_leaky_relu_const/1                        3545 ns       3545 ns     198054
BM_leaky_relu_const/8                        3825 ns       3825 ns     183074
BM_leaky_relu_const/20                       4222 ns       4222 ns     165999
```

Interpreter
```
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1                              7183 ns       7182 ns      96377
BM_leaky_relu/8                              7580 ns       7580 ns      91588
BM_leaky_relu/20                             8066 ns       8066 ns      87183
BM_leaky_relu_const/1                        6466 ns       6466 ns     107925
BM_leaky_relu_const/8                        7063 ns       7063 ns      98768
BM_leaky_relu_const/20                       7380 ns       7380 ns      94564
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47798

Reviewed By: ezyang

Differential Revision: D24927043

Pulled By: kavoor

fbshipit-source-id: 69b12cc57f725f1dc8d68635788813710a74dc2b
2020-11-13 22:05:52 -08:00