Commit Graph

222 Commits

mikeiovine
2ae3c59e4b [SR] Remove linear/relu fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77620

Apparently, this is not implemented in fbgemm, so it's strictly worse than using NNC.

Differential Revision: [D36431811](https://our.internmc.facebook.com/intern/diff/D36431811/)

Approved by: https://github.com/hlu1
2022-05-23 21:46:27 +00:00
Hao Lu
c60d2ef4eb [StaticRuntime] Replace Permute with copy version only when it's followed by reshape or flatten (#77832)
Reviewed By: mikeiovine

Differential Revision: D36466622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77832
Approved by: https://github.com/mikeiovine
2022-05-20 03:14:01 +00:00
mikeiovine
02713221e3 [SR] Fuse clamp/nan_to_num
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094

Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speedup on many models. We can avoid comparisons, since clamp potentially gets rid of all of the `inf`s in the input tensor.
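As a rough illustration (a pure-Python sketch of the semantics, not the actual NNC kernel), clamping with finite bounds first means the subsequent `nan_to_num` only has NaNs left to handle:

```python
import math

def clamp(v, lo, hi):
    # NaN propagates through clamp; +/-inf is mapped to the finite bounds
    if math.isnan(v):
        return v
    return max(lo, min(hi, v))

def nan_to_num(v):
    # after a clamp with finite bounds, only the NaN check is still needed:
    # the isinf comparisons can never fire
    return 0.0 if math.isnan(v) else v

xs = [math.inf, -math.inf, math.nan, 0.5]
print([nan_to_num(clamp(v, -1.0, 1.0)) for v in xs])  # [1.0, -1.0, 0.0, 0.5]
```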

Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)!

Approved by: https://github.com/navahgar
2022-05-10 23:33:59 +00:00
Scott Wolchok
52af4fc5ba [PyTorch] Make RecordFunction store inputs as ArrayRef (#72484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72484

Stepping stone toward stack-allocating array of inputs.

Funnily enough, this seems to improve performance too.
ghstack-source-id: 155492056

Test Plan:
1) CI
2) framework overhead benchmark with --stressTestRecordFunction --captureRecordFunctionInputs goes from 0.76 usec/iter to 0.72.

Reviewed By: chaekit, robieta

Differential Revision: D34061169

fbshipit-source-id: 073fedf1d3d162f927c4e9867cfda7dbfabba215
(cherry picked from commit dae77cf1cd8813d902d73999ad97133a3ef8e291)
2022-05-05 21:38:42 +00:00
Mike Iovine
fc64dbdc01 [SR] Fuse quantized linear/relu (#75775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775

fbgemm already implements the fused kernel, so there's no reason not to use it
ghstack-source-id: 155450342

Test Plan: New unit tests

Reviewed By: navahgar

Differential Revision: D35633297

fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48
(cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)
2022-05-04 19:01:14 +00:00
Mike Iovine
b02b3f25db [SR] Quick hack to eliminate no-op slice (#75774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774

`list[0:]` is a no-op. This should really be eliminated on the modeling side; implement it as a graph pass for now until we can get that fix into prod models.
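A minimal sketch of the pattern check (the helper name is hypothetical, not Static Runtime's actual API):

```python
def is_noop_list_slice(start, end, step, length):
    # hypothetical helper: l[start:end:step] copies the whole list
    # unchanged when it covers it in order with unit step
    if step not in (None, 1):
        return False
    s = 0 if start is None else start
    e = length if end is None else end
    return s == 0 and e >= length

lst = [1, 2, 3]
assert lst[0:] == lst                                # the slice is a no-op
assert is_noop_list_slice(0, None, None, len(lst))   # l[0:] -> eliminate
assert not is_noop_list_slice(1, None, None, len(lst))  # l[1:] -> keep
```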

Test Plan: New unit tests

Reviewed By: navahgar

Differential Revision: D35632947

fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d
(cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)
2022-05-03 19:29:46 +00:00
Ansha Yu
dcdc7b2ffc [RF][scuba] add pytorch_operator_stats column for Static Runtime out variant (#76566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76566

Add static_runtime_out_variant field to pytorch_operator_stats scuba.

Add field for static_runtime_out_variant to RecordFunction.

Test Plan:
`ptail perfpipe_pytorch_operator_stats_dev | grep devbig371`
No out variant, SR on: P498206546
Out variant: P498206634

Check column shows up in scuba: https://fburl.com/scuba/pytorch_operator_stats_dev/tfgmth1t

CMF 4M test https://www.internalfb.com/intern/servicelab/802987274/
ICVR 4M https://www.internalfb.com/intern/servicelab/802987272/

AF prod canary
https://our.intern.facebook.com/intern/ads/canary/443234131523314631

Reviewed By: mikeiovine

Differential Revision: D36016857

fbshipit-source-id: f3af315d1d2b0d8b147be76df63daa8ab872bf8e
(cherry picked from commit 208db7cd15fb3e1be66db2d1834eebaf0964d017)
2022-05-03 17:20:28 +00:00
Ansha Yu
5f98145fd9 [scuba] log to pytorch_model_stats when we've tried and failed to enable static runtime
Differential Revision: D35848179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76240
Approved by: https://github.com/huiguoo, https://github.com/osalpekar
2022-05-02 18:25:39 +00:00
Ansha Yu
ee636e2fd1 [sr] remove max_indices argument of embedding_bag when unnecessary (#75993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993

Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0

More details in https://fb.quip.com/MKumAjz1YD4a#temp:C:FPD3e5a0871ae5d481286b511ef7

The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* The max_indices output isn't needed to compute the main output, so remove it when it's unused in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove it for now.
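The rules above can be sketched as follows (a hypothetical simplification; the names are illustrative, not the actual pass):

```python
# toy model of the output-pruning decision for embedding_bag
EB_OUTPUTS = ["output", "offset2bag", "bag_size", "max_indices"]

def outputs_to_keep(used_in_graph):
    # offset2bag and bag_size are intermediates for the main output,
    # so they are kept; max_indices is kept only if the graph reads it
    keep = {"output", "offset2bag", "bag_size"}
    if "max_indices" in used_in_graph:
        keep.add("max_indices")
    return [o for o in EB_OUTPUTS if o in keep]

assert outputs_to_keep(set()) == ["output", "offset2bag", "bag_size"]
assert outputs_to_keep({"max_indices"}) == EB_OUTPUTS
```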

Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`

Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`

Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
        0.11156 ms.    99.0457%. aten::embedding_bag (52 nodes, out variant)

After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
      0.0789221 ms.    98.7096%. static_runtime::embedding_bag (52 nodes, out variant)

* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/

Reviewed By: mikeiovine

Differential Revision: D35726594

fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
2022-04-22 15:36:35 +00:00
Nikita Shulga
f6c275f55d Remove -Wno-unused-variable from utils.cmake (take 2) (#75538)
Summary:
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims it was added for consistency with the top-level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.

Fix violations added in the interim in 50+ files, either by removing unused variables or by decorating the code with `C10_UNUSED` when a local variable is likely used to extend an object's lifetime until the end of the block.

The suppressed warnings caused a preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538

Reviewed By: anjali411

Differential Revision: D35747333

Pulled By: malfet

fbshipit-source-id: 3fc5828e44a4c05ba0e89e92613e6ebbdb260626
(cherry picked from commit c179fba21cfa2a0093fad50ccad5a22dd7cff52c)
2022-04-20 17:41:59 +00:00
Taylor Robie
a5e338a826 [RecordFunction] More efficient machinery to determine which callbacks to run. (#75807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75807

There is a tension in RecordFunction between two use cases:
1) In the normal eager path we don't run any callbacks, so we need to bail out of the profiling path as soon as possible to minimize eager overhead.
2) When profiling we want to determine which callbacks to run as efficiently as possible to minimize instrumentation overhead.

The confounding factor in all of this is sampling callbacks, because they change which callbacks will run on each call, even in steady-state operation. This has traditionally been handled with a two-stage procedure: first we flip a coin to determine whether a sampled callback *might* run. If false (which it usually is), we do nothing; this solves (1). If true, we check whether we actually need to build the full callback set or it was a false positive. This procedure has two negative effects:
* It forces us to rebuild the set of callbacks to run on every step when profiling.
* It leaks the sampling abstraction, requiring other parts of the code to bump certain values and forcing RecordFunction to initialize lazily.

This change introduces a multi-level cache which can (in the common case) quickly determine which callbacks *will* run, rather than if callbacks *might* run. This means that rather than call `shouldRunRecordFunction`, we can simply get the callbacks for an invocation and check if they are empty. (And completely removes the pre-sampling heuristic.) Another major benefit of the new cache structure is that it allows thread-safe registration and unregistration of global callbacks.

It's worth briefly discussing how this maintains eager performance. In the standard eager case (only sampling callbacks registered) the cache first checks that the global callbacks haven't changed (atomic read), decrements a counter to see if a sampling callback fired, and then returns the active callbacks, which is simply a SmallVector of pointer pairs and a couple of POD values (scope, needs inputs/outputs/ids). The biggest cost according to perf is the SmallVector logic; we could consider adopting a hard limit on active callbacks, since more than half a dozen callbacks *running* in a single step would be quite a lot. But the total cost relative to `PYTORCH_DISABLE_PER_OP_PROFILING` is only ~10ns, so it's debatable whether switching to `std::array` is worth it.

The primary change is in `record_function.cpp`, which has a more detailed description of the new cache structure. `record_function.h` has some minor changes to align with the new calling convention and the remaining files are simply changes to the call sites.
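The versioned-cache idea can be sketched in a few lines (a toy Python model with hypothetical names, not the actual C++ implementation in `record_function.cpp`):

```python
class CallbackRegistry:
    # toy model: every (un)registration of a global callback bumps a version
    def __init__(self):
        self.version = 0
        self.callbacks = []

    def register(self, cb):
        self.callbacks.append(cb)
        self.version += 1

class CallbackCache:
    # hot path: a single integer compare revalidates the cached snapshot,
    # so the callback set is rebuilt only when registration changed it
    def __init__(self, registry):
        self.registry = registry
        self.cached_version = -1
        self.snapshot = []

    def active_callbacks(self):
        if self.cached_version != self.registry.version:
            self.snapshot = list(self.registry.callbacks)
            self.cached_version = self.registry.version
        return self.snapshot

reg = CallbackRegistry()
cache = CallbackCache(reg)
assert cache.active_callbacks() == []      # nothing registered: bail out fast
reg.register("profiler_cb")
assert cache.active_callbacks() == ["profiler_cb"]  # rebuilt once, then cached
```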

Future work:
  * RecordFunction no longer needs to be lazily initialized.
  * We can deprecate the disable/re-enable APIs, since we can now safely add and remove global callbacks.

Test Plan:
I tested eager mode performance using the overhead benchmark and found that the non-profiled path was unaffected. However the no-op observer dropped from 0.41us to 0.37us (0.25us if no observers are active) which is about 1/3rd reduction in the cost of the callback selection machinery.

I also added several C++ unit tests, as the core RecordFunction machinery (especially sampling) was largely untested.

Reviewed By: swolchok, davidberard98

Differential Revision: D35276158

fbshipit-source-id: 35135f444724fba4eb97c0ae7f3f710f0f9016fd
(cherry picked from commit 9e359b87422c18f2a195185f32e7e85c82f956fd)
2022-04-19 20:46:16 +00:00
PyTorch MergeBot
5c56b2286b Revert "Remove -Wno-unused-variable from utils.cmake"
This reverts commit 018cbe1f5c.

Reverted https://github.com/pytorch/pytorch/pull/75538 on behalf of https://github.com/seemethere
2022-04-19 17:19:09 +00:00
Nikita Shulga
018cbe1f5c Remove -Wno-unused-variable from utils.cmake
[Comment](https://github.com/pytorch/pytorch/pull/62445/files#r680132022) claims it was added for consistency with the top-level CMakeLists.txt, but `-Wno-unused-variable` is not mentioned there.

Fix violations added in the interim in 50+ files, either by removing unused variables or by decorating the code with `C10_UNUSED` when a local variable is likely used to extend an object's lifetime until the end of the block.

The suppressed warnings caused a preventable revert in https://github.com/pytorch/pytorch/pull/72633#issuecomment-1092300787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75538
Approved by: https://github.com/cpuhrsch
2022-04-19 15:26:55 +00:00
Mike Iovine
b7682d351a [SR] Refactor memory planner to prepare for new algorithm (#74730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74730

Motivation: I am working on implementing a new, more efficient memory planning algorithm. This algorithm cannot replace the old one entirely, because it can only be practically done for models that have sample inputs to warm up with. We need a way to make the memory planner's strategy extensible.

My first pass attempt at implementing the new algorithm crammed everything into the same class, but it became a nightmare to manage (a ton of `if (use_new_strategy)` statements everywhere). Additionally, it was a little clumsy since there are some concepts that make sense for one algorithm but not the other (like `StorageGroup`).

It's much cleaner if we instead turn `MemoryPlanner` into an abstract base class and have different subclasses implement their strategies in `allocateManagedTensors` and `deallocateManagedTensors`.
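The proposed structure can be sketched as follows (a toy Python model; the real classes are C++, and the method names here mirror the ones mentioned above):

```python
from abc import ABC, abstractmethod

class MemoryPlanner(ABC):
    # each planning strategy subclasses the planner instead of
    # branching on an if (use_new_strategy) flag everywhere
    @abstractmethod
    def allocate_managed_tensors(self):
        ...

    @abstractmethod
    def deallocate_managed_tensors(self):
        ...

class StorageGroupPlanner(MemoryPlanner):
    # the existing strategy keeps its StorageGroup concept to itself
    def allocate_managed_tensors(self):
        return "allocate via storage groups"

    def deallocate_managed_tensors(self):
        return "deallocate via storage groups"

planner = StorageGroupPlanner()
assert planner.allocate_managed_tensors() == "allocate via storage groups"
```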
ghstack-source-id: 153288210

Test Plan: Existing unit tests

Reviewed By: navahgar, hlu1

Differential Revision: D35132124

fbshipit-source-id: c5ef5ae6361b44dedf97090201e244a76e1e6bce
(cherry picked from commit c96f6827c8db88f28c4eb379865ad208beae2034)
2022-04-07 22:11:19 +00:00
Mike Iovine
2f98fa9147 [SR] Do not manage tensors that escape scope via container (#74966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966

It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this:

```
def forward(self, cond: bool, a, b):
    lst = []
    if cond:
        res = a + b # res should not be managed!!!
        lst.append(res)
    return lst
```

The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`.

The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`.

This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is no longer true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy.
ghstack-source-id: 153288188

Test Plan: New unit tests cover the problematic case

Reviewed By: navahgar

Differential Revision: D35257087

fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7
(cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)
2022-04-07 19:57:57 +00:00
Mike Iovine
4055d1f653 [SR] Fix StaticRuntime move ctor (#74927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927

The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated.

Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue.
ghstack-source-id: 153168602

Test Plan: New unit test/CI

Reviewed By: navahgar

Differential Revision: D35228467

fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3
(cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)
2022-04-07 02:16:37 +00:00
Mike Iovine
2ca66ffb7d [SR] Force split_and_squeeze usage via graph transformation (#74274)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74274

Reviewed By: navahgar

Differential Revision: D34913889

fbshipit-source-id: 655d3f1e5f4c027cb94758b74826a4b4882e9458
(cherry picked from commit bc94d30b69888ca6633a27090a3b87a08919231a)
2022-03-29 19:13:40 +00:00
Mike Iovine
f5a9c36d0b [SR] Eliminate extra permute ops before aten::sum (#74481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481

This diff fixes an interesting performance issue related to `permute_copy`.

We see this pattern frequently:
```
y = torch.permute(x, (0, 2, 1))
z = torch.sum(y, dim=-1)
```

With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589

But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597

But the permute is actually unnecessary; we can statically transform the graph into the following to ensure that the fast kernel is hit even with copy variants on:
```
z = torch.sum(x, dim=1)
```
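The equivalence behind the rewrite can be checked without torch on a tiny (1, 3, 2) input, using nested lists:

```python
# summing the last dim of x.permute(0, 2, 1) equals summing dim 1 of x
x = [[[1, 2], [3, 4], [5, 6]]]  # shape (1, 3, 2)

# permute to (1, 2, 3)
permuted = [[[x[i][k][j] for k in range(3)] for j in range(2)] for i in range(1)]
sum_last = [[sum(row) for row in mat] for mat in permuted]  # sum over dim -1
sum_dim1 = [[sum(x[i][k][j] for k in range(3)) for j in range(2)]
            for i in range(1)]                              # sum over dim 1

assert sum_last == sum_dim1 == [[9, 12]]
```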
ghstack-source-id: 152003888

Reviewed By: navahgar

Differential Revision: D34992319

fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f
(cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)
2022-03-23 23:00:14 +00:00
Mike Iovine
f14a0be302 [SR] Avoid allocating rstd/mean in layer_norm (#73606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606

The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116

Test Plan: Existing unit tests

Reviewed By: hlu1

Differential Revision: D34562131

fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
2022-03-15 22:07:11 +00:00
Mike Iovine
818bf361b6 [SR] Fix a kwargs API default value bug (#73681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681

Static runtime is rejecting legal calls made with the kwargs API when there are parameters with default values.
ghstack-source-id: 150433627

Test Plan: Added unit test to cover this case

Reviewed By: navahgar, d1jang

Differential Revision: D34588804

fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755
(cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)
2022-03-03 22:31:37 +00:00
Don Jang
bbc59ff2bf [Static Runtime] Introduce StaticNodeInfo to store ProcessedNode's data independent from runtime instances (#73536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536

Currently the `ProcessedNode` class assumes two distinct roles that are not too obvious:

1) a "template" that contains the metadata of a node actually executable by the runtime, owned by `StaticModule`

2) fully instantiated nodes that are owned by `StaticRuntime`.

We currently merge these two use cases into one class, which can be error-prone if illegal copying happens uncontrollably. Currently, we only copy objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created.

To address this issue, this change introduces a separate class, `StaticNodeInfo`, to distinguish the aforementioned two use cases in the code more clearly. With this, `StaticNodeInfo` is for (1) and `ProcessedNode` is for (2).

Test Plan: Existing tests

Reviewed By: mikeiovine

Differential Revision: D33985600

fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c
(cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)
2022-03-02 22:33:32 +00:00
Mike Iovine
74fe57079f [SR] Add new fb::split_and_squeeze op (#73252)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73252

Test Plan: New unit tests

Reviewed By: d1jang

Differential Revision: D34399345

fbshipit-source-id: 80cbf3e1556f2152576a07d1eeb3bb6ef97096cf
(cherry picked from commit 259956749dad88f0960eecb60135cd466f1b56f4)
2022-03-02 19:31:42 +00:00
Raghavan Raman
2724e4c039 [Static Runtime] Do not replace with copy variants if TE fuser is enabled (#72946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72946

The passes that replace ops with copy variants run after TensorExpr fusion. Because of this, the resulting graph does not conform to the assumptions made in the fuser.

So, even if the `use_copy_variants` and `use_maybe_copy_variants` flags are turned on, the corresponding passes will not be executed when TensorExpr fusion is enabled.

ghstack-source-id: 149429753

Test Plan: Tested locally.

Reviewed By: mikeiovine

Differential Revision: D34283842

fbshipit-source-id: 74edea517a00c85dff0319f9c8b3ac8befe09018
(cherry picked from commit 3798af7f1b)
2022-02-18 18:34:50 +00:00
Don Jang
39fb771423 [Static Runtime] Report static op statistics from graph when input size is zero (#73032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73032

Currently, ptvsc2_predictor_bench reports nothing when the input size is zero. However, Static Runtime's module creation produces some useful information even when only a model has been loaded.

This change reports static op statistics when the given input's size is zero. In addition, it reports the out-variant coverage percentage, which is crucial for establishing the baseline performance of Static Runtime.

Test Plan: - Ran `ptvsc2_predictor_bench` with this change as seen above.

Reviewed By: mikeiovine

Differential Revision: D34294803

fbshipit-source-id: 80c02199075dae9280657d6edecc7c679c1c27f4
(cherry picked from commit 83aec141a2)
2022-02-17 23:58:32 +00:00
Mike Iovine
d1c5f9e439 [JIT][SR] Introduce prim::IfThenElse (#72587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587

This pattern frequently appears in a few graphs:

```
%result = prim::If(%condition)
  block0():
    -> (%a)
  block1():
    -> (%b)
```

This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.

This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:

```
%result = prim::IfThenElse(%condition, %a, %b)
```
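The rewrite can be sketched on a toy IR (tuples standing in for JIT nodes; this is not the actual pass):

```python
def rewrite_trivial_if(node):
    # toy IR: ("If", cond, then_block, else_block); a block that is just a
    # value name forwards it unchanged, so the node collapses to a ternary
    op, cond, then_b, else_b = node
    if op == "If" and isinstance(then_b, str) and isinstance(else_b, str):
        return ("IfThenElse", cond, then_b, else_b)
    return node

assert rewrite_trivial_if(("If", "%condition", "%a", "%b")) == \
    ("IfThenElse", "%condition", "%a", "%b")
```

Blocks with real statements are left alone, since only the trivial value-forwarding shape is safe to collapse.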

Test Plan: New unit tests

Reviewed By: eellison

Differential Revision: D34091789

fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b)
2022-02-17 18:22:48 +00:00
Don Jang
5ea74b4996 [Static Runtime] Remove ProcessedNode::num_outputs_ (#72592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72592

Only code paths that are not perf-critical read `ProcessedNode::num_outputs_`, and it is a static property of the op that the `ProcessedNode` instance is executing.

Therefore, it's better to move `ProcessedNode::num_outputs_` into `ProcessedFunction::num_outputs_` and let `ProcessedNode` access it via `ProcessedNode::fn_` for its occasional uses. Note that this avoids duplicating `num_outputs_` per node and per Static Runtime instance, since `ProcessedFunction` instances are shared across all runtime instances.

Local instrumentation confirms that this change reduces `sizeof(ProcessedNode)` by 14%:

- Before: sizeof(ProcessedNode): 56
- After: sizeof(ProcessedNode): 48
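The sharing idea is essentially the flyweight pattern, sketched here as a toy Python model (the real classes are C++):

```python
class ProcessedFunction:
    # per-op metadata shared by every runtime instance; num_outputs is a
    # static property of the op, so one copy suffices
    def __init__(self, num_outputs):
        self.num_outputs = num_outputs

class ProcessedNode:
    # each node keeps only a reference to the shared function, so moving
    # num_outputs out of the node shrinks the per-node footprint
    def __init__(self, fn):
        self.fn = fn

    def num_outputs(self):
        return self.fn.num_outputs

shared = ProcessedFunction(num_outputs=2)
nodes = [ProcessedNode(shared) for _ in range(1000)]  # one metadata copy
assert all(n.num_outputs() == 2 for n in nodes)
```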

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: mikeiovine

Differential Revision: D33984792

fbshipit-source-id: e29ffc97b799e679215f42e1e85cd3fcd7e88983
(cherry picked from commit 0f7003f4df)
2022-02-17 05:09:17 +00:00
Mike Iovine
d51d2bd608 [SR] Add a flag to disable copy variants (#71102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71102

This graph pass is causing a major perf regression on some models. Ideally we would introduce maybe_copy variants for all of these ops, but since those are tricky to write, I've introduced a flag to just turn the pass off for now.
ghstack-source-id: 148541673

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: navahgar

Differential Revision: D33510080

fbshipit-source-id: bb4847f26561197ea5e6bbad0a4d25db4ef468eb
(cherry picked from commit 8f333d3e81)
2022-02-08 02:43:07 +00:00
Mike Iovine
cff5e22a72 [SR] Relax aten::__is__ constraint for SR enablement (#71807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807

There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in

```
def forward(x):
    y = x.detach()
    # Should be false, but we get True
    # after our EliminateNoOps pass
    return x is y
```

Test Plan: New unit test covers this case

Reviewed By: d1jang

Differential Revision: D33783668

fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4
(cherry picked from commit 019588f4ca)
2022-02-03 12:18:46 +00:00
Mike Iovine
2d5296b0e7 [SR] Implement prim::Loop (#69838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838

Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483

Test Plan: New unit tests: `buck test caffe2/benchmark/static_runtime/...`

Reviewed By: d1jang

Differential Revision: D33049595

fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af)
2022-02-02 19:30:50 +00:00
Mike Iovine
2aa699505d [SR] Implement prim::If (#69837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837

Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475

Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`

Accuracy test at top of stack

Reviewed By: d1jang

Differential Revision: D33045908

fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc08)
2022-02-02 19:30:50 +00:00
Mike Iovine
d2599701fd [SR] Force sub-blocks to return at least one output (#69836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836

It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned.

Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`.
ghstack-source-id: 148186453

Test Plan: Sub-blocks with no return values tested at top of stack

Reviewed By: d1jang

Differential Revision: D33050420

fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149
(cherry picked from commit c97fd07bf5)
2022-02-02 19:30:50 +00:00
Mike Iovine
238dded10f [SR] Graph pass to create owned refs of special IValues (#69835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835

`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.

Note that this also applies to returned values in sub-blocks that are from outer scopes.
ghstack-source-id: 148186452

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

Added a new unit test with a graph that simply returns a constant.

Tests with sub-blocks at top of stack.

Reviewed By: d1jang

Differential Revision: D33047519

fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee)
2022-02-02 19:30:50 +00:00
Mike Iovine
3dce68fdf4 [SR] Eliminate op_name_ in ProcessedNode (#71986)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71986

To address concerns over space increase from control flow.

`op_name_` was only stored as a minor optimization to avoid name lookups during logging, so we can safely get rid of it. Thanks to the sampling mechanism, `get_op_name()` is called very infrequently, so this shouldn't cause much of a regression.
ghstack-source-id: 148086244

Test Plan: CI

Reviewed By: d1jang

Differential Revision: D33821005

fbshipit-source-id: 6f74eb30a54a046ca90768aebbcde22e8c435f35
(cherry picked from commit 361ba32e97)
2022-02-01 21:22:26 +00:00
Mike Iovine
4b789df68b [SR] Add BlockRunner and handle sub-blocks (#69834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834

* Modify the `StaticModule` constructor to handle index initialization for sub-blocks.
* Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes!
ghstack-source-id: 148086245

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: d1jang

Differential Revision: D33028039

fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6
(cherry picked from commit 3a9feffd92)
2022-02-01 17:20:55 +00:00
Mike Iovine
a49f2412e4 [SR] Add static runtime scopes to record function (#70944)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70944

Added special net-level/op-level scopes for static runtime. We can use these to add special behavior in record functions when they are invoked from a static runtime context.

Reviewed By: navahgar

Differential Revision: D33458211

fbshipit-source-id: 0b7022100e9f5ac872f4cb5bfba14e92af2c71b0
(cherry picked from commit b486548544)
2022-01-27 18:00:08 +00:00
Mike Iovine
7e6312a5df [SR] Reverse iteration order in resetMemory (#71705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71705

This fixes a crash in `resetMemory` caused by trying to access a `TensorImpl` via a borrowed `IValue` after it had already been destroyed. We need to clean up all borrows *before* we destroy the owning `IValue`, not after.
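A toy model of the ordering constraint (names are illustrative, not the SR types): borrows are created after their owners, so destroying values in reverse creation order releases each borrow while its owner is still alive:

```python
cleanup_log = []

class Value:
    # toy stand-in for an IValue; a borrow holds a reference to its owner
    def __init__(self, name, borrows_from=None):
        self.name = name
        self.borrows_from = borrows_from
        self.destroyed = False

    def destroy(self):
        if self.borrows_from is not None:
            # releasing a borrow touches its owner, so the owner must still
            # be alive; destroying in the wrong order is the crash
            assert not self.borrows_from.destroyed, "use-after-free"
        self.destroyed = True
        cleanup_log.append(self.name)

owner = Value("owner")
borrow = Value("borrow", borrows_from=owner)
# reverse of creation order: the borrow is released first
for v in reversed([owner, borrow]):
    v.destroy()

assert cleanup_log == ["borrow", "owner"]
```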
ghstack-source-id: 147688982

Test Plan:
New unit test covers this case

ICE w/ inline_cvr v0 [finishes successfully](https://www.internalfb.com/intern/unidash/dashboard/ads_infra_cost_estimation/a_metrics/?e[select_ESTIMATION_RUN_ID]=ICE_mikeiovine_16431103211c65), didn't see any nnpi errors

Reviewed By: ajyu

Differential Revision: D33725435

fbshipit-source-id: f8dd109382b5cf54df6f194f8dcb5c0812b174bb
(cherry picked from commit 31339d9d38)
2022-01-26 17:35:03 +00:00
Scott Wolchok
3a77fb244b [PyTorch][Static Runtime] Delete cleanup_activations option (#71501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501

This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it.
ghstack-source-id: 147385569

Test Plan: CI, rerun broken test from task

Reviewed By: mikeiovine

Differential Revision: D33669290

fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a
(cherry picked from commit 5e4fefb109)
2022-01-21 18:15:43 +00:00
Scott Wolchok
1bbea3c3a2 [PyTorch][JIT] Support mayContainAlias(Value*, ArrayRef<Value*>) (#69853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69853

We can implement this overload more efficiently.
ghstack-source-id: 146924693

Test Plan:
patched alias_analysis tests

The time static runtime reports to initialize a predictor for the ctr_mobile_feed local_ro net is 9.5s instead of 10.5s.

Reviewed By: mikeiovine

Differential Revision: D33039731

fbshipit-source-id: 52559d678e9eb00e335b9e0db304e7a5840ea397
2022-01-12 16:53:54 -08:00
Scott Wolchok
cd253938a9 [PyTorch][SR][easy] s/input_or_constant_aliases/external_aliases/ (#69852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69852

Looks like a stale comment.
ghstack-source-id: 146924694

Test Plan: review

Reviewed By: hlu1

Differential Revision: D33033264

fbshipit-source-id: aa0eff463c42716bdd7142d4662d8668af439f68
2022-01-12 16:52:26 -08:00
Elias Ellison
9bccb31306 Remove precise tuple construct flag (#71121)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71121

Test Plan: Imported from OSS

Reviewed By: d1jang

Differential Revision: D33515234

Pulled By: eellison

fbshipit-source-id: 57cfe171b583a6bb4d3493a34b159061e97a11b8
2022-01-11 22:12:36 -08:00
Scott Wolchok
10b40acbdb [PyTorch][Static Runtime] Fast aliasing in select_tensor by manual borrowing (#68122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122

See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.

If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
ghstack-source-id: 146482212

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

--do_profile output for local_ro (updated Dec 10):

```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```

for local: (also Dec 10):

```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)

==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```

Reviewed By: hlu1

Differential Revision: D32318674

fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
2022-01-04 22:36:13 -08:00
Scott Wolchok
4d8fc8693c [PyTorch][Static Runtime] Support memory planning for torch.to() w/o requiring copying (#67223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223

ghstack-source-id: 146482215

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

Reviewed By: hlu1

Differential Revision: D31776259

fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
2022-01-04 22:36:10 -08:00
Scott Wolchok
1507ce90b2 [PyTorch][Static Runtime] Avoid managed output tensor DCHECK (#67221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67221

Update memory leak checks to not require that output tensors are cleaned up.
ghstack-source-id: 146464297

Test Plan: Tests should still pass; reviewers to confirm that this is OK in principle

Reviewed By: d1jang

Differential Revision: D31847567

fbshipit-source-id: bb7ff2f2ed701e2d7de07d8032a1281fccabd6a9
2022-01-04 22:36:07 -08:00
Peter Bell
fa09099ba3 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D33336948

Pulled By: albanD

fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
2022-01-02 13:09:19 -08:00
Mike Iovine
682fab19d4 [SR] verify_and_correct_memory_overlap handles tensor lists (#69774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69774

We recently ran into a nasty bug caused by incorrect schema annotations on an `aten::split` overload. `verify_and_correct_memory_overlap` is supposed to prevent crashes in this scenario, but it didn't because it did not handle `Tensor[]` outputs.

This change extends the memory correction mechanism to handle tensor lists.
ghstack-source-id: 146152478

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D33022494

fbshipit-source-id: 8d1d41ca1d4fd5dfb7c8a66028c391ba63551eb0
2021-12-22 17:18:18 -08:00
Raghavan Raman
a6f953156e [StaticRuntime] Add TensorExpr fusion with dynamic shapes in SR (#69475)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69475

This diff adds TensorExpr fusion with dynamic shapes in SR. This includes tracing the input graph with sample inputs, and then performing fusion with generalization to get fused graphs with dynamic shapes.
ghstack-source-id: 146059043

Test Plan:
```
buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```

Reviewed By: d1jang

Differential Revision: D32320088

fbshipit-source-id: 397f498878ddfcee9dad7a839652f79f034fefe3
2021-12-21 12:41:02 -08:00
Raghavan Raman
91da2d5fa1 [StaticRuntime] Refactor StaticModule to pass in sample inputs (#69473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69473

This diff refactors StaticModule and its uses to pass in sample inputs. These inputs need to be passed into the constructor because they are needed to perform TensorExpr fusion before other optimizations are performed on the input graph.
ghstack-source-id: 146059041

Test Plan: buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test

Reviewed By: donaldong

Differential Revision: D32320084

fbshipit-source-id: b8bd46d442be4cc90ca60f521e0416fdb88eea60
2021-12-21 11:20:25 -08:00
Peter Bell
ef70174f2e Separate c10::Symbol header from list of interned strings (#69406)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69406

Most files that include `interned_strings.h` don't actually depend on
anything generated from `FORALL_NS_SYMBOLS`, yet because it all lives
in a single header, they need to be recompiled whenever a new symbol
is added. Here I move the class definition into a separate file so
this doesn't happen.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32923637

Pulled By: albanD

fbshipit-source-id: 6e488cbfcfe2c041a99d9ff22e167dbddf3f46d7
2021-12-19 14:52:26 -08:00
Nikita Shulga
26e32988bd Revert D32596264: Codegen: TraceType only includes operators being registered
Test Plan: revert-hammer

Differential Revision:
D32596264 (e66a8ab4f5)

Original commit changeset: 2f28b62d7b99

Original Phabricator Diff: D32596264 (e66a8ab4f5)

fbshipit-source-id: 7d18c4e77ce30dd7817a95f9c39b565cb246cd12
2021-12-17 11:20:12 -08:00
Peter Bell
e66a8ab4f5 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, malfet

Differential Revision: D32596264

Pulled By: albanD

fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
2021-12-17 10:35:05 -08:00