Commit Graph

869 Commits

Author SHA1 Message Date
PyTorch MergeBot
d28e9e145b Revert "[nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)"
This reverts commit 49c41b87a2.

Reverted https://github.com/pytorch/pytorch/pull/79147 on behalf of https://github.com/janeyx99 due to Broke 11.3 builds on trunk 49c41b87a2
2022-06-10 20:55:10 +00:00
jjsjann123
49c41b87a2 [nvfuser_upstream_push] nvfuser code base bump 060822 (#79147)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Bug fixes and minor refactor

Squashed commits to work around the GitHub API
Commits that are actually in this PR from the devel branch:

```
4c60e7dff22a494632370e5df55c011007340d06 Add examples infrastructure for using nvFuser in a standalone program (#1725)
02a05d98334ffa580d73ccb28fdb8c577ad296fe Fix issue #1751 (#1753)
8a69aa320bd7629e1709fe5ceb7104d2c88ec84c Refactor NvFuser transpose API to match eager mode behavior (#1746)
ffdf6b7709048170d768217fcd7083fc8387f932 Remove BroadcastWithoutStride. (#1738)
02bab16035e70734450c02124f5cdaa95cf5749d Fix flipping of a boolean flag (#1745)
465d66890c8242e811224359cbdb1c2915490741 cleanup (#1744)
26d354e68720bc7dd2d3b1338ac01b707a230b6a fixing noncontig broadcast (#1742)
856b6b2f9073662dd98ca22ba6c3540e20eb1cdd Add IterDomainBuilder (#1736)
1fd974f912cd4c1e21cbd16e2abb23598d66a02f fixing warning for gcc7 (#1732)
de2740a43a869f8272c2648e091d7b8235097db9 disabling complex in python tests for #1730 (#1733)
fbbbe0a2e7c7a63e0e2719b8bfccb759b714221a fixing MSVC build (#1728)
b5feee5e2b28be688dbddc766f3c0220389c8175 Fix the fused reduction runtime kernel (#1729)
5247682dff5980bb66edf8d3aac25dea2ef2ced5 Re-entrant GroupedGridReduction (#1727)
```

RUN_TORCHBENCH: nvfuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79147
Approved by: https://github.com/davidberard98
2022-06-10 19:37:42 +00:00
Akshay Parashar
28f87b9cf9 [Static Runtime] Fix aten::clone out variant (#78297) (#78322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78297

Clone crashes when it follows expand/expand_as, due to the memory-overlap check in the copy_ native method. Refer to T118519310 for more details.

Crashing test case:
```
a = torch.rand(3, 1)   # strides = (1, 1)
b = torch.rand(3, 2)   # strides = (2, 1)
temp = a.expand_as(b)  # creates temp with shape (3, 2) and strides (1, 0)
temp.clone()           # crashes on copy_ due to memory overlap
```

Fix: Disable the out variant for the expanded tensor.
- Call the native clone instead of the out variant when the input is an expanded tensor
- Added test case for both clone variants (out and native clones)
- Increased the tensor size for memory planner test case to trigger dynamic allocation

Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators

buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Differential Revision: D36672180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78322
Approved by: https://github.com/mikeiovine
2022-06-02 21:06:59 +00:00
Max Podkorytov
ebfc70f37a [static-runtime] out variant for aten::mean (#78161)
Summary: As subject

Test Plan: Added unit tests

Differential Revision: D36614633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78161
Approved by: https://github.com/mikeiovine
2022-06-02 20:56:42 +00:00
Max Podkorytov
2679755bdc [static-runtime] out variant for aten::max (#78271)
Summary: Previously the op was auto-generated, but it only covered the pointwise overload of aten::max. This adds support for the reduction overloads, both overall and along a dim.
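
For reference, the overloads in question at the Python level (a quick illustration):

```
import torch

x = torch.randn(3, 4)
torch.max(x, torch.randn(3, 4))        # pointwise overload (previously covered)
torch.max(x)                           # overall reduction
values, indices = torch.max(x, dim=1)  # reduction along a dim
```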

Test Plan: Added a unit test

Differential Revision: D36656378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78271
Approved by: https://github.com/mikeiovine
2022-05-26 23:29:27 +00:00
Hui Guo
d12bf9fd75 [static_runtime] Add auto-generated view ops (#77106)
Summary: This includes the generated view ops from D36258767.

Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest

Differential Revision: D36258968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77106
Approved by: https://github.com/alanwaketan, https://github.com/tenpercent
2022-05-26 03:13:59 +00:00
mikeiovine
56c23f5633 [SR] Out variant for embedding_bag_byte_unpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77661

Add an out variant and wrapper in static runtime.

I just added the declaration with the others in `qembeddingbag.h` for now (rather than properly adding the out variant to the torch library). This can be fixed in a followup.

Differential Revision: [D36449840](https://our.internmc.facebook.com/intern/diff/D36449840/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36449840/)!

Approved by: https://github.com/tenpercent
2022-05-25 23:24:11 +00:00
mikeiovine
2ae3c59e4b [SR] Remove linear/relu fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77620

Apparently, this is not implemented in fbgemm, so it's strictly worse than using NNC.

Differential Revision: [D36431811](https://our.internmc.facebook.com/intern/diff/D36431811/)

Approved by: https://github.com/hlu1
2022-05-23 21:46:27 +00:00
Hao Lu
c60d2ef4eb [StaticRuntime] Replace Permute with copy version only when it's followed by reshape or flatten (#77832)
Reviewed By: mikeiovine

Differential Revision: D36466622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77832
Approved by: https://github.com/mikeiovine
2022-05-20 03:14:01 +00:00
jjsjann123
a2802ad0b9 Upstream master bump 0513 (#77471)
Updating nvfuser code base.

This should fix the indexing issue observed in https://github.com/pytorch/vision/issues/6015.

Running tests locally as well. Will update the description here at a later point.

@bypass-github-export-checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77471
Approved by: https://github.com/seemethere, https://github.com/eellison
2022-05-18 11:48:50 -07:00
mikeiovine
02713221e3 [SR] Fuse clamp/nan_to_num
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094

Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor.
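
A minimal illustration of why the fusion is sound (assuming finite clamp bounds):

```
import torch

x = torch.tensor([float("inf"), float("-inf"), float("nan"), 5.0])
y = torch.nan_to_num(torch.clamp(x, min=-10.0, max=10.0))
# After clamping to [-10, 10] no infs remain, so the fused kernel only
# needs the NaN check and can skip the inf comparisons.
```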

Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)!

Approved by: https://github.com/navahgar
2022-05-10 23:33:59 +00:00
Mike Iovine
849984a2cd [SR] Sigmoid out variant calls fast_sigmoid (#75661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75661

`fast_sigmoid` is a variant of sigmoid in NNC that is implemented in terms of `fast_tanh` (which is a fast rational function approximation).
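
The underlying identity (the fast approximation itself lives in the NNC kernel; this just shows why a tanh-based sigmoid is valid):

```
import torch

x = torch.randn(8)
# sigmoid(x) == 0.5 * (tanh(x / 2) + 1)
assert torch.allclose(torch.sigmoid(x), 0.5 * (torch.tanh(x / 2) + 1))
```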
ghstack-source-id: 155604086

Reviewed By: navahgar, hlu1

Differential Revision: D35481390

fbshipit-source-id: 1d64b5c375539f3b2461a1f3d9b86cd696eae7a1
(cherry picked from commit 8106c2512b8d7b373cb6545a43c3e8fc04805c4b)
2022-05-06 00:14:30 +00:00
Mike Iovine
1fed6b7559 [SR] Eliminate extra permutes around softmax calls (#76391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76391

I've seen this pattern in many important internal models:

```
x = torch.permute(a, [0, 2, 1])
y = torch.softmax(x, 2)
z = torch.permute(y, [0, 2, 1])
```

This is equivalent to
```
z = torch.softmax(a, 1)
```

The `permute` ops can degrade performance, especially if copy variants are on. Add another pattern to our `EliminateExtraPermuteOpsPass` to handle this.
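
A quick numerical check of the equivalence (for illustration):

```
import torch

a = torch.randn(2, 3, 4)
z1 = torch.permute(torch.softmax(torch.permute(a, [0, 2, 1]), 2), [0, 2, 1])
z2 = torch.softmax(a, 1)
assert torch.allclose(z1, z2)  # the two permutes cancel out
```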
ghstack-source-id: 155466506

Test Plan: New unit tests

Reviewed By: navahgar, huiguoo

Differential Revision: D35938289

fbshipit-source-id: 398b5528077b0b3f1c6fc5544e483803e96d68e9
(cherry picked from commit d742abd094d1fef23ca6a34703d97a6da2d14bd1)
2022-05-04 23:08:49 +00:00
Mike Iovine
cac2733af1 [SR] Codegen for aten::clamp (#76340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76340

NNC kernel for `clamp` scalar case
ghstack-source-id: 155466507

Reviewed By: navahgar, huiguoo

Differential Revision: D35904019

fbshipit-source-id: e4115757f7e2cbdf364b88be3f599dfc3028750f
(cherry picked from commit bdc4b918bc5a14490f46c79793f764b28c18388f)
2022-05-04 23:08:49 +00:00
Wang, Eikan
429a80dded [NNC] Lowering function generates the output buffer with the specified stride (#76529)
Summary:
Pass stride information to the lowering function to generate the output buffer with the proper memory layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76529

Reviewed By: ZolotukhinM

Differential Revision: D36116712

Pulled By: IvanKobzarev

fbshipit-source-id: d3901f756b3710ecce172d6db3ecb0b7c12fb929
(cherry picked from commit b6cd53c91c01db36ea0e99167dc0ce0ae1d3aa23)
2022-05-04 20:04:22 +00:00
Hui Guo
bcddd4ab3e [Static Runtime] Add auto generated unstructured ops (#76398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76398

This diff adds the large files that include the newly generated ops from D34913736. Refer to the base diff for more details.

Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: mikeiovine, tenpercent

Differential Revision: D35945633

fbshipit-source-id: 53497bd5c490a57ea1521837762f740deb42bfd8
(cherry picked from commit e0fbdcb0bf09f5c192430f95f450c0a946c80074)
2022-05-04 19:34:19 +00:00
Mike Iovine
fc64dbdc01 [SR] Fuse quantized linear/relu (#75775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775

fbgemm already implements the fused kernel; no reason not to use it
ghstack-source-id: 155450342

Test Plan: New unit tests

Reviewed By: navahgar

Differential Revision: D35633297

fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48
(cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)
2022-05-04 19:01:14 +00:00
Michael Suo
fb0f285638 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 20:51:34 +00:00
PyTorch MergeBot
3d7428d9ac Revert "[lint] upgrade mypy to latest version"
This reverts commit 9bf18aab94.

Reverted https://github.com/pytorch/pytorch/pull/76753 on behalf of https://github.com/suo
2022-05-03 20:01:18 +00:00
Michael Suo
9bf18aab94 [lint] upgrade mypy to latest version
Fixes https://github.com/pytorch/pytorch/issues/75927.

Had to fix some bugs and add some ignores.

To check if clean:
```
lintrunner --paths-cmd='git grep -Il .' --take MYPY,MYPYSTRICT
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76753

Approved by: https://github.com/malfet
2022-05-03 19:43:28 +00:00
Mike Iovine
b02b3f25db [SR] Quick hack to eliminate no-op slice (#75774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774

`list[0:]` is a no-op. This should really be eliminated on the modeling side; implement it as a graph pass for now until we can get this into prod models.
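
For reference, the Python-level identity being exploited:

```
xs = [1, 2, 3]
assert xs[0:] == xs  # slicing a list from index 0 yields the same elements
```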

Test Plan: New unit tests

Reviewed By: navahgar

Differential Revision: D35632947

fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d
(cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)
2022-05-03 19:29:46 +00:00
Mike Iovine
3fa77fa51a [SR] Fix quantized linear tests not managing outputs (#75776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75776

The output was returned directly instead of a clone, so the output of the relevant op would not be managed.
ghstack-source-id: 154935103

Test Plan: CI

Reviewed By: navahgar

Differential Revision: D35633469

fbshipit-source-id: 7b08b7368e0349a12abf8802a4c625ffecdc5abb
(cherry picked from commit 24bed9ba4da39cff7f3b40f5e49dfded2552b373)
2022-04-27 16:38:54 +00:00
Aaron Enye Shi
09a5b075fe [libkineto] Re-enable user-annotations in PyTorch (#75601)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75601

User annotations were previously pushed down to the GPU timelines but were disabled during a refactoring some time back. This patch re-enables them in the PyTorch Profiler.

Test Plan: CI Tests

Reviewed By: chaekit

Differential Revision: D34591916

Pulled By: aaronenyeshi

fbshipit-source-id: 3f4d5327b391725f4ce4e3eb16740bac2cd1c618
(cherry picked from commit 4bc07174dfef8fb2ffbefba224773a4618ed203a)
2022-04-26 23:54:22 +00:00
zengk95
1d55518198 Revert "[nnc] Strides to Tensor (#72962)"
This reverts commit 939060925f.

Fixes https://github.com/pytorch/vision/issues/5873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76332
Approved by: https://github.com/seemethere
2022-04-25 19:50:00 +00:00
Ivan Kobzarev
939060925f [nnc] Strides to Tensor (#72962)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72962

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM, cpuhrsch

Differential Revision: D34589306

Pulled By: IvanKobzarev

fbshipit-source-id: ecee5249760ecc0c8b2edb1842b90218899bc944
(cherry picked from commit 9e310c4c67389da30da89126d838ffe3864aba6f)
2022-04-23 19:35:15 +00:00
Ansha Yu
ee636e2fd1 [sr] remove max_indices argument of embedding_bag when unnecessary (#75993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993

Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
{F723683014}

More details in https://fb.quip.com/MKumAjz1YD4a#temp:C:FPD3e5a0871ae5d481286b511ef7

The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.

Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`

Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`

Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
        0.11156 ms.    99.0457%. aten::embedding_bag (52 nodes, out variant)

After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
      0.0789221 ms.    98.7096%. static_runtime::embedding_bag (52 nodes, out variant)

* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/

Reviewed By: mikeiovine

Differential Revision: D35726594

fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
2022-04-22 15:36:35 +00:00
vfdev-5
6593d293f7 Added functorch to functional_autograd_benchmark
Description:

- Following https://github.com/pytorch/functorch/issues/497, this adds an option to run the benchmarks with functorch and compare against the original functional autograd results.

Running the benchmark produces the table below:

<details>
<summary>
Table
</summary>

```
| model | task | mean | var |
| -- | -- | -- | -- |
| resnet18 | vjp | 0.03826599195599556 | 4.3332115637895186e-06 |
| resnet18 | functorch vjp | 0.037201929837465286 | 6.139693198292662e-09 |
| resnet18 | vhp | 0.2202976644039154 | 2.8687209052691287e-08 |
| resnet18 | functorch vhp | 0.22117868065834045 | 4.108771278765744e-08 |
| resnet18 | jvp | 0.18679651618003845 | 1.832455254202614e-08 |
| resnet18 | functorch jvp | 0.05305683612823486 | 1.6690266946284282e-08 |
| fcn_resnet | vjp | 0.6071907877922058 | 7.436695454998699e-07 |
| fcn_resnet | functorch vjp | 0.6115708947181702 | 1.121692207561864e-06 |
| fcn_resnet | vhp | 3.419469118118286 | 0.020633839070796967 |
| fcn_resnet | jvp | 2.5421929359436035 | 3.1765587209520163e-06 |
| fcn_resnet | functorch jvp | 0.7628333568572998 | 1.4555752159139956e-07 |
| detr | vjp | 0.19494840502738953 | 1.9122715457342565e-05 |
| detr | vhp | 1.1664292812347412 | 0.000948643428273499 |
| detr | jvp | 0.9990308880805969 | 1.0214127541985363e-05 |
| ppl_simple_reg | vjp | 0.0007535457843914628 | 6.024204690646684e-09 |
| ppl_simple_reg | functorch vjp | 0.0016954183811321855 | 1.160151974488599e-08 |
| ppl_simple_reg | vhp | 0.0011888503795489669 | 5.93119386937957e-10 |
| ppl_simple_reg | functorch vhp | 0.0026826143730431795 | 1.6787025103326414e-08 |
| ppl_simple_reg | jvp | 0.001067900680936873 | 7.409912128331086e-10 |
| ppl_simple_reg | functorch jvp | 0.002065300941467285 | 9.710328185974504e-08 |
| ppl_simple_reg | hvp | 0.001212477684020996 | 1.974137298077494e-09 |
| ppl_simple_reg | functorch hvp | 0.00482442369684577 | 2.327668653379078e-07 |
| ppl_simple_reg | jacobian | 0.0009108781814575195 | 3.489469158068914e-09 |
| ppl_simple_reg | functorch jacobian | 0.0019866942893713713 | 1.938326299466553e-08 |
| ppl_simple_reg | hessian | 0.005053090862929821 | 3.370298600202659e-07 |
| ppl_simple_reg | functorch hessian | 0.006374978926032782 | 7.556796077778927e-08 |
| ppl_simple_reg | hessian_fwdrev | 0.0036706924438476562 | 1.996075527088692e-09 |
| ppl_simple_reg | functorch hessian_fwdrev | 0.0058908225037157536 | 7.548283775804521e-08 |
| ppl_simple_reg | hessian_revrev | 0.0015769004821777344 | 1.5754418214442012e-08 |
| ppl_simple_reg | functorch hessian_revrev | 0.0041002752259373665 | 6.713568723171193e-08 |
| ppl_simple_reg | jacfwd | 0.0018048763740807772 | 2.7375660849315864e-08 |
| ppl_simple_reg | functorch jacfwd | 0.002047991845756769 | 2.432247070416338e-09 |
| ppl_simple_reg | jacrev | 0.0009733677143231034 | 1.0078769818733235e-08 |
| ppl_simple_reg | functorch jacrev | 0.0021971464157104492 | 1.2729884701911942e-08 |
| ppl_robust_reg | vjp | 0.005820560269057751 | 8.582588151284654e-08 |
| ppl_robust_reg | functorch vjp | 0.00796132069081068 | 9.663100541956737e-09 |
| ppl_robust_reg | vhp | 0.009825301356613636 | 2.0081762386325863e-07 |
| ppl_robust_reg | functorch vhp | 0.014890861697494984 | 4.558066279969353e-07 |
| ppl_robust_reg | jvp | 0.008297419175505638 | 2.9454400873873965e-07 |
| ppl_robust_reg | functorch jvp | 0.008052706718444824 | 7.120377176761394e-08 |
| ppl_robust_reg | hvp | 0.015414690598845482 | 7.42123745567369e-07 |
| ppl_robust_reg | functorch hvp | 0.02699306048452854 | 1.4650488537881756e-06 |
| ppl_robust_reg | jacobian | 0.006207776255905628 | 1.7068457225377642e-07 |
| ppl_robust_reg | functorch jacobian | 0.009173822589218616 | 1.2214455580306094e-07 |
| ppl_robust_reg | hessian | 0.04670915752649307 | 1.4299343092716299e-05 |
| ppl_robust_reg | functorch hessian | 0.02337808534502983 | 3.0397418413485866e-06 |
| ppl_robust_reg | hessian_fwdrev | 0.024229884147644043 | 2.0425247839739313e-06 |
| ppl_robust_reg | functorch hessian_fwdrev | 0.022021746262907982 | 3.512146236062108e-07 |
| ppl_robust_reg | hessian_revrev | 0.012355780228972435 | 7.090877147675201e-07 |
| ppl_robust_reg | functorch hessian_revrev | 0.013960313983261585 | 6.326549737423193e-07 |
| ppl_robust_reg | jacfwd | 0.008112502284348011 | 2.88503088086145e-08 |
| ppl_robust_reg | functorch jacfwd | 0.008947920985519886 | 4.2070990247111695e-08 |
| ppl_robust_reg | jacrev | 0.00635871896520257 | 1.3403841592207755e-07 |
| ppl_robust_reg | functorch jacrev | 0.009123563766479492 | 2.677554675756255e-07 |
| wav2letter | vjp | 0.02078995667397976 | 2.1110793113621185e-06 |
| wav2letter | functorch vjp | 0.019202351570129395 | 9.210506135559626e-09 |
| wav2letter | vhp | 0.05997290462255478 | 8.558587616391833e-09 |
| wav2letter | functorch vhp | 0.06035261228680611 | 1.6448565842708263e-09 |
| wav2letter | jvp | 0.04507789760828018 | 1.5771547401399744e-09 |
| wav2letter | functorch jvp | 0.013057494536042213 | 3.804750292601966e-09 |
| deepspeech | vjp | 0.3648746609687805 | 1.5359055396402255e-05 |
| transformer | vjp | 0.05496881157159805 | 1.242562319703211e-08 |
| transformer | functorch vjp | 0.057835936546325684 | 2.6113376350167528e-08 |
| transformer | vhp | 0.18313491344451904 | 7.226336151688884e-08 |
| transformer | jvp | 0.13924935460090637 | 1.6989159234981344e-07 |
| multiheadattn | vjp | 0.0014708995586261153 | 3.710916729460223e-08 |
| multiheadattn | functorch vjp | 0.002404856728389859 | 2.1910574687922235e-08 |
| multiheadattn | vhp | 0.003382015274837613 | 5.3098595742540056e-08 |
| multiheadattn | functorch vhp | 0.005340623669326305 | 5.897558708056749e-08 |
| multiheadattn | jvp | 0.0027526854537427425 | 3.508620949332908e-08 |
| multiheadattn | functorch jvp | 0.0022981404326856136 | 1.327894807445773e-07 |

```

</details>

<details>
<summary>
Stdout
</summary>

```
Found functorch: 0.2.0a0+386a541
Results for model resnet18 on task vjp: 0.03826599195599556s (var: 4.3332115637895186e-06)
Results for model resnet18 on task vjp using Functorch: 0.037201929837465286s (var: 6.139693198292662e-09)
Results for model resnet18 on task vhp: 0.2202976644039154s (var: 2.8687209052691287e-08)
Results for model resnet18 on task vhp using Functorch: 0.22117868065834045s (var: 4.108771278765744e-08)
Results for model resnet18 on task jvp: 0.18679651618003845s (var: 1.832455254202614e-08)
Results for model resnet18 on task jvp using Functorch: 0.05305683612823486s (var: 1.6690266946284282e-08)
Results for model fcn_resnet on task vjp: 0.6071907877922058s (var: 7.436695454998699e-07)
Results for model fcn_resnet on task vjp using Functorch: 0.6115708947181702s (var: 1.121692207561864e-06)
Results for model fcn_resnet on task vhp: 3.419469118118286s (var: 0.020633839070796967)
Failed model using Functorch: fcn_resnet, task: vhp, Error message:
	 CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 47.46 GiB total capacity; 45.62 GiB already allocated; 5.31 MiB free; 46.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Results for model fcn_resnet on task jvp: 2.5421929359436035s (var: 3.1765587209520163e-06)
Results for model fcn_resnet on task jvp using Functorch: 0.7628333568572998s (var: 1.4555752159139956e-07)
Results for model detr on task vjp: 0.19494840502738953s (var: 1.9122715457342565e-05)
Failed model using Functorch: detr, task: vjp, Error message:
	 Cannot access data pointer of Tensor that doesn't have storage
Results for model detr on task vhp: 1.1664292812347412s (var: 0.000948643428273499)
Failed model using Functorch: detr, task: vhp, Error message:
	 Cannot access data pointer of Tensor that doesn't have storage
Results for model detr on task jvp: 0.9990308880805969s (var: 1.0214127541985363e-05)
Failed model using Functorch: detr, task: jvp, Error message:
	 Trying to use forward AD with _cdist_forward that does not support it because it has not been implemented yet.
Please file an issue to PyTorch at https://github.com/pytorch/pytorch/issues/new?template=feature-request.yml so that we can prioritize its implementation.
Results for model ppl_simple_reg on task vjp: 0.0007535457843914628s (var: 6.024204690646684e-09)
Results for model ppl_simple_reg on task vjp using Functorch: 0.0016954183811321855s (var: 1.160151974488599e-08)
Results for model ppl_simple_reg on task vhp: 0.0011888503795489669s (var: 5.93119386937957e-10)
Results for model ppl_simple_reg on task vhp using Functorch: 0.0026826143730431795s (var: 1.6787025103326414e-08)
Results for model ppl_simple_reg on task jvp: 0.001067900680936873s (var: 7.409912128331086e-10)
Results for model ppl_simple_reg on task jvp using Functorch: 0.002065300941467285s (var: 9.710328185974504e-08)
Results for model ppl_simple_reg on task hvp: 0.001212477684020996s (var: 1.974137298077494e-09)
Results for model ppl_simple_reg on task hvp using Functorch: 0.00482442369684577s (var: 2.327668653379078e-07)
Results for model ppl_simple_reg on task jacobian: 0.0009108781814575195s (var: 3.489469158068914e-09)
Results for model ppl_simple_reg on task jacobian using Functorch: 0.0019866942893713713s (var: 1.938326299466553e-08)
Results for model ppl_simple_reg on task hessian: 0.005053090862929821s (var: 3.370298600202659e-07)
Results for model ppl_simple_reg on task hessian using Functorch: 0.006374978926032782s (var: 7.556796077778927e-08)
Results for model ppl_simple_reg on task hessian_fwdrev: 0.0036706924438476562s (var: 1.996075527088692e-09)
Results for model ppl_simple_reg on task hessian_fwdrev using Functorch: 0.0058908225037157536s (var: 7.548283775804521e-08)
Results for model ppl_simple_reg on task hessian_revrev: 0.0015769004821777344s (var: 1.5754418214442012e-08)
Results for model ppl_simple_reg on task hessian_revrev using Functorch: 0.0041002752259373665s (var: 6.713568723171193e-08)
Results for model ppl_simple_reg on task jacfwd: 0.0018048763740807772s (var: 2.7375660849315864e-08)
Results for model ppl_simple_reg on task jacfwd using Functorch: 0.002047991845756769s (var: 2.432247070416338e-09)
Results for model ppl_simple_reg on task jacrev: 0.0009733677143231034s (var: 1.0078769818733235e-08)
Results for model ppl_simple_reg on task jacrev using Functorch: 0.0021971464157104492s (var: 1.2729884701911942e-08)
Results for model ppl_robust_reg on task vjp: 0.005820560269057751s (var: 8.582588151284654e-08)
Results for model ppl_robust_reg on task vjp using Functorch: 0.00796132069081068s (var: 9.663100541956737e-09)
Results for model ppl_robust_reg on task vhp: 0.009825301356613636s (var: 2.0081762386325863e-07)
Results for model ppl_robust_reg on task vhp using Functorch: 0.014890861697494984s (var: 4.558066279969353e-07)
Results for model ppl_robust_reg on task jvp: 0.008297419175505638s (var: 2.9454400873873965e-07)
Results for model ppl_robust_reg on task jvp using Functorch: 0.008052706718444824s (var: 7.120377176761394e-08)
Results for model ppl_robust_reg on task hvp: 0.015414690598845482s (var: 7.42123745567369e-07)
Results for model ppl_robust_reg on task hvp using Functorch: 0.02699306048452854s (var: 1.4650488537881756e-06)
Results for model ppl_robust_reg on task jacobian: 0.006207776255905628s (var: 1.7068457225377642e-07)
Results for model ppl_robust_reg on task jacobian using Functorch: 0.009173822589218616s (var: 1.2214455580306094e-07)
Results for model ppl_robust_reg on task hessian: 0.04670915752649307s (var: 1.4299343092716299e-05)
Results for model ppl_robust_reg on task hessian using Functorch: 0.02337808534502983s (var: 3.0397418413485866e-06)
Results for model ppl_robust_reg on task hessian_fwdrev: 0.024229884147644043s (var: 2.0425247839739313e-06)
Results for model ppl_robust_reg on task hessian_fwdrev using Functorch: 0.022021746262907982s (var: 3.512146236062108e-07)
Results for model ppl_robust_reg on task hessian_revrev: 0.012355780228972435s (var: 7.090877147675201e-07)
Results for model ppl_robust_reg on task hessian_revrev using Functorch: 0.013960313983261585s (var: 6.326549737423193e-07)
Results for model ppl_robust_reg on task jacfwd: 0.008112502284348011s (var: 2.88503088086145e-08)
Results for model ppl_robust_reg on task jacfwd using Functorch: 0.008947920985519886s (var: 4.2070990247111695e-08)
Results for model ppl_robust_reg on task jacrev: 0.00635871896520257s (var: 1.3403841592207755e-07)
Results for model ppl_robust_reg on task jacrev using Functorch: 0.009123563766479492s (var: 2.677554675756255e-07)
Results for model wav2letter on task vjp: 0.02078995667397976s (var: 2.1110793113621185e-06)
Results for model wav2letter on task vjp using Functorch: 0.019202351570129395s (var: 9.210506135559626e-09)
Results for model wav2letter on task vhp: 0.05997290462255478s (var: 8.558587616391833e-09)
Results for model wav2letter on task vhp using Functorch: 0.06035261228680611s (var: 1.6448565842708263e-09)
Results for model wav2letter on task jvp: 0.04507789760828018s (var: 1.5771547401399744e-09)
Results for model wav2letter on task jvp using Functorch: 0.013057494536042213s (var: 3.804750292601966e-09)
Results for model deepspeech on task vjp: 0.3648746609687805s (var: 1.5359055396402255e-05)
Failed model using Functorch: deepspeech, task: vjp, Error message:
	 Cannot access storage of TensorWrapper
Results for model transformer on task vjp: 0.05496881157159805s (var: 1.242562319703211e-08)
Results for model transformer on task vjp using Functorch: 0.057835936546325684s (var: 2.6113376350167528e-08)
Results for model transformer on task vhp: 0.18313491344451904s (var: 7.226336151688884e-08)
Failed model using Functorch: transformer, task: vhp, Error message:
	 bad optional access
Results for model transformer on task jvp: 0.13924935460090637s (var: 1.6989159234981344e-07)
Failed model using Functorch: transformer, task: jvp, Error message:
	 Trying to use forward AD with embedding that does not support it because it has not been implemented yet.
Please file an issue to PyTorch at https://github.com/pytorch/pytorch/issues/new?template=feature-request.yml so that we can prioritize its implementation.
Results for model multiheadattn on task vjp: 0.0014708995586261153s (var: 3.710916729460223e-08)
Results for model multiheadattn on task vjp using Functorch: 0.002404856728389859s (var: 2.1910574687922235e-08)
Results for model multiheadattn on task vhp: 0.003382015274837613s (var: 5.3098595742540056e-08)
Results for model multiheadattn on task vhp using Functorch: 0.005340623669326305s (var: 5.897558708056749e-08)
Results for model multiheadattn on task jvp: 0.0027526854537427425s (var: 3.508620949332908e-08)
Results for model multiheadattn on task jvp using Functorch: 0.0022981404326856136s (var: 1.327894807445773e-07)
```

</details>

All functorch errors are reported in its repository.

cc @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75689
Approved by: https://github.com/zou3519
2022-04-22 14:04:26 +00:00
Mike Iovine
b6a4234090 [SR] Fix broken unit test build (#76111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76111

https://github.com/pytorch/pytorch/pull/68640 broke our build by porting `cat` to structured kernels; not sure how CI didn't catch this
ghstack-source-id: 154335722

Test Plan: CI

Reviewed By: navahgar, ajyu

Differential Revision: D35780296

fbshipit-source-id: 0a262eb06a8d619227e5db10b6a775bf0b2e17c1
(cherry picked from commit aea6fbf9365391011df5211164e3978075d7a5cb)
2022-04-20 18:36:31 +00:00
mikeiovine
98b4a4100d [SR] Add a copy variant for fused_split_and_squeeze
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75660

The outputs of `split_and_squeeze` are passed to `VarStack` in models we care about. `VarStack` has a [fast path](https://www.internalfb.com/code/fbsource/[893193f5277184fd17f4ea3f28fe415a4df37707]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=296-298) for when all of its inputs have the same strides.

Hitting the slow path adds a ton of extra overhead - so much that it's worth it to copy in `split_and_squeeze` and force all of `VarStack`'s inputs to be contiguous so we can take advantage of the fast path.
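
A rough sketch of the stride situation (shapes and names are illustrative, not from the actual models):

```
import torch

x = torch.randn(4, 3, 5)
chunks = [t.squeeze(1) for t in torch.split(x, 1, dim=1)]
print(chunks[0].is_contiguous())  # False: each chunk is a strided view
# Copying each chunk makes it contiguous, so all of the stack inputs
# share the same (contiguous) strides - the fast-path condition above.
out = torch.stack([c.contiguous() for c in chunks])
```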

Differential Revision: [D35513777](https://our.internmc.facebook.com/intern/diff/D35513777/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35513777/)!

Approved by: https://github.com/hlu1
2022-04-13 20:02:01 +00:00
Yulv-git
ac2d2e3a3d Fix some typos.
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75561
Approved by: https://github.com/albanD
2022-04-11 21:55:59 +00:00
Nikita Shulga
80ea6955af Add cuda-11.3+clang9 build workflow (take 2)
To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic)

Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu`

Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests
Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293
Approved by: https://github.com/atalman, https://github.com/seemethere
2022-04-11 17:13:01 +00:00
PyTorch MergeBot
8fe43d76d5 Revert "Add cuda-11.3+clang9 build workflow"
This reverts commit 709fcc862e.

Reverted https://github.com/pytorch/pytorch/pull/75293 on behalf of https://github.com/janeyx99
2022-04-11 15:24:59 +00:00
Nikita Shulga
709fcc862e Add cuda-11.3+clang9 build workflow
To be able to detect unused captures in GPU code lambdas (as gcc does not support this diagnostic)

Remove unused opts lambda capture in `ProcessGroupMPI.cpp` and `Distributions.cu`

Fix sign-compare in nvfuser benchmark and ignore signed unsigned comparison in nvfuser tests
Fixes https://github.com/pytorch/pytorch/issues/75475 by aliasing CMAKE_CUDA_HOST_COMPILER to C_COMPILER when clang is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75293
Approved by: https://github.com/atalman, https://github.com/seemethere
2022-04-11 14:10:57 +00:00
Mike Iovine
2f98fa9147 [SR] Do not manage tensors that escape scope via container (#74966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966

It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this:

```
def forward(self, cond: bool, a, b):
    lst = []
    if cond:
        res = a + b # res should not be managed!!!
        lst.append(res)
    return lst
```

The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`.

The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`.

This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy.
ghstack-source-id: 153288188

Test Plan: New unit tests cover the problematic case

Reviewed By: navahgar

Differential Revision: D35257087

fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7
(cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)
2022-04-07 19:57:57 +00:00
Mike Iovine
4055d1f653 [SR] Fix StaticRuntime move ctor (#74927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927

The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated.

Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue.
ghstack-source-id: 153168602

Test Plan: New unit test/CI

Reviewed By: navahgar

Differential Revision: D35228467

fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3
(cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)
2022-04-07 02:16:37 +00:00
Don Jang
85e163c56b [Static Runtime] Fix a bug that aten::full_like reuses a tensor that does not match arguments (#74255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255

This change fixes a bug where `aten::full_like` reuses a previously allocated tensor that does not match the requested one when the arguments to `aten::full_like` are dynamically changed.
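
An illustration of the dynamic-argument scenario (values and dtypes are hypothetical):

```
import torch

x = torch.randn(2, 2)
a = torch.full_like(x, 1.0)
b = torch.full_like(x, 2.0, dtype=torch.float64)
# A tensor cached from the first call (float32) must not be reused for
# the second call, which requests a different dtype.
```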

Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path.

Reviewed By: mikeiovine

Differential Revision: D34863639

fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608
(cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)
2022-04-05 22:30:41 +00:00
Raghavan Raman
60bda4d06b [Static Runtime] Fix handling relu in quantized linear relu dynamic op
Summary:
The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter.

At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear if those semantics are being followed in this code. So, we are handling this in SR's out-variant implementation until the owner fixes that issue.

This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model.

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16
```

Reviewed By: mikeiovine

Differential Revision: D35366309

fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9
(cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)
2022-04-04 22:16:22 +00:00
Max Podkorytov
11c412a8ec [static-runtime] optimize empty if blocks at runtime (#74987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74987

Add specializations to the `prim::If` operator at runtime to save resources when some of its sub-blocks are empty.
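
A minimal TorchScript example of the shape being optimized (illustrative):

```
import torch

@torch.jit.script
def f(x: torch.Tensor, flag: bool) -> torch.Tensor:
    if flag:
        x = x + 1
    # the else sub-block is empty, so the runtime can now skip it
    return x
```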

Test Plan:
`buck build //caffe2:torch-cpp-cpu`
`buck test //caffe2/benchmarks/static_runtime/...`
Add unit test:
`buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.EmptyIfBlock`

Reviewed By: mikeiovine

Differential Revision: D35262952

fbshipit-source-id: 324f88471f33f035f4d8a9b212716530d8e59df2
(cherry picked from commit 2db1b1a6833b1376fa376f54791effc8e12fb77f)
2022-04-01 05:43:33 +00:00
Norman Ponte
2e8b9c7785 [TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035

- modify `--ai_pep_format` to `--report_aibench` to better reflect underlying framework name change

Reviewed By: tgalkovskyi

Differential Revision: D35257017

fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f
(cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275)
2022-04-01 00:35:42 +00:00
jjsjann123
873ced7cd0 Nvfuser code bump 030122 (#73627)
Summary:
Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR, hence the updated compatibility list.

nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
2022-03-31 08:18:22 +00:00
Mike Iovine
2ca66ffb7d [SR] Force split_and_squeeze usage via graph transformation (#74274)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74274

Reviewed By: navahgar

Differential Revision: D34913889

fbshipit-source-id: 655d3f1e5f4c027cb94758b74826a4b4882e9458
(cherry picked from commit bc94d30b69888ca6633a27090a3b87a08919231a)
2022-03-29 19:13:40 +00:00
Elias Ellison
6694fdaccd Clean up profiling mode and profiling executor strategy (#73875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73875

Previously we had a few settings:
- getExecutor - which toggled between the Profiling Executor and the Legacy executor
- getGraphOptimize - if true, overrides PE/Legacy to run with the simple executor (no optimizations)
and then...
- getProfilingMode - which would set PE to 0 specializations.

The last mode is redundant with getGraphOptimize; we should just remove it and use getGraphOptimize in these cases. It could lead to potentially invalid combinations of logic - what does it mean if getProfilingMode is true but getExecutor is set to false? This would lead to a bug in specialize_autograd_zero in this case, see: https://github.com/pytorch/pytorch/blob/master/torch%2Fcsrc%2Fjit%2Fpasses%2Fspecialize_autogradzero.cpp#L93.

The tests here are failing but get fixed with the PR above it, so I'll squash for landing.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D34938130

Pulled By: eellison

fbshipit-source-id: 1a9c0ae7f6d1cfddc2ed3499a5af611053ae5e1b
(cherry picked from commit cf69ce3d155ba7d334022c42fb2cee54bb088c23)
2022-03-29 18:38:51 +00:00
Mike Iovine
3f37337ed0 [SR] Native implementation for reshape_as (#74585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74585

Native static runtime for `aten::reshape_as`
ghstack-source-id: 152340038

Test Plan: New unit test

Reviewed By: hlu1

Differential Revision: D35060895

fbshipit-source-id: c4e6f8a04c7df3821c7e654bfaf584e5a72ea701
(cherry picked from commit 6fa596cd866a024b6653239e0e30ddad42de242f)
2022-03-28 17:02:14 +00:00
Mike Iovine
9f2344aa40 [SR] Native implementation for select (#74568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74568

Native static runtime implementation for `aten::select(Tensor, int, int)` overload
ghstack-source-id: 152340037

Test Plan: New unit test

Reviewed By: hlu1

Differential Revision: D35053900

fbshipit-source-id: c315d4202a4dfca3360325547af805aea33ecc9f
(cherry picked from commit 8683f214dbd8c081365bad727007bbff969b64d0)
2022-03-28 17:02:14 +00:00
Mike Iovine
facdbe6d72 [SR] Native implementation for IntImplicit (#74562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74562

Add a native implementation for `aten::IntImplicit`, which is similar to `aten::Int` except for a few extra checks it must do
ghstack-source-id: 152340039

Test Plan: New unit tests

Reviewed By: hlu1

Differential Revision: D35052997

fbshipit-source-id: cb2f0faf7c62382e3f13750d8e1280c49c6b9e42
(cherry picked from commit 359c7493f8deaeccebc27e1b6e6e9777850010c1)
2022-03-28 17:02:14 +00:00
Mike Iovine
f5a9c36d0b [SR] Eliminate extra permute ops before aten::sum (#74481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481

This diff fixes an interesting performance issue related to `permute_copy`.

We see this pattern frequently:
```
y = torch.permute(x, (0, 2, 1))
z = torch.sum(y, dim=-1)
```

With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589

But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597

But the permute is actually unnecessary; we can statically turn the graph into the following to ensure that the fast kernel is hit with copy variants on:
```
z = torch.sum(x, dim=1)
```
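
A quick check that the rewrite preserves semantics (for illustration):

```
import torch

x = torch.randn(2, 3, 4)
z1 = torch.sum(torch.permute(x, (0, 2, 1)), dim=-1)
z2 = torch.sum(x, dim=1)
assert torch.allclose(z1, z2)  # the permute was unnecessary
```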
ghstack-source-id: 152003888

Reviewed By: navahgar

Differential Revision: D34992319

fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f
(cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)
2022-03-23 23:00:14 +00:00
Don Jang
6294a2eb7f [Static Runtime] Add out variant wrapper for aten::index_select (#74321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74321

This change adds an out variant wrapper for aten::index_select.
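
For reference, the out variant at the Python level:

```
import torch

x = torch.randn(4, 3)
idx = torch.tensor([0, 2])
out = torch.empty(2, 3)
torch.index_select(x, 0, idx, out=out)  # writes into a preallocated tensor
```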

Test Plan: Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34928012

fbshipit-source-id: d808363d740d79fa25abee4dd33920fbb6ec7283
(cherry picked from commit ba9b3c0cd4ba240c4a2174f3376580a1880b2b4a)
2022-03-16 23:43:21 +00:00
Mike Iovine
f14a0be302 [SR] Avoid allocating rstd/mean in layer_norm (#73606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606

The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
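
For context, the two extra tensors are the mean/rstd outputs visible from Python via `torch.native_layer_norm` (a sketch for illustration):

```
import torch

x = torch.randn(2, 3)
y = torch.nn.functional.layer_norm(x, (3,))  # single-output overload
out, mean, rstd = torch.native_layer_norm(x, (3,), None, None, 1e-5)
# mean and rstd are the internally allocated tensors discussed above
```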
ghstack-source-id: 151394116

Test Plan: Existing unit tests

Reviewed By: hlu1

Differential Revision: D34562131

fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
2022-03-15 22:07:11 +00:00
Don Jang
381c0c080f [Static Runtime] Fix a bug that aten::full reuses a tensor that does not match requested one (#73990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990

This change fixes a bug where `aten::full` reuses a previously allocated tensor that does not match the requested one when the arguments to `aten::full` are dynamically changed.

The same fix is applied to multiple other out variant wrappers added to Static Runtime; their fixes follow.

Test Plan: - Added a unittest.

Reviewed By: mikeiovine

Differential Revision: D34768718

fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9
(cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)
2022-03-15 16:13:52 +00:00
Don Jang
1b80f609b0 [Static Runtime] Add out variant wrapper for aten::ones_like (#73945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945

This change adds an out variant wrapper for aten::ones_like.

Test Plan:
- Added a unittest.
- Checked that the op execution got switched to its added out variant (P485330978).

Reviewed By: hlu1

Differential Revision: D34727057

fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5
(cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)
2022-03-14 20:29:58 +00:00
Don Jang
60f22a40ef [Static Runtime] Add out variant wrapper for aten::zeros (#73946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73946

This change adds an out variant wrapper for aten::zeros.

Test Plan:
- Added a unittest.

- Confirmed that the added out variant gets executed by the unittest (P485324923).

Reviewed By: mikeiovine

Differential Revision: D34725843

fbshipit-source-id: 3ac02ba1914c4a51969381e610d4243df65071ed
(cherry picked from commit 368836d51709b7f96c79114984a95606b29766b1)
2022-03-11 00:52:30 +00:00
Don Jang
87564a1bd7 [Static Runtime] Add native op support for aten::len (#73899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73899

This change adds native op wrappers to Static Runtime as they appear in JIT (https://www.internalfb.com/code/fbsource/[429d233b9beb5e6f60df7304b792e2ff332f6ecd]/fbcode/caffe2/torch/csrc/jit/runtime/register_prim_ops.cpp?lines=613; search for "aten::len" in that file).
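
A minimal TorchScript function that produces an `aten::len` node (illustrative):

```
import torch
from typing import List

@torch.jit.script
def f(xs: List[int]) -> int:
    return len(xs)  # compiles to aten::len, now run natively by SR
```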

Test Plan: Added unittests, "StaticRuntime.LenWith*", and confirmed they are passing with `V0307 17:39:39.817956 3516654 impl.cpp:1792] Switch to native impl for node: %2 : int = aten::len(%input.1)` per added unittest: P485159811

Reviewed By: mikeiovine

Differential Revision: D34705231

fbshipit-source-id: 916b1f8bdbc92def07bc3f98ce1db22f0f5ce206
(cherry picked from commit 66d2bb9a0a294b55e1bc87ae33f5553b1460e74b)
2022-03-10 02:57:51 +00:00
Mike Iovine
97b20b9b50 [SR][easy] Stack/concat out variants do not segfault on empty inputs (#73704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73704

Empty inputs are invalid for these ops. But while looking for optimizations, I noticed that these ops just segfault when that happens, which is not helpful for users. Added a check/error message.
ghstack-source-id: 150812721

Test Plan: New unit tests

Reviewed By: hlu1

Differential Revision: D34596954

fbshipit-source-id: 6b22a3a255273920210dcd41f54a9d238bbbcc14
(cherry picked from commit 9e950bfffef36c320638662bdb72f19eb805a228)
2022-03-09 00:55:57 +00:00
Sergii Dymchenko
5b011fc6eb Fix Undefined variable in QInterpolateBenchmark
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73130
Approved by: https://github.com/malfet
2022-03-09 00:14:15 +00:00
Don Jang
71961d37bb [Static Runtime] Add out variant wrapper for aten::ones (#73851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73851

This change adds an out variant wrapper for aten::ones

Test Plan: Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34557095

fbshipit-source-id: 0d2ac8d0ad6f73067e28c2cebd3b4a018a9b17ae
(cherry picked from commit cc1dda957b8c3acd71de3aa6054c11a9aab5cfa6)
2022-03-07 20:33:22 +00:00
Mike Iovine
818bf361b6 [SR] Fix a kwargs API default value bug (#73681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681

Static runtime is rejecting legal calls made with the kwargs API when there are parameters with default values.
ghstack-source-id: 150433627

Test Plan: Added unit test to cover this case

Reviewed By: navahgar, d1jang

Differential Revision: D34588804

fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755
(cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)
2022-03-03 22:31:37 +00:00
Don Jang
bbc59ff2bf [Static Runtime] Introduce StaticNodeInfo to store ProcessedNode's data independent from runtime instances (#73536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536

Currently the `ProcessedNode` class assumes 2 distinct roles that are not too obvious:

1) a "template" that contains the metadata of an actual executable node used by the runtime, owned by `StaticModule`

2) a fully instantiated node owned by `StaticRuntime`.

We currently merge these two use cases into one class, which can be error-prone if illegal copying happens uncontrollably. Currently, we only copy objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created.

To address this issue, this change introduces `StaticNodeInfo`, a separate class, to distinguish the aforementioned two use cases more clearly in the code. With this, `StaticNodeInfo` is for (1) and `ProcessedNode` is now for (2).

Test Plan: Existing tests

Reviewed By: mikeiovine

Differential Revision: D33985600

fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c
(cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)
2022-03-02 22:33:32 +00:00
Don Jang
539acb29cd [Static Runtime] Fix a broken test & Add an out variant wrapper for mse_loss (#73574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73574

T113070663 identified a test breakage in `StaticRuntime.autogen__convert_indices_from_csr_to_coo` using `mode/opt`. This change fixes it by using a correct test value.

By generating the out variants for Static Runtime, this change also includes an out variant wrapper for `mse_loss`.

**Generating out variants for Static Runtime**
```
[djang@devbig024.ftw2 ~/fbsource/fbcode/caffe2]  buck run //caffe2/torch/fb/jit:gen_static_runtime_ops
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
  Changed value //fbcode.sanitizer='address-undefined-dev' (was 'thread')
  ... and 13 more. See logs for all changes
Parsing buck files: finished in 0.8 sec
Downloaded 14/25 artifacts, 159.45 Kbytes, 30.0% cache miss (for updated rules)
Building: finished in 1.6 sec (100%) 52/52 jobs, 35/52 updated
  Total time: 2.5 sec
BUILD SUCCEEDED
total grouped native ops: 1501
structured grouped native ops: 540
generated grouped native ops: 137
```

Test Plan:
Ran the broken test in `mode/opt` and confirmed that it passes now.

```
[djang@devbig024.ftw2 ~/fbsource/fbcode/caffe2] buck test mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --exact 'caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.autogen__convert_indices_from_csr_to_coo' --run-disabled
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
  Changed value //project.buck_out='buck-out/opt' (was 'buck-out/dev')
  ... and 307 more. See logs for all changes
DEBUG: /data/users/djang/fbsource/tools/build_defs/fbcode_macros/build_defs/lib/cpp_common.bzl:287:14: Using disallowed linker flag 'arvr/third-party/toolchains/platform009/build/mesa/lib/libGL.so' in library rule 'fbsource//third-party/toolchains:opengl'
DEBUG: /data/users/djang/fbsource/tools/build_defs/fbcode_macros/build_defs/lib/cpp_common.bzl:287:14: Using disallowed linker flag 'arvr/third-party/freeglut/3.0.0/libs/x64-linux/libglut.a' in library rule 'fbsource//third-party/toolchains:GLUT'
I0301 08:28:08.884272 2239319 configeratorc.cpp:70] Attempting to get config buck/detectors/bypass_dirty_builds, timeout=10000

I0301 08:30:14.751745 2261718 configeratorc.cpp:70] Attempting to get config buck/detectors/bypass_dirty_builds, timeout=10000

Parsing buck files: finished in 10.1 sec
Creating action graph: finished in 6.1 sec
[RE] Metadata: Session ID=[https://fburl.com/b/reSessionID-fa0ba93b-33a1-4e6f-88f8-9f508d2c27c3]
[RE] Waiting on 0 remote actions. Completed 247 actions remotely, action cache hit rate: 0.00%.
Downloaded 13000/17457 artifacts, 463.99 Mbytes, 2.6% cache miss (for updated rules)
Building: finished in 04:16.6 min (100%) 28628/28628 jobs, 28628/28628 updated
  Total time: 04:32.9 min
More details at https://www.internalfb.com/intern/buck/build/c774ff43-5311-49ce-a677-30e3f6afdad1
BUILD SUCCEEDED
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 16d9b24c-4a63-4671-84b5-690fac0ee086
Trace available for this run at /tmp/tpx-20220301-083049.472831-16d9b24c-4a63-4671-84b5-690fac0ee086/trace.log
RemoteExecution session id: reSessionID-16d9b24c-4a63-4671-84b5-690fac0ee086-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4503599719295685
    ✓ ListingSuccess: caffe2/benchmarks/static_runtime:static_runtime_cpptest : 285 tests discovered (0.425)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.autogen__convert_indices_from_csr_to_coo (0.105)
Summary
  Pass: 1
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4503599719295685
```

Reviewed By: mikeiovine

Differential Revision: D34552645

fbshipit-source-id: 36f15b0f29edcb7deb71ba8a6f66ce2532bf7c82
(cherry picked from commit 2329afd8bfc89671cfbd864414e528241e7045fc)
2022-03-02 04:36:31 +00:00
Raghavan Raman
cfd92f2d59 [Static Runtime] Add test that runs NNC fused kernels in parallel (#73256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73256

This adds a test that executes multiple Static Runtime instances in parallel when each instance includes a fusion.
ghstack-source-id: 149787403

Test Plan:
```
buck run mode/dev-asan //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=CpuFusion.ParallelRuntimes
```

The above test results in an error: P482317015 (when parts of the fix in D34287960 (6d33852685) are backed out)

Reviewed By: mikeiovine

Differential Revision: D34404127

fbshipit-source-id: 95a267e27d74584df90841fe496f909171136981
(cherry picked from commit 57d3ad9a46a24559f6d4f4097bd1b8e0b1f6b077)
2022-02-28 17:44:45 +00:00
Don Jang
fe7e1bd1ce [Static Runtime] Add auto-generated out variant dispatchers (#72603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72603

This change adds out variant dispatchers generated by the previous diff.

The number of the out variant dispatchers generated by this diff is 133, which increases the out variant coverage by 309% (current: 43, this diff: 133 + 43 = 176). This number is expected to increase a lot as we develop this script further to cover more ops.

Test Plan:
**Unittest**
Confirmed
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
is passing.

Reviewed By: swolchok

Differential Revision: D33373928

fbshipit-source-id: 4d94d788282f3f313bb36f2f9452edecd9862246
(cherry picked from commit e4ce8b386d1fcc47b86cb9c9016a70e7a31b452c)
2022-02-28 08:39:10 +00:00
Mike Iovine
d398d4d32c [SR] Disable aten::where out variant (#73367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73367

The op currently has a bug when the `condition` input does not have the same shape as the others:

```
def forward(self, cond_1d, x, y):
    shape = [-1] + [1] * (x.dim() - 1)
    cond = cond_1d.view(shape)
    return torch.where(cond, x, y).clone()

Condition:
01
00
[ CPUBoolType{2} ]

A:
06 -9
08 -8
[ CPULongType{2,2} ]

B:
-4 05
-5 -2
[ CPULongType{2,2} ]

Actual:
06 05
-5 -2
[ CPULongType{2,2} ]

Expected:
06 -9
-5 -2
[ CPULongType{2,2} ]
```
ghstack-source-id: 149963254

Test Plan: Unit tests exercise broadcasting

Reviewed By: d1jang

Differential Revision: D34454770

fbshipit-source-id: 6ad4c4ca6893d2b87852a17d437437d99ca94ab4
(cherry picked from commit 7135bc40e9fd930c08f5291b7d6b4902ec30005b)
2022-02-26 01:08:45 +00:00
Raghavan Raman
4838c6dca0 [Static Runtime] Enable all tests to run with TensorExpr fuser (#73263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73263

ghstack-source-id: 149784887

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: d1jang

Differential Revision: D34405943

fbshipit-source-id: 63e345be4bf7a57bf4e1446074e5112d4ed68515
(cherry picked from commit 69a28a6b4ea53bcd88a51c4c36d5205577d84da3)
2022-02-24 00:34:34 +00:00
Raghavan Raman
02afdd54b9 [Static Runtime] Handle fallback graphs that are generated as part of the TE Fuser (#72945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72945

ghstack-source-id: 149429754

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=CpuFusion.FallbackGraph
```

Reviewed By: mikeiovine

Differential Revision: D34283840

fbshipit-source-id: 868bd340a50fe691797164524f2400d07998d304
(cherry picked from commit 80f60f2cc0)
2022-02-18 18:34:50 +00:00
Mike Iovine
d1c5f9e439 [JIT][SR] Introduce prim::IfThenElse (#72587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72587

This pattern frequently appears in a few graphs:

```
%result = prim::If(%condition)
  block0():
    -> (%a)
  block1():
    -> (%b)
```

This is slow, particularly in static runtime. Static runtime creates memory planners/block runners for each sub-block, which eats up a lot of memory and introduces a lot of extra overhead for this relatively simple operation.

This diff introduces a new op that replaces nodes like the above with a single op meant to act like a ternary operator:

```
%result = prim::IfThenElse(%condition, %a, %b)
```
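
For intuition, here is a standalone C++ sketch (hypothetical names; not the actual SR kernel) of why the single-op form is cheap: prim::IfThenElse just selects one of two already-computed inputs, so no sub-block runners or per-block memory planners are needed.

```
#include <iostream>
#include <string>

// Stand-in for an IValue; the real runtime selects between two IValues.
struct Value { std::string payload; };

// Hypothetical kernel for prim::IfThenElse(cond, a, b): a plain ternary.
Value if_then_else(bool cond, const Value& a, const Value& b) {
  return cond ? a : b;  // one select, no sub-block execution machinery
}

int main() {
  Value a{"a"}, b{"b"};
  std::cout << if_then_else(true, a, b).payload << "\n";   // prints "a"
  std::cout << if_then_else(false, a, b).payload << "\n";  // prints "b"
}
```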

Test Plan: New unit tests

Reviewed By: eellison

Differential Revision: D34091789

fbshipit-source-id: eb6a8c460c39b4c019a1f4ab1f3f1e5b6edc400c
(cherry picked from commit 0f1b335e5b)
2022-02-17 18:22:48 +00:00
Sergii Dymchenko
486572223b Fix command example (#72847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72847

Reviewed By: malfet

Differential Revision: D34260868

Pulled By: kit1980

fbshipit-source-id: 1b225f3c2c7a822e44df4bbd91766e6533eab6d7
(cherry picked from commit c9e874c4d8)
2022-02-16 21:45:45 +00:00
Mike Iovine
d2c0c0b638 [SR] Apply all graph passes to sub-blocks (#72598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72598

Apply all optimizations to sub-blocks by replacing loops over `graph->nodes()` with loops over nodes in `DepthFirstGraphNodeIterator`
ghstack-source-id: 149155700

Test Plan: Existing unit tests

Reviewed By: d1jang

Differential Revision: D34111430

fbshipit-source-id: 015076030368bb67df24ed5892475534b8f8f272
(cherry picked from commit a4314520de)
2022-02-15 20:19:42 +00:00
jiej
2d110d514f Nvfuser code bump 2_1_2022 (#72127)
Summary:
Things changed in this PR that require review:
1. aten/src/ATen/core/interned_strings.h
2. torch/csrc/jit/ir/alias_analysis.h : exposing createValue to allow efficient mutation
3. torch/csrc/jit/runtime/symbolic_shape_registry.cpp : added gelu/tanh/erf in registry
4. torch/jit/_script.py : throws when scripting a model that uses autocast as a decorator, since it's not supported

nvfuser code update:
1. codegen improvements and performance tuning
2. integration bug fixes for shape expression logic
3. kernel segmentation update to address perf regression from horizontal fusion
4. scalar CPU tensor promotion to support inter-device operations between CPU scalar tensors and CUDA tensors

Things reverted from local changes:
aten::gelu with approximation (tracked in PR: https://github.com/pytorch/pytorch/pull/61439)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/72127

Reviewed By: HamidShojanazeri

Differential Revision: D34113233

Pulled By: jbschlosser

fbshipit-source-id: b82cde32b71e324eca0ea57cb8c9f9647278ca74
(cherry picked from commit e009bc5c4e)
2022-02-15 00:43:16 +00:00
Mikhail Zolotukhin
1855b14922 [TensorExpr] Delete DimArg class. (#72390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72390

This class didn't add much value and only caused more boilerplate code.
This change removes the class and updates all use sites to use `ExprHandle` instead.

A side effect of this change is different names in loop variables, which
caused massive mechanical changes in our tests.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D34030296

Pulled By: ZolotukhinM

fbshipit-source-id: 2ba4e313506a43ab129a10d99e72b638b7d40108
(cherry picked from commit c2ec46a058)
2022-02-11 01:21:59 +00:00
Mike Iovine
c975b928ab [SR][easy] CPU fuser uses native control flow (#72544)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72544

Now that static runtime supports control flow, there's no need to fall back to the JIT. We get better performance with the native control flow since we avoid heap allocation/ref count bumps during stack construction.

I've left the old `prim::TensorExprDynamicGroup` around in case we need to support it in the future. I've also added native support for a few scalar ops that are used inside the control flow sub-blocks.
ghstack-source-id: 148825816

Test Plan: New unit tests

Reviewed By: d1jang

Differential Revision: D34083080

fbshipit-source-id: a7ffc0fda39ab3df3ba47e44a03d857131dc1e50
(cherry picked from commit 2ef39e0e54)
2022-02-10 18:40:39 +00:00
Don Jang
84729cef70 [Static Runtime] Fix a bug in aten::slice to honor optional arguments (#72530)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72530

This bug was revealed from a failed attempt to run a feed/story model.

Test Plan:
- This fix was tested to successfully run the failed model: P479037453
- Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34055801

fbshipit-source-id: 4a3e06bbb3b9fa78b0514c9c67aa4a0b79f46a8d
(cherry picked from commit bfa2bfb81c)
2022-02-09 17:05:45 +00:00
Mike Iovine
6c0521b919 [SR] Add native implementations for converted prim ops (#71474)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71474

The PyTorch edge team is working on promoting some prim ops to interpreter instructions (see D33398092). Since the JIT fallback ops will be unavailable soon, we need to implement these ops in static runtime.

Ops not included in this diff:
* `aten::__is__` and `aten::__isnot__`: disabled in static runtime for unrelated reasons
* `prim::NumToTensor` and `aten::__get__.Dict` already exist
ghstack-source-id: 148641179

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D33657816

fbshipit-source-id: 6d15244ae1024a56d3b25e51a433fa104ce8ee5e
(cherry picked from commit 33f8f861ff)
2022-02-08 23:25:34 +00:00
Raghavan Raman
4eb277ac61 [bench] Adding a cpp benchmark to compare performance of nnc with static and symbolic shapes (#72197)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72197

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D33951742

Pulled By: navahgar

fbshipit-source-id: 0412d61da158e98429f377469e1c331587390b14
(cherry picked from commit c043fdfc79)
2022-02-07 07:01:19 +00:00
Raghavan Raman
237e960ec9 [bench] Fix build issues with TensorExpr cpp benchmarks (#72196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72196

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D33951743

Pulled By: navahgar

fbshipit-source-id: f1b36bb3ba9cd649f0dbf0911f5a9e4791089e65
(cherry picked from commit fbe5cadb5f)
2022-02-07 07:01:19 +00:00
Raghavan Raman
38f696c0cd [nnc] Add a API to unroll loops by a given factor (#72071)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72071

Reviewed By: ngimel

Differential Revision: D33946250

Pulled By: navahgar

fbshipit-source-id: 3f3f92054174620025a9d71154d006f1738953e2
(cherry picked from commit d8b53598e9)
2022-02-03 18:41:21 +00:00
Mike Iovine
cff5e22a72 [SR] Relax aten::__is__ constraint for SR enablement (#71807)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807

There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in

```
def forward(x):
    y = x.detach()
    # Should be false, but we get True
    # after our EliminateNoOps pass
    return x is y
```

Test Plan: New unit test covers this case

Reviewed By: d1jang

Differential Revision: D33783668

fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4
(cherry picked from commit 019588f4ca)
2022-02-03 12:18:46 +00:00
Mike Iovine
2d5296b0e7 [SR] Implement prim::Loop (#69838)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838

Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186483

Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: d1jang

Differential Revision: D33049595

fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532
(cherry picked from commit 8baf1753af)
2022-02-02 19:30:50 +00:00
Mike Iovine
2aa699505d [SR] Implement prim::If (#69837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837

Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction.
ghstack-source-id: 148186475

Test Plan:
New unit tests: `buck test caffe2/benchmarks/static_runtime/...`

Accuracy test at top of stack

Reviewed By: d1jang

Differential Revision: D33045908

fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55
(cherry picked from commit de3b12dc08)
2022-02-02 19:30:50 +00:00
Mike Iovine
d2599701fd [SR] Force sub-blocks to return at least one output (#69836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836

It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned.

Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`.
ghstack-source-id: 148186453

Test Plan: Sub-blocks with no return values tested at top of stack

Reviewed By: d1jang

Differential Revision: D33050420

fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149
(cherry picked from commit c97fd07bf5)
2022-02-02 19:30:50 +00:00
Mike Iovine
238dded10f [SR] Graph pass to create owned refs of special IValues (#69835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835

`StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`.

Note that this also applies to returned values in sub-blocks that are from outer scopes.
ghstack-source-id: 148186452

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

Added a new unit test with a graph that simply returns a constant.

Tests with sub-blocks at top of stack.

Reviewed By: d1jang

Differential Revision: D33047519

fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b
(cherry picked from commit a8f89a12ee)
2022-02-02 19:30:50 +00:00
Mike Iovine
4b789df68b [SR] Add BlockRunner and handle sub-blocks (#69834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834

* Modify the `StaticModule` constructor to handle index initialization for sub-blocks.
* Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes! (The ownership split is sketched after this list.)
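
A minimal standalone sketch of this ownership split, with hypothetical stand-in types:

```
#include <cstddef>
#include <vector>

struct IValueLike { int v = 0; };

// Runs one (sub-)block; holds only an unowned pointer into the shared array.
struct BlockRunnerLike {
  IValueLike* values;
  explicit BlockRunnerLike(IValueLike* vals) : values(vals) {}
  int read(std::size_t idx) const { return values[idx].v; }
};

// The runtime is the single owner of the values array.
struct RuntimeLike {
  std::vector<IValueLike> values_;
  std::vector<BlockRunnerLike> block_runners_;
  explicit RuntimeLike(std::size_t n) : values_(n) {
    block_runners_.emplace_back(values_.data());  // sub-blocks share values_
  }
};

int main() {
  RuntimeLike rt(8);
  rt.values_[3].v = 42;
  // A sub-block can read a value produced in an enclosing scope:
  return rt.block_runners_[0].read(3) == 42 ? 0 : 1;
}
```
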
ghstack-source-id: 148086245

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: d1jang

Differential Revision: D33028039

fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6
(cherry picked from commit 3a9feffd92)
2022-02-01 17:20:55 +00:00
Mike Iovine
7e6312a5df [SR] Reverse iteration order in resetMemory (#71705)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71705

This fixes a crash in `resetMemory` caused by trying to access a `TensorImpl` via a borrowed `IValue` after it had already been destroyed. We need to clean up all borrows *before* we destroy the owning `IValue`, not after.
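
A simplified standalone sketch of the ordering fix (stand-in types, not the real `MemoryPlanner` internals): borrows are released first by walking the entries in reverse creation order.

```
#include <cassert>
#include <memory>
#include <vector>

struct OwnerLike { std::shared_ptr<int> impl = std::make_shared<int>(7); };

// Each entry either owns an object or borrows a pointer into one.
struct Entry {
  OwnerLike* owner = nullptr;
  int* borrow = nullptr;
};

int main() {
  OwnerLike owner;
  std::vector<Entry> entries;
  entries.push_back({&owner, nullptr});            // created first: the owner
  entries.push_back({nullptr, owner.impl.get()});  // created later: a borrow
  // Reverse iteration releases the borrow before the owner is destroyed,
  // so the borrowed pointer never dangles while we touch it.
  for (auto it = entries.rbegin(); it != entries.rend(); ++it) {
    if (it->borrow != nullptr) {
      assert(*it->borrow == 7);  // still valid: owner not destroyed yet
      it->borrow = nullptr;
    } else {
      it->owner->impl.reset();   // destroy the owned object last
    }
  }
}
```
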
ghstack-source-id: 147688982

Test Plan:
New unit test covers this case

ICE w/ inline_cvr v0 [finishes successfully](https://www.internalfb.com/intern/unidash/dashboard/ads_infra_cost_estimation/a_metrics/?e[select_ESTIMATION_RUN_ID]=ICE_mikeiovine_16431103211c65), didn't see any nnpi errors

Reviewed By: ajyu

Differential Revision: D33725435

fbshipit-source-id: f8dd109382b5cf54df6f194f8dcb5c0812b174bb
(cherry picked from commit 31339d9d38)
2022-01-26 17:35:03 +00:00
Scott Wolchok
3a77fb244b [PyTorch][Static Runtime] Delete cleanup_activations option (#71501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501

This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it.
ghstack-source-id: 147385569

Test Plan: CI, rerun broken test from task

Reviewed By: mikeiovine

Differential Revision: D33669290

fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a
(cherry picked from commit 5e4fefb109)
2022-01-21 18:15:43 +00:00
Mike Iovine
ffdc0e23af [SR] Add various missing native ops (#71113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71113

This diff adds a variety of missing ~~out variants~~/native ops. Most of these are trivial, so I included them all in one diff.

Native ops
* `aten::mul` (list variant)
* `aten::sub` (int variant)
* `aten::add` (list variant)
* `aten::Int`

Out variants
* ~~`aten::gt`~~ (codegen will handle)
* ~~`aten::eq`~~ (codegen will handle)
ghstack-source-id: 146927552

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D33510756

fbshipit-source-id: df385958b9561955b2e866dab2e4c050abd26766
2022-01-12 18:40:31 -08:00
Scott Wolchok
10b40acbdb [PyTorch][Static Runtime] Fast aliasing in select_tensor by manual borrowing (#68122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68122

See code comments for details; in brief, we repurpose support
for borrowing `Tensor`s in `MaybeOwned` to make the `select_tensor`
output a borrowed IValue that we have to clean up manually.

If we have any other ops that always create a new reference to an
existing Tensor, we can easily apply this same optimization.
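
For intuition, a mini standalone version of the borrow/owned distinction that `c10::MaybeOwned` provides (a sketch only; the real class is more careful about lifetimes and layout):

```
#include <iostream>
#include <string>
#include <utility>

// Mini MaybeOwned: either a non-owning borrow or an owned value.
template <typename T>
class MaybeOwnedLike {
  const T* borrow_ = nullptr;  // active when not owned
  T own_{};                    // active when owned
  bool owned_ = false;
  MaybeOwnedLike() = default;

 public:
  static MaybeOwnedLike borrowed(const T& t) {
    MaybeOwnedLike m;
    m.borrow_ = &t;  // no copy, no refcount bump
    return m;
  }
  static MaybeOwnedLike owned(T&& t) {
    MaybeOwnedLike m;
    m.own_ = std::move(t);
    m.owned_ = true;
    return m;
  }
  const T& operator*() const { return owned_ ? own_ : *borrow_; }
};

int main() {
  std::string s = "tensor";
  auto b = MaybeOwnedLike<std::string>::borrowed(s);  // caller keeps s alive
                                                      // and cleans up manually
  auto o = MaybeOwnedLike<std::string>::owned(std::string("copy"));
  std::cout << *b << ' ' << *o << '\n';
}
```
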
ghstack-source-id: 146482212

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

--do_profile output for local_ro (updated Dec 10):

```
swolchok@devbig032 /d/u/s/f/fbcode> tail Stable.profile.txt
First iter time: 0.989023 ms
Number of operators: 2037
Total number of managed tensors: 1597
Total number of managed output tensors: 0
Total number of unmanaged values: 2568
Number of unmanaged values requiring cleanup: 2568
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 50368 bytes
Total number of reused tensors: 1010
Total number of 'out' variant nodes/total number of nodes: 2001/2037 (98.2327%)
swolchok@devbig032 /d/u/s/f/fbcode> ttail TMCC^C
swolchok@devbig032 /d/u/s/f/fbcode> tail TMCOFastAliasing.profile.txt
First iter time: 0.994703 ms
Number of operators: 2551
Total number of managed tensors: 1146
Total number of managed output tensors: 0
Total number of unmanaged values: 4047
Number of unmanaged values requiring cleanup: 3533
Number of unmanaged values not requiring cleanup: 514
Total memory managed: 50048 bytes
Total number of reused tensors: 559
Total number of 'out' variant nodes/total number of nodes: 2001/2551 (78.4398%)
```

for local: (also Dec 10):

```
==> Stable.local.profile.txt <==
First iter time: 9.0909 ms
Number of operators: 1766
Total number of managed tensors: 1894
Total number of managed output tensors: 0
Total number of unmanaged values: 2014
Number of unmanaged values requiring cleanup: 2014
Number of unmanaged values not requiring cleanup: 0
Total memory managed: 4541440 bytes
Total number of reused tensors: 847
Total number of 'out' variant nodes/total number of nodes: 1744/1766 (98.7542%)

==> TMCOFastAliasing.local.profile.txt <==
First iter time: 7.5512 ms
Number of operators: 2378
Total number of managed tensors: 1629
Total number of managed output tensors: 0
Total number of unmanaged values: 3503
Number of unmanaged values requiring cleanup: 2891
Number of unmanaged values not requiring cleanup: 612
Total memory managed: 3949312 bytes
Total number of reused tensors: 586
Total number of 'out' variant nodes/total number of nodes: 1744/2378 (73.3389%)
```

Reviewed By: hlu1

Differential Revision: D32318674

fbshipit-source-id: a2d781105936fda2a3436d32ea22a196f82dc783
2022-01-04 22:36:13 -08:00
Scott Wolchok
4d8fc8693c [PyTorch][Static Runtime] Support memory planning for torch.to() w/o requiring copying (#67223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223

ghstack-source-id: 146482215

Test Plan:
See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421
(local is neutral: P467267554)

Reviewed By: hlu1

Differential Revision: D31776259

fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3
2022-01-04 22:36:10 -08:00
Scott Wolchok
99a10c371f [PyTorch][Static Runtime] Fix dtype changing between iterations for to() (#67394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67394

ghstack-source-id: 146464294

Test Plan:
Added new test, which failed but now passes.

Checked perf on ctr_mobile_feed local net (still not on recordio inputs yet), looks neutral

```
Stable, local
========================================

I1027 13:40:23.411118 2156917 PyTorchPredictorBenchLib.cpp:131] PyTorch predictor: number of prediction threads 1
I1027 13:40:48.708222 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16975. Iters per second: 162.081
I1027 13:41:13.915948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.1487. Iters per second: 162.636
I1027 13:41:38.984462 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11408. Iters per second: 163.557
I1027 13:42:04.138948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.13566. Iters per second: 162.982
I1027 13:42:29.342630 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.14269. Iters per second: 162.795
I1027 13:42:29.342669 2156917 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14218, standard deviation: 0.0202164
0

FixToDtypeChanges, local
========================================
I1027 13:44:59.632668 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11023. Iters per second: 163.66
I1027 13:45:24.894635 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16308. Iters per second: 162.257
I1027 13:45:50.275280 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.17868. Iters per second: 161.847
I1027 13:46:15.637431 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.18688. Iters per second: 161.632
I1027 13:46:40.670816 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.10549. Iters per second: 163.787
I1027 13:46:40.670863 2176333 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14887, standard deviation: 0.03843706
```

Reviewed By: hlu1

Differential Revision: D31972722

fbshipit-source-id: 7a445b325a29020b31dd2bd61e4171ecc2793b15
2022-01-04 22:34:49 -08:00
Peter Bell
fa09099ba3 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D33336948

Pulled By: albanD

fbshipit-source-id: 4e40371592b9a5a7e7fcd1d8cecae11ffb873113
2022-01-02 13:09:19 -08:00
Mike Iovine
6a84449290 [SR] Fast path for VarStack on scalars (#70210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70210

Add a fast-path for `VarStack` nodes for when the inputs are scalars.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarStack`

Reviewed By: hlu1

Differential Revision: D33177498

fbshipit-source-id: 922ab76a6808fbfdb8eb6091163a380344e38de6
2021-12-23 10:31:17 -08:00
soulitzer
21c6de9fdc Extend autograd functional benchmarking to run vectorized tasks (#67045)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67045

To run: `python benchmarks/functional_autograd_benchmark/functional_autograd_benchmark.py --gpu -1 --model-filter=ppl_robust_reg --num-iter 100`

```
Results for model ppl_robust_reg on task vjp: 0.0012262486852705479s (var: 2.2107682351446556e-10)
Results for model ppl_robust_reg on task vhp: 0.002099371049553156s (var: 6.906406557760647e-10)
Results for model ppl_robust_reg on task jvp: 0.001860950025729835s (var: 1.1251884146634694e-10)
Results for model ppl_robust_reg on task hvp: 0.003481731517240405s (var: 2.2713633751614282e-10)
Results for model ppl_robust_reg on task jacobian: 0.0012128615053370595s (var: 1.3687526667638394e-09)
Results for model ppl_robust_reg on task hessian: 0.009885427542030811s (var: 9.366265096844018e-09)
Results for model ppl_robust_reg on task hessian_fwdrev: 0.005268776323646307s (var: 2.4293791422991262e-09)
Results for model ppl_robust_reg on task hessian_revrev: 0.002561321249231696s (var: 7.557877101938004e-10)
Results for model ppl_robust_reg on task jacfwd: 0.002619938924908638s (var: 5.109343503839625e-10)
Results for model ppl_robust_reg on task jacrev: 0.0013469004770740867s (var: 3.1857563254078514e-09)
```
Notes:
 - We go through batched fallback for both
 - ppl_robust_reg takes 3 tensor inputs and returns a single scalar output
   - this means that jacobian is equivalent to doing vjp and vmap would not help us
   - we expect jacfwd to be slower than jacrev

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33265947

Pulled By: soulitzer

fbshipit-source-id: 14f537a1376dea7e5afbe0c8e97f94731479b018
2021-12-21 17:20:29 -08:00
Raghavan Raman
91da2d5fa1 [StaticRuntime] Refactor StaticModule to pass in sample inputs (#69473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69473

This diff refactors StaticModule and its uses to pass in sample inputs. These inputs need to be passed into the constructor because they are needed to perform TensorExpr fusion before other optimizations are performed on the input graph.
ghstack-source-id: 146059041

Test Plan: buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test

Reviewed By: donaldong

Differential Revision: D32320084

fbshipit-source-id: b8bd46d442be4cc90ca60f521e0416fdb88eea60
2021-12-21 11:20:25 -08:00
Donald Dong
24f16de987 [Static Runtime] Support native op split_with_sizes (#69999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69999

This adds support for the split_with_sizes operator in static runtime by adding native operators. Those operators have less overhead compared to their JIT fallbacks (no dispatching, no stack construction at runtime).

split_with_sizes can be called directly from cpp API, or in `torch.split`  when `split_sizes` is a list. This diff adds support for both use cases.

Test Plan:
- Added unit tests. Made sure the operators are used
- Benchmark
```
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/data/users/dxd/305797439_0.predictor.precompute.remote_request_only \
--method_name=user.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=500 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 \
--input_type="recordio" --pt_inputs=/data/users/dxd/305797439_0_user.inputs.recordio \
--recordio_use_ivalue_format=1 --do_profile=1 --do_benchmark=1
```

#### Before
```
Static runtime ms per iter: 3.62073. Iters per second: 276.187
0.0471904 ms.    1.31501%. aten::split_with_sizes (5 nodes)
```
#### After
```
Static runtime ms per iter: 3.44374. Iters per second: 290.382
0.0432057 ms.    1.34276%. aten::split_with_sizes (5 nodes, native)
```

Reviewed By: swolchok

Differential Revision: D33141006

fbshipit-source-id: feae34c4c873fc22d48a8ff3bf4d71c0e00bb365
2021-12-20 18:32:54 -08:00
Mike Iovine
65f54bc000 [SR] Optimize VarStack (#68750)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68750

There was some room for optimization in static runtime's `prim::VarStack`:

* Avoid refcount bumps - constructing the `std::vector<at::Tensor>` can be avoided by writing a custom version of `stack_out` that takes a `std::vector<at::Tensor*>` (see the sketch after this list)

* Skip the memory overlap check

* Avoid device dispatcher overhead in a few places (e.g. `tensor.unsqueeze -> at::native::unsqueeze`)
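
To illustrate the first bullet, a standalone sketch in which `std::shared_ptr` stands in for the tensor's intrusive refcount; gathering raw pointers instead of copies avoids one atomic refcount bump per input:

```
#include <iostream>
#include <memory>
#include <vector>

// shared_ptr stands in for at::Tensor's intrusive refcount.
using TensorLike = std::shared_ptr<int>;

// Gathering copies bumps every refcount (an atomic op per input)...
long sum_by_value(std::vector<TensorLike> inputs) {
  long s = 0;
  for (const auto& t : inputs) s += *t;
  return s;
}

// ...while gathering raw pointers leaves the refcounts untouched.
long sum_by_pointer(const std::vector<const TensorLike*>& inputs) {
  long s = 0;
  for (const auto* t : inputs) s += **t;
  return s;
}

int main() {
  TensorLike a = std::make_shared<int>(1);
  TensorLike b = std::make_shared<int>(2);
  std::cout << sum_by_value({a, b}) << '\n';  // refcount bumps: 2
  std::vector<const TensorLike*> ptrs{&a, &b};
  std::cout << sum_by_pointer(ptrs) << '\n';  // refcount bumps: 0
}
```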

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`

Reviewed By: swolchok

Differential Revision: D32596934

fbshipit-source-id: e8f0ccea37c48924cb4fccbfdac4e1e11da95ee0
2021-12-20 11:46:11 -08:00
Nikita Shulga
26e32988bd Revert D32596264: Codegen: TraceType only includes operators being registered
Test Plan: revert-hammer

Differential Revision:
D32596264 (e66a8ab4f5)

Original commit changeset: 2f28b62d7b99

Original Phabricator Diff: D32596264 (e66a8ab4f5)

fbshipit-source-id: 7d18c4e77ce30dd7817a95f9c39b565cb246cd12
2021-12-17 11:20:12 -08:00
Peter Bell
e66a8ab4f5 Codegen: TraceType only includes operators being registered (#68691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68691

TraceType is a sharded file, so by only including specific operator
headers, we ensure that changing one (non-method) operator only needs
one shard to be re-compiled.

This also changes all the included autograd and jit headers from
including `ATen/ATen.h` to just including `ATen/core/Tensor.h`.

Test Plan: Imported from OSS

Reviewed By: jbschlosser, malfet

Differential Revision: D32596264

Pulled By: albanD

fbshipit-source-id: 2f28b62d7b9932f30fad7daacd8ac5bb7f63c621
2021-12-17 10:35:05 -08:00
Scott Wolchok
66406ee0f7 [PyTorch][Static Runtime] Fix to() w/dtype bool (#69935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69935

Didn't realize that `AT_DISPATCH_ALL_TYPES` should really be called `AT_DISPATCH_MOST_TYPES`.
ghstack-source-id: 145661358

Test Plan:
Added test for dtype bool.

Ran CMF local_ro net:

before:

```
I1215 12:33:49.300174 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.966491. Iters per second: 1034.67
I1215 12:33:49.825570 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.94867. Iters per second: 1054.11
I1215 12:33:50.349246 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947926. Iters per second: 1054.93
I1215 12:33:50.870433 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.943779. Iters per second: 1059.57
I1215 12:33:51.393702 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947185. Iters per second: 1055.76
I1215 12:33:51.915666 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.945672. Iters per second: 1057.45
I1215 12:33:52.438475 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948407. Iters per second: 1054.4
I1215 12:33:52.965337 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95472. Iters per second: 1047.43
I1215 12:33:53.494563 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.967083. Iters per second: 1034.04
I1215 12:33:54.017879 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948945. Iters per second: 1053.8
I1215 12:33:54.017930 1606538 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.951888, standard deviation: 0.0083367
```

after:
```
I1215 12:32:35.820874 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.999845. Iters per second: 1000.15
I1215 12:32:36.343147 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944363. Iters per second: 1058.91
I1215 12:32:36.863806 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.942542. Iters per second: 1060.96
I1215 12:32:37.385459 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944677. Iters per second: 1058.56
I1215 12:32:37.905436 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941135. Iters per second: 1062.55
I1215 12:32:38.424907 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.939748. Iters per second: 1064.11
I1215 12:32:38.944643 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941764. Iters per second: 1061.84
I1215 12:32:39.463791 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.938946. Iters per second: 1065.02
I1215 12:32:39.987567 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95437. Iters per second: 1047.81
I1215 12:32:40.511204 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.959139. Iters per second: 1042.6
I1215 12:32:40.511242 1594955 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.950653, standard deviation: 0.0184761
```

Reviewed By: hlu1

Differential Revision: D33106675

fbshipit-source-id: 5bb581f8d0ed22ef08df1936dc8d67045e44e862
2021-12-15 15:26:56 -08:00
Mike Iovine
873585da2b [SR] Improve set_inputs (#69087)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69087
This diff includes a variety of improvements to `set_inputs` to unify behavior with `torch::jit::Module`:

1. Eliminate code duplication between rvalue/lvalue overloads
2. Add type checks
3. Make input length check a `TORCH_CHECK` instead of a debug check - we have to fail when the wrong number of inputs is passed.
4. `schema` now always includes `self`, even if we release `module_`. This is consistent with `torch::jit::Module`.
ghstack-source-id: 145599837

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D32711705

fbshipit-source-id: fe97c10b4f03801ba59868b452e7d02b26b3106b
2021-12-15 09:31:19 -08:00
Mike Iovine
102684b252 [SR] Fix stack/concat bug (#68777)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68777

Fixed some cases where negative dimensions were not handled correctly

* `_stack_cpu` calls `maybe_wrap_dim`, but `_stack_cpu_out` does not. This is only problematic when `_stack_cpu_out` forwards to the serial kernel: [ref](https://www.internalfb.com/code/fbsource/[1b5af978b48f2e5d308d42b588bde3275869a57b]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=1541-1547).
* concat also needs to wrap its dim (see the sketch below)
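
A minimal sketch of the wrapping rule itself (the role `maybe_wrap_dim` plays in ATen; not the actual ATen code): a negative dim is translated to `dim + rank`, and the out-variant path must apply the same translation as the functional path:

```
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Translate a possibly-negative dim into [0, rank).
int64_t wrap_dim(int64_t dim, int64_t rank) {
  if (dim < -rank || dim >= rank) throw std::out_of_range("dim out of range");
  return dim < 0 ? dim + rank : dim;
}

int main() {
  std::cout << wrap_dim(-1, 4) << '\n';  // 3: last axis of a 4-D shape
  std::cout << wrap_dim(2, 4) << '\n';   // 2: non-negative dims pass through
}
```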

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Added new tests to cover this case

Reviewed By: hlu1

Differential Revision: D32604623

fbshipit-source-id: 00aaa42817cd2d3e7606ce75ab5a9744645118cf
2021-12-14 16:26:27 -08:00
Donald Dong
f7294cd865 [Static Runtime] Skip ReplaceWithCopy when inputs have writers (#69819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69819

We should skip ReplaceWithCopy if the inputs to the operator can be updated during inference. For a set of tensors that share data, ReplaceWithCopy should not happen to any of them if there are updates to any of them.

Currently, the check in place misses some cases (e.g., when updates exist but the number of uses is <= 1). This diff addresses the missing cases by querying the AliasDB.

Test Plan:
- Added test cases, including a one that is problematic before this diff
- CI

Reviewed By: mikeiovine

Differential Revision: D33052562

fbshipit-source-id: 61f87e471805f41d071a28212f2f457e8c6785e7
2021-12-14 09:39:49 -08:00
Richard Barnes
29d759948e use irange for loops 2 (#66746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit; a number of reversions and unused-variable suppressions were added by hand.
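
A tiny standalone equivalent of what `irange` provides (a sketch, not the actual `c10::irange` implementation): a half-open integer range usable with range-for:

```
#include <cstdint>
#include <iostream>

// A minimal irange-like helper: the half-open range [0, n).
struct IRange {
  int64_t n;
  struct Iter {
    int64_t v;
    int64_t operator*() const { return v; }
    Iter& operator++() { ++v; return *this; }
    bool operator!=(const Iter& o) const { return v != o.v; }
  };
  Iter begin() const { return {0}; }
  Iter end() const { return {n}; }
};

IRange irange(int64_t n) { return {n}; }

int main() {
  // Replaces: for (int64_t i = 0; i < 3; ++i) ...
  for (const auto i : irange(3)) std::cout << i << ' ';  // prints: 0 1 2
  std::cout << '\n';
}
```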

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705361

fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
2021-12-10 04:26:23 -08:00
Mike Iovine
f87f1d08e8 [SR] assignStorageToManagedTensors returns a vector (#69568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69568

Non-empty vectors should never be passed to `assignStorageToManagedTensors` and `assignStorageToManagedOutputTensors`. Presumably, this out-variant convention was adopted to avoid move-assigning the corresponding attributes in `MemoryPlanner`. But the cost of a vector move-assign is not high, and this function type signature is safer.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: donaldong

Differential Revision: D32729289

fbshipit-source-id: 88f19de8eb89d8a4f1dd8bbd4d9e7f686e41888b
2021-12-09 17:01:48 -08:00
Don Jang
9aa1b3e396 [Static Runtime] [Code Cleanup] Encapsulate function objects within ProcessedFunction (#69595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595

This change encapsulates the `function` object within `ProcessedFunction` objects instead of exposing it unnecessarily just to execute it.

Test Plan: Existing tests

Reviewed By: mikeiovine

Differential Revision: D32908341

fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364
2021-12-09 15:11:03 -08:00
Mike Iovine
1c43b1602c [SR] Scope exit guard for memory planner deallocation (#68795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795

This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor.

Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions potentially escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, 2) even if it wasn't, `std::terminate` will be called if an exception escapes and the stack is already unwinding. To get around this, we wrap the deallocation stuff in a try/catch. If deallocation throws, then we simply reset all of the memory planner stuff and carry on.
There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here.
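
A standalone sketch of the pattern with hypothetical stand-in types: deallocation runs from the guard's destructor on every exit path, and the catch block falls back to a simpler cleanup that is audited not to throw:

```
// Stand-in for the real MemoryPlanner.
struct MemoryPlannerLike {
  void deallocate() { /* may throw in the real runtime */ }
  void reset() noexcept { /* simpler fallback path, audited not to throw */ }
};

// Scope exit guard: deallocate() runs on every exit path from run_impl.
class DeallocatorLike {
  MemoryPlannerLike& planner_;

 public:
  explicit DeallocatorLike(MemoryPlannerLike& p) : planner_(p) {}
  ~DeallocatorLike() {  // implicitly noexcept: no exception may escape
    try {
      planner_.deallocate();
    } catch (...) {
      planner_.reset();  // swallow and fall back to non-throwing cleanup
    }
  }
};

void run_impl(MemoryPlannerLike& planner) {
  DeallocatorLike guard(planner);
  // ... op execution; even if this throws, the guard still deallocates
}

int main() {
  MemoryPlannerLike planner;
  run_impl(planner);
}
```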

Test Plan:
**New unit tests**

`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D32609915

fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e
2021-12-08 16:41:52 -08:00
Mike Iovine
008469c5e2 [SR] Simplify memory re-use algorithm (#68302)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302

Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis.

Test Plan:
## **Re-use metrics**

`inline_cvr` (294738512_58)
**Before**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4601984 bytes
Total number of reused tensors: 1183
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29696 bytes
Total number of reused tensors: 959
```

**After**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4520000 bytes
Total number of reused tensors: 1198
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29120 bytes
Total number of reused tensors: 963
```

Reviewed By: hlu1

Differential Revision: D32370424

fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf
2021-12-07 13:25:42 -08:00
Don Jang
9663e08674 [Static Runtime] Fix a bug where aten::embedding_bag cannot handle resized input tensors (#69219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69219

This change fixes a bug where the `aten::embedding_bag` implementation does not adjust the size of a managed output tensor according to the given input after memory planning starts.

Test Plan: Enhanced `StaticRuntime.EmbeddingBag` to trigger the existing bug that's fixed by this change.

Reviewed By: mikeiovine

Differential Revision: D32544399

fbshipit-source-id: 0a9f1d453e96f0cfa8443c8d0b28bbc520e38b29
2021-12-03 19:01:45 -08:00
Scott Wolchok
b22e4d4aea [PyTorch][SR] Add more to() tests & extend debug logging in testStaticRuntime (#67219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67219

I found that these specific test cases were causing different failures when developing D31776259. I also found that it was difficult to debug testStaticRuntime failures, so I added more verbose logs gated behind -v 2.
ghstack-source-id: 144507287

Test Plan: Used during development of D31776259

Reviewed By: hlu1

Differential Revision: D31847566

fbshipit-source-id: ea9147fb246c345d18bbc8d7f3bfba48d3a0fab3
2021-12-02 10:34:54 -08:00
Hao Lu
ed3b73fd4d [Static Runtime] Skip ProcessedNode::verify_no_memory_overlap() for out variants (#68639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639

Fix all problems related to `ProcessedNode::verify_no_memory_overlap()`
- Only enable this check for native and fallback ops that are not inplace or view ops
- Enable `ProcessedNode::verify_no_memory_overlap()` in debug mode and enforce it
- Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas

fb::expand_dims's schema was not correct once this check was re-enabled. It's fixed in D32556204 (39ab417107)

Reviewed By: mikeiovine

Differential Revision: D32553708

fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13
2021-12-02 05:03:12 -08:00
Mike Iovine
ee4cfaa286 [SR] Add utility class to determine tensor ranges (#68284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284

Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`.
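
A standalone sketch of the underlying last-use computation (aliasing, which the real class resolves via alias analysis, is elided; names hypothetical):

```
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// A node is represented only by the names of the values it reads.
using Node = std::vector<std::string>;

int main() {
  std::vector<Node> graph = {{"a"}, {"a", "b"}, {"b", "c"}};
  std::unordered_map<std::string, std::size_t> last_use;
  for (std::size_t i = 0; i < graph.size(); ++i)
    for (const auto& v : graph[i]) last_use[v] = i;
  // availableTensors(node i) ~ values whose last use is exactly node i
  for (std::size_t i = 0; i < graph.size(); ++i) {
    std::cout << "available after node " << i << ":";
    for (const auto& [value, last] : last_use)
      if (last == i) std::cout << ' ' << value;
    std::cout << '\n';
  }
}
```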

Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: swolchok

Differential Revision: D32397207

fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6
2021-11-19 13:10:55 -08:00
Ben Koopman
c2c859bdf2 [quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D31618282

Pulled By: b-koopman

fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3
2021-11-19 06:29:19 -08:00
CodemodService FBSourceClangFormatLinterBot
143491e0ad [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D32484422

fbshipit-source-id: 5c836dc7d06f12e64cc4bb1e85d8fa4b62a29b85
2021-11-17 07:27:04 -08:00
jjsjann123
0dc3f829d9 Nvfuser code bump 11 5 (#67943)
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; we can now support dimension collapsing with non-coherent input tensors that have different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we are currently limiting memory formats to only Contiguous and Channels Last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.

Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943

Reviewed By: ngimel

Differential Revision: D32288709

Pulled By: dzhulgakov

fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
2021-11-17 01:22:17 -08:00
Don Jang
aa9ee8d02a [Static Runtime] Avoid copying function objects per StaticRuntime instance (#68368)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368

Currently, each instance of `StaticRuntime` has its own copy of a `std::function` object wrapped in a `ProcessedNode::Function` object, in order to invoke the actual operation implementation.

However, all instances of `StaticRuntime` derived from the same `StaticModule` invoke exactly the same op implementations, so this duplication is avoidable.

This change adds a `StaticModule::functions_` member variable to keep a list of unique `ProcessedFunction` instances. A newly constructed `StaticRuntime` takes pointers to these `ProcessedFunction`s instead of whole function objects. This saves a substantial amount of memory per `StaticRuntime` instance.

This comes with a small sacrifice in execution time: now that a `ProcessedNode` instance keeps a pointer to the function object, executing a node involves an extra pointer dereference. However, this cost proved negligible in local performance tests.
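
A standalone sketch of the sharing scheme with hypothetical stand-in types: the module owns one function object per op, every runtime instance stores only pointers, and each execution pays one extra dereference:

```
#include <functional>
#include <iostream>
#include <vector>

struct ProcessedFunction {
  std::function<void()> f;  // wraps the actual op implementation
};

// The module owns exactly one ProcessedFunction per op...
struct StaticModuleLike {
  std::vector<ProcessedFunction> functions_;
};

// ...and each node in each runtime instance stores only a pointer to it.
struct ProcessedNodeLike {
  const ProcessedFunction* fn;
  void run() const { fn->f(); }  // one extra dereference per execution
};

int main() {
  StaticModuleLike mod;
  mod.functions_.push_back({[] { std::cout << "running op\n"; }});
  // Two runtime instances derived from the same module share the function:
  ProcessedNodeLike runtime1_node{&mod.functions_[0]};
  ProcessedNodeLike runtime2_node{&mod.functions_[0]};
  runtime1_node.run();
  runtime2_node.run();
}
```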

Thanks to hlu1 for proposing this non-intrusive improvement idea :D

Test Plan:
This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) for CMF/local (and 8% for CMF/local_ro); this was measured by patching D32181666 to print the memory turnover from instantiating a StaticRuntime instance. No noticeable latency regression was observed.

==AFTER

* CMF/local
memory turnover: 393608
latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087

* CMF/local_ro
memory turnover: 387288
latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101

==BEFORE

* CMF/local
memory turnover: 459888
latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18

* CMF/local_ro
memory turnover: 420832
latency: PyTorch run finished. Milliseconds per iter: 7.43756. Iters per second: 134.453

==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr:

==AFTER

Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)

Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)

==BEFORE

Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)

Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)

Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)

Reviewed By: swolchok

Differential Revision: D32337548

fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a
2021-11-16 20:28:48 -08:00
Michael Suo
5c3529a86d [lint] small pass to make lint clean (#68367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367

- bmm_test.py was using syntax not allowed in 3.6
- Some suppressions were not placed on the correct line.

With this file,
```
lintrunner --paths-cmd='git grep -Il .'
```
passes successfully.

Test Plan: Imported from OSS

Reviewed By: janeyx99, mrshenli

Differential Revision: D32436644

Pulled By: suo

fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2
2021-11-16 10:27:00 -08:00
Scott Wolchok
639258499f [PyTorch][Static Runtime] Add & use "small array" for ProcessedNodeInputs (#67935)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935

The rationale is documented in code comments. In short, we can avoid heap-allocating arrays of input indexes for operators with 5 or fewer inputs, at the cost of a tag-bit check on access.
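
A standalone sketch of the small-array idea (the real `ProcessedNodeInputs` packs this more tightly):

```
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <iostream>

class SmallInputs {
  static constexpr uint8_t kInlineCap = 5;
  uint8_t size_;
  bool heap_;  // the "tag bit": which union member is active
  union {
    uint16_t inline_[kInlineCap];
    uint16_t* heap_arr_;
  };

 public:
  explicit SmallInputs(std::initializer_list<uint16_t> xs)
      : size_(static_cast<uint8_t>(xs.size())), heap_(xs.size() > kInlineCap) {
    uint16_t* dst = heap_ ? (heap_arr_ = new uint16_t[size_]) : inline_;
    std::size_t i = 0;
    for (uint16_t x : xs) dst[i++] = x;
  }
  ~SmallInputs() {
    if (heap_) delete[] heap_arr_;
  }
  SmallInputs(const SmallInputs&) = delete;
  SmallInputs& operator=(const SmallInputs&) = delete;

  uint16_t operator[](std::size_t i) const {  // tag check on every access
    return heap_ ? heap_arr_[i] : inline_[i];
  }
};

int main() {
  SmallInputs small{1, 2, 3};         // fits inline: no heap allocation
  SmallInputs big{1, 2, 3, 4, 5, 6};  // 6 > 5: falls back to a heap array
  std::cout << small[2] << ' ' << big[5] << '\n';  // prints: 3 6
}
```
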
ghstack-source-id: 143429112

Test Plan:
Patched d1jang's D32181666, which prints static runtime memory usage.

Previous diff, local:

```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```

This diff, local:

```
I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064
```
4.5% savings (16144 bytes)

Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff.

Reviewed By: hlu1

Differential Revision: D32216573

fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c
2021-11-16 10:21:12 -08:00
Scott Wolchok
6acde23bec [PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934

This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x fewer than that), we only need to store a 2-byte offset into that array and a 2-byte output count in each ProcessedNode.
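
A standalone sketch of the compact representation with stand-in types:

```
#include <cstdint>
#include <iostream>
#include <vector>

struct IValueLike { int v; };

// 4 bytes per node instead of a pointer-sized vector of outputs.
struct NodeOutputs {
  uint16_t offset;       // index of this node's first output
  uint16_t num_outputs;  // capped at 2**16 - 1 values overall
};

int main() {
  std::vector<IValueLike> all_outputs(100);  // owned by the runtime
  NodeOutputs n{10, 2};                      // this node's outputs: [10, 12)
  for (uint16_t i = 0; i < n.num_outputs; ++i)
    all_outputs[n.offset + i].v = i;
  std::cout << all_outputs[11].v << '\n';    // prints 1
}
```
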
ghstack-source-id: 143429113

Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.

Previous diff, CMF local:

```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```

This diff, CMF local:

```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```

72912 bytes (17%) savings

Perf looks neutral; see next diff (D32216573) test plan for details.

Reviewed By: hlu1

Differential Revision: D32190751

fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
2021-11-16 10:19:50 -08:00
Don Jang
9cb65df79f [Static Runtime] Fallback to disabling manage_output_tensors instead of crashing when wrong API is used (#67939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939

With `manage_output_tensor` enabled, a client of `StaticRuntime` is required to call it via `PyTorchPredictor::predict_managed_result`. If the client uses `PyTorchPredictor::operator()` instead, it will crash (intended behavior, to avoid leaking the memory of managed output tensors). This mistake can cause a catastrophic failure in production (e.g., via gatekeeper or config changes).

Considering the complexity in how `PyTorchPredictor` is used in different settings, the chance that this bug hits production is non-zero.

This change introduces `StaticRuntime::disableManageOutputTensor` to disable the `manage_output_tensor` feature, instead of crashing, when a client mistakenly uses `PyTorchPredictor::operator()`. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can safely pass non-managed output tensors to the client.

A slight perf degradation is expected from forcefully disabling `manage_output_tensors`, but the robustness gained outweighs the risk of crashing at a high rate in production.

Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code.

Reviewed By: swolchok

Differential Revision: D32219731

fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf
2021-11-11 17:31:07 -08:00
Hao Lu
47bc47f2b9 [SR] Add runtime check to correct bad schema alias info (#67825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825

The comment explains how it works.

Test Plan:
A small regression to local and local_ro if we only enable it for fallback ops.
```
## local_ro
# before
I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247
I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319
I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759
I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619

# after
# # only enable for fall back ops: 0.7%
I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481
I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262
I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51
I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424

# enable for all (no clone): 4.7%
I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005
I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14
I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. Iters per second: 781.506
I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765

# local
# before
I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233
I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641
I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735
I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206
# after
# # only enable for fall back ops: 0.1%
I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349
I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601
I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659
I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658

# enable for all (no clone): 0.9%
I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783
I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611
I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. Iters per second: 50.3037
I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498
```

Reviewed By: d1jang

Differential Revision: D32124812

fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa
2021-11-10 19:35:11 -08:00
Mike Iovine
ecd5b1a8d4 [SR] Native implementation for aten::split (#67476)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31994040

fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a
2021-11-10 10:23:03 -08:00
Hao Lu
1b2a366932 [SR] Enforce checks for resizing of the internal buffer in MemoryPlanner in unit tests (#67941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67941

I just found out that, due to the rounding up of Tensor storage sizes to multiples of 64 bytes, resizing is not actually triggered for a lot of our unit tests (23 OSS, 16 internal). Now they should all be fixed. Also moved a bunch of tests to `test_static_module.cc` so that `test_static_runtime.cc` now only contains operator tests.

From now on, by default, if `args2` is passed to `test_static_runtime`, it checks at the end of the second iteration that the managed buffer's size is bigger than the previous size and enforces that. You can bypass the check for ops with constant output sizes, such as `aten::sum` without `dim` passed in.

Test Plan:
Facebook
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```

Reviewed By: swolchok

Differential Revision: D32196204

fbshipit-source-id: 8425d9efe6b9a1c1e3807e576b1143efd7561c71
2021-11-09 16:07:40 -08:00
David Berard
b546cdf401 [SR] Out variant for prim::NumToTensor (#67856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67856

Returns a tensor constructed from a scalar input

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Ran
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=*NumToTensorScalar* --v=1
```
and the output contains `Switch to out variant for node: %2 : Tensor = prim::NumToTensor(%0)`.

Reviewed By: mikeiovine

Differential Revision: D32014194

fbshipit-source-id: e7df65ea1bf05d59c1fc99b721aee420e484f542
2021-11-08 09:02:58 -08:00
Mike Iovine
5bc89275dd [SR] Eliminate no-ops (#67437)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437

Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this.
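A simplified sketch of the general shape of such a pass against the public JIT IR (not the actual SR/freezing implementation, which must also respect aliasing corner cases; see the aten::__is__ discussion further down):

```
#include <torch/csrc/jit/ir/ir.h>

// Simplified sketch: remove a forward-pass no-op by rewiring its output
// to its input and destroying the node.
void eliminateForwardNoOps(std::shared_ptr<torch::jit::Graph>& graph) {
  const auto detach = c10::Symbol::fromQualString("aten::detach");
  for (auto it = graph->nodes().begin(); it != graph->nodes().end();) {
    torch::jit::Node* node = *it;
    ++it; // advance first so destroying `node` doesn't invalidate `it`
    if (node->kind() == detach) {
      node->output()->replaceAllUsesWith(node->input(0));
      node->destroy();
    }
  }
}
```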

Test Plan: `buck test caffe2/test:jit -- test_freezing`

Reviewed By: hlu1

Differential Revision: D31980843

fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0
2021-11-08 08:42:33 -08:00
Bert Maher
4b084bc832 Benchmarks for various fusers (#67622)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67622

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D32171063

Pulled By: bertmaher

fbshipit-source-id: 40d3a7adcc52aba3b051e382ec5ec4ee7e43d81b
2021-11-04 18:57:17 -07:00
Hao Lu
938bab0bfd [PyTorch] Add int version of vectorized PrefixSum to Benchmark (#67865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67865

- Add int version of vectorized PrefixSum
- Use unaligned load/store instructions
- Add exclusive scan version. "exclusive" means that the i-th input element is not included in the i-th sum. For details see https://en.cppreference.com/w/cpp/algorithm/exclusive_scan
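A minimal C++17 sketch of the two scan semantics (the benchmarked kernels are hand-vectorized; this only illustrates inclusive vs. exclusive):

```
#include <numeric>
#include <vector>

int main() {
  std::vector<int> in{1, 2, 3, 4};
  std::vector<int> inc(in.size()), exc(in.size());
  // Inclusive: the i-th input IS included in the i-th sum.
  std::inclusive_scan(in.begin(), in.end(), inc.begin());    // 1 3 6 10
  // Exclusive: the i-th input is NOT included in the i-th sum.
  std::exclusive_scan(in.begin(), in.end(), exc.begin(), 0); // 0 1 3 6
  return 0;
}
```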

Test Plan:
```
buck build mode/opt-clang //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 numactl -m 0 -C 5 \
./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench
```

For full benchmark results, see P465274613

```
PrefixSumBench/LocalInt/64                            57 ns         56 ns   12414048 GB/s=9.06239G/s
PrefixSumBench/LocalInt/256                          221 ns        221 ns    3160853 GB/s=9.28635G/s
PrefixSumBench/LocalInt/1024                         818 ns        817 ns     857922 GB/s=10.0235G/s
PrefixSumBench/LocalInt/4096                        3211 ns       3210 ns     217614 GB/s=10.2093G/s
PrefixSumBench/LocalInt/16384                      12806 ns      12804 ns      54805 GB/s=10.2364G/s
PrefixSumBench/LocalInt/65536                      51115 ns      51079 ns      13741 GB/s=10.2643G/s
PrefixSumBench/LocalInt/262144                    205974 ns     205912 ns       3401 GB/s=10.1847G/s
PrefixSumBench/LocalInt/1048576                   829523 ns     828859 ns        845 GB/s=10.1207G/s
PrefixSumBench/LocalIntAVX2/64                        45 ns         45 ns   15568113 GB/s=11.3549G/s
PrefixSumBench/LocalIntAVX2/256                      208 ns        208 ns    3371174 GB/s=9.86913G/s
PrefixSumBench/LocalIntAVX2/1024                     893 ns        892 ns     783154 GB/s=9.18629G/s
PrefixSumBench/LocalIntAVX2/4096                    3618 ns       3613 ns     193834 GB/s=9.06838G/s
PrefixSumBench/LocalIntAVX2/16384                  14416 ns      14411 ns      48564 GB/s=9.09543G/s
PrefixSumBench/LocalIntAVX2/65536                  57650 ns      57617 ns      12156 GB/s=9.09952G/s
PrefixSumBench/LocalIntAVX2/262144                230855 ns     230612 ns       3035 GB/s=9.09386G/s
PrefixSumBench/LocalIntAVX2/1048576               924265 ns     923777 ns        758 GB/s=9.08077G/s
PrefixSumBench/LocalIntAVX512/64                      23 ns         23 ns   24876551 GB/s=22.0697G/s
PrefixSumBench/LocalIntAVX512/256                     95 ns         95 ns    7387386 GB/s=21.556G/s
PrefixSumBench/LocalIntAVX512/1024                   435 ns        435 ns    1609682 GB/s=18.8425G/s
PrefixSumBench/LocalIntAVX512/4096                  1815 ns       1815 ns     385462 GB/s=18.0561G/s
PrefixSumBench/LocalIntAVX512/16384                 7479 ns       7476 ns      93660 GB/s=17.5335G/s
PrefixSumBench/LocalIntAVX512/65536                30171 ns      29879 ns      23430 GB/s=17.5468G/s
PrefixSumBench/LocalIntAVX512/262144              125805 ns     125631 ns       5570 GB/s=16.6929G/s
PrefixSumBench/LocalIntAVX512/1048576             504216 ns     503983 ns       1384 GB/s=16.6446G/s
PrefixSumBench/ExclusiveScanIntAVX512/64              23 ns         23 ns   30058295
PrefixSumBench/ExclusiveScanIntAVX512/256            101 ns        101 ns    7398498
PrefixSumBench/ExclusiveScanIntAVX512/1024           435 ns        434 ns    1403877
PrefixSumBench/ExclusiveScanIntAVX512/4096          1979 ns       1978 ns     354016
PrefixSumBench/ExclusiveScanIntAVX512/16384         7828 ns       7819 ns      89551
PrefixSumBench/ExclusiveScanIntAVX512/65536        31206 ns      31192 ns      22408
PrefixSumBench/ExclusiveScanIntAVX512/262144      130106 ns     130023 ns       5388
PrefixSumBench/ExclusiveScanIntAVX512/1048576     525515 ns     524976 ns       1244
```

Reviewed By: navahgar, swolchok

Differential Revision: D32011740

fbshipit-source-id: 7962de710bd588291dd6bf0c719f579c55f7c063
2021-11-04 14:00:19 -07:00
Bin Wen
1baed45c6b [fbcode][static runtime] out-variant for quantized::linear_dynamic_fp16 (#67663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67663

Mostly follows the example of quantized::linear (D28428734 (4d7abdbdad)) to enable an out variant for quantized::linear_dynamic_fp16.

The motivation: during the MP tab CTR PyTorch model migration, we observed that the quantized::linear_dynamic_fp16 operator has the highest cost but does not have an out variant enabled yet: https://fburl.com/phabricator/b5juus2d

Test Plan:
buck build mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench

  sudo watch -n 20 /usr/local/fbprojects/dynamoserver/bin/turboDriver disable

  MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/home/bwen/models/991103061_4/991103061_4.predictor --pt_inputs=/home/bwen/models/991103061_4/pt_inputs --method_name=forward --pt_cleanup_activations=1 --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=1000 --num_threads=1 --repetitions=3 --do_profile=1 --do_benchmark=1 --set_compatibility=1 --compare_results=1 --pt_enable_static_runtime 2>&1 | pastry

before: P465201159

  0.929067 ms.     31.808%. quantized::linear_dynamic_fp16 (16 nodes)
  0.921679 ms.    31.7324%. quantized::linear_dynamic_fp16 (16 nodes)
  0.919127 ms.    31.7404%. quantized::linear_dynamic_fp16 (16 nodes)

after: P465203015

  0.90898 ms.    31.0205%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
  0.9127 ms.      30.62%. quantized::linear_dynamic_fp16 (16 nodes, out variant)
  0.879148 ms.    31.0161%. quantized::linear_dynamic_fp16 (16 nodes, out variant)

The unit test logic follows https://fburl.com/code/vv0rry13

  buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: hlu1

Differential Revision: D32001168

fbshipit-source-id: 873d9f77434b9c4bafb298c871173f9a560dd2a3
2021-11-03 22:39:04 -07:00
Hao Lu
89b02fc70b [StaticRuntime][Easy] Correct typos in test_static_runtime (#67739)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67739

Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: mikeiovine

Differential Revision: D32125879

fbshipit-source-id: bd989e5088edff87624b858bd9045dfe9da3fbe7
2021-11-03 13:24:46 -07:00
Shashank Chaudhry
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
Scott Wolchok
82f7f8d471 [PyTorch] Adopt IValue::toTupleRef() where obvious (#65505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65505

Generated with

`fastmod -m 'toTuple\(\)(\s*)->' 'toTupleRef()${1}.'`

, followed by

`fastmod '(std::move\(.*)toTupleRef\(\).' '${1}toTuple()->'`

to unbreak 2 callsites.
ghstack-source-id: 142065835

Test Plan: CI

Reviewed By: gchanan

Differential Revision: D31131025

fbshipit-source-id: 54457ae5bbeb38db9c7f196d469b98521c3d3f34
2021-11-02 10:22:18 -07:00
Mike Iovine
39ad7b670e [SR] Native implementation for aten::squeeze (#67441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31992093

fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e
2021-11-01 08:22:57 -07:00
Mike Iovine
0d7cf825fc [SR] Drop support for aten::__is__ and aten::__isnot__ (#67550)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67550

`aten::__is__` and `aten::__isnot__` are extremely problematic for a large number of SR graph optimizations.

Some examples:

- Removing ops that are no-ops in the forward pass like `aten::detach`. This would normally be trivial, but `is` introduces corner cases like this:
```
def forward(x):
    y = x.detach()
    return x is y
```
We get `False` before optimizations. But after optimizations, the test becomes `x is x`, and we get `True`.

- `ReplaceWithCopy`: the pass that replaces ops like `aten::to` with an out variant that copies its input. The following graph returns `True` before optimizations, but `False` afterwards
```
def forward(x):
    y = x.to(x.dtype)
    return x is y
```

- And many more; `FuseListUnpack` can break too

Since these ops are not used by 99.99% of users, rejecting them so we don't have to think about these corner cases is not a big deal.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D32022584

fbshipit-source-id: d135938edb2299c9b8f9511afac2bf568578879e
2021-11-01 04:45:14 -07:00
Mike Iovine
354363b57a [SR] Native implementation for aten::size (#67346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: d1jang

Differential Revision: D31965159

fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5
2021-10-28 14:18:17 -07:00
Mike Iovine
afb8434440 [SR] Native implementation for aten::view (#67341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341

Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31962589

fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435
2021-10-28 13:37:46 -07:00
Bin Wen
6900aacf54 [fbcode] Fix operator_benchmark with jit mode (#67382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382

Two simple updates:

* Fix running the benchmark with --use_jit. Previously it would fail with the error:

  torch.jit.frontend.UnsupportedNodeError: import statements aren't supported:
  File "/proc/self/fd/3/bmm_test.py", line 9
  def __invoke_main():
    import ctypes
    ~~~~~~ <--- HERE
    import ctypes.util
    import errno

* Add matmul to the bmm benchmark, as in D31837588

Test Plan:
buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test --  --forward_only=True --mkl_num_threads=1 --omp_num_threads=1
 --use_jit=True

Reviewed By: ShijunK

Differential Revision: D31960528

fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f
2021-10-28 08:48:10 -07:00
Mike Iovine
7da9c4ed2e [SR] NNC out variant for aten::where (#67255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67255

Add an out variant for `aten::where`.

Since this op can be implemented quite trivially in NNC with `ifThenElse`, I added an NNC kernel as well.

Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: navahgar

Differential Revision: D31923886

fbshipit-source-id: b4379ee3aaf31a000e626b4caeafd3e3f3d60837
2021-10-28 06:48:22 -07:00
Hao Lu
9ebc6357b3 [SR] Vectorize int version of fmod (#67313)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67313

Reviewed By: swolchok

Differential Revision: D31889868

fbshipit-source-id: a0af399431a0d672fa56cf2f2ba6d548c47bcedd
2021-10-27 17:02:53 -07:00
Mike Iovine
a0495b3cdb [SR] Remove unused operator() overload (#67001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001

The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.

This new overload will probably take a forwarding reference, so to avoid problems with overloading on a forwarding reference and to simplify the interface, it's best to remove this unused one.

Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`

`buck test caffe2/test:static_runtime`

Reviewed By: hlu1

Differential Revision: D31821990

fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
2021-10-25 08:18:58 -07:00
Mike Iovine
f2582a59d0 [SR] Add rvalue overload for operator() (#66648)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648

Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:

1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```

The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
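A minimal sketch of that forwarding pattern with simplified types (not the actual StaticRuntime interface):

```
#include <utility>
#include <vector>

// Simplified sketch: one template operator() dispatches to the copy or
// move overload of set_inputs depending on the caller's value category.
struct RuntimeSketch {
  void set_inputs(const std::vector<int>& args) { inputs_ = args; }       // copies
  void set_inputs(std::vector<int>&& args) { inputs_ = std::move(args); } // moves

  template <class IValueList>
  void operator()(IValueList&& args) {
    // std::forward preserves lvalue/rvalue-ness, so rvalue arguments
    // avoid the copy (and, with real IValues, the refcount bumps).
    set_inputs(std::forward<IValueList>(args));
  }

  std::vector<int> inputs_;
};
```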

Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.

Reviewed By: hlu1

Differential Revision: D31659973

fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
2021-10-22 10:51:47 -07:00
Aditya Pillai
40a8a50913 Add static_runtime::fused_equally_split (#2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch-canary/pull/2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66881

Adds the `static_runtime::fused_equally_split` operator and removes the `is_fused` logic from the original operator. Modifies `FuseUnpackListV2` to map `fb::equally_split` to this new operator.

Test Plan:
```
adityapillai@5960 /data/sandcastle/boxes/fbsource/fbcode 1m 13s
❯ buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```
and sandcastle

Reviewed By: mikeiovine

Differential Revision: D31742293

fbshipit-source-id: 60b35589c8817719b005d49811f575b6590d1c39
2021-10-22 10:26:49 -07:00
Don Jang
18bbc4c2b7 [Static Runtime] Fix a bug in aten::index (#66940)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66940

`aten::index`'s schema is as follows:

```
"aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
```

The current implementation assumes that `indices`' elements are all tensors by doing `elem.toTensor`, which is incorrect. This change creates an empty optional value if an element of `indices` is not a tensor.
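A sketch of the corrected conversion, using a hypothetical helper (not the exact SR code):

```
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

// Hypothetical helper illustrating the fix: a None element becomes an
// empty optional instead of being force-converted with toTensor().
c10::List<c10::optional<at::Tensor>> makeIndices(
    const c10::List<c10::IValue>& elems) {
  c10::List<c10::optional<at::Tensor>> indices;
  for (c10::IValue elem : elems) {
    if (elem.isNone()) {
      indices.push_back(c10::nullopt); // previously: elem.toTensor() -> crash
    } else {
      indices.push_back(elem.toTensor());
    }
  }
  return indices;
}
```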

Test Plan: Fixed `StaticRuntime, IndividualOps_Index` to correctly test `aten::index` with `indices` that contains `None`.

Reviewed By: hlu1

Differential Revision: D31712145

fbshipit-source-id: be1c29674bcd55b67b0dcc2a988bc37fd43745f3
2021-10-20 15:51:21 -07:00
lezcano
0974215c4d Prefer mT and mH over transpose(-2, -1) and transpose(-2, -1).conj() (#64181)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64181

This PR replaces all the calls to:
- `transpose(-2, -1)` or `transpose(-1, -2)` by `mT()` in C++ and `mT` in Python
- `conj().transpose(-2, -1)` or `transpose(-2, -1).conj()` or `conj().transpose(-1, -2)` or `transpose(-1, -2).conj()` by `mH()` in C++ and `mH` in Python.

It also simplifies two pieces of code, and fixes one bug where a pair
of parentheses was missing in the function `make_symmetric_matrices`.
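Written out in C++ form, the mechanical rewrite is:

```
#include <ATen/ATen.h>

// The replacements this PR applies, as before/after pairs:
at::Tensor mat_transpose_old(const at::Tensor& A) { return A.transpose(-2, -1); }
at::Tensor mat_transpose_new(const at::Tensor& A) { return A.mT(); }
at::Tensor mat_adjoint_old(const at::Tensor& A)  { return A.transpose(-2, -1).conj(); }
at::Tensor mat_adjoint_new(const at::Tensor& A)  { return A.mH(); }
```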

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31692896

Pulled By: anjali411

fbshipit-source-id: e9112c42343663d442dc5bd53ff2b492094b434a
2021-10-18 13:02:25 -07:00
Xue Li
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision:
D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
Richard Barnes
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modifications to exclude all files under /torch/jit, plus a number of reversions and unused-variable suppressions added by hand.
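The rewrite in compilable form:

```
#include <c10/util/irange.h>
#include <cstdint>

int64_t sum_upto(int64_t n) {
  int64_t acc = 0;
  // Before: for (int64_t i = 0; i < n; i++) acc += i;
  // After: i is const, and its type is deduced from n's type.
  for (const auto i : c10::irange(n)) {
    acc += i;
  }
  return acc;
}
```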

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
Vasiliy Kuznetsov
d802877dfa speed up quantized interpolate for channels last (#66525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525

This should solve https://github.com/pytorch/pytorch/issues/60015

There were two `q_zero_point()` accesses inside a for loop, which was
expensive. Moving them before the loop sped things up 10x on a
microbenchmark.
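The fix is plain loop-invariant hoisting; a sketch with hypothetical names (not the actual interpolate kernel):

```
#include <ATen/ATen.h>

// Sketch of the hoisting: q_zero_point() is a comparatively expensive
// accessor, so evaluating it per element adds up. Read it once instead.
void subtractZeroPoint(const at::Tensor& q, int32_t* out,
                       const int32_t* in, int64_t n) {
  const int64_t zp = q.q_zero_point(); // hoisted out of the loop
  for (int64_t i = 0; i < n; ++i) {
    out[i] = static_cast<int32_t>(in[i] - zp);
  }
}
```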

Test Plan:
```
// comment out benchmarks unrelated to original issue, for simplicity
cd benchmarks/operator_benchmark
python -m pt.qinterpolate_test

// before: 2994 us
// after: 324 us
// full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31592422

fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459
2021-10-14 08:11:26 -07:00
Hao Lu
6634570aef [SR] Fix bug in ValueGroup (#66470)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66470

Reviewed By: d1jang

Differential Revision: D31566348

fbshipit-source-id: e0f634af77d893bbc8d66f214b2b8bdd6ab58cc3
2021-10-13 19:26:38 -07:00
Scott Wolchok
d30397d42a [PyTorch][Static Runtime] Don't use vector in ProcessedNode (#65429)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429

The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
ghstack-source-id: 140532722

Test Plan:
CI

I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.

Reviewed By: hlu1

Differential Revision: D31096042

fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
2021-10-13 19:13:20 -07:00
Mike Iovine
37db650c9c [Static Runtime] Clone test does not use uninitialized memory (#66557)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66557

The test was previously using `at::empty_strided` to initialize one of its inputs. The contents of the tensor returned by this function are random, uninitialized memory. If we happened to get a NaN, this test would fail since `use_equalnan` was not set.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31611961

fbshipit-source-id: 79a9476d0d6ce7a9f1412eefcef19bc2618c54b8
2021-10-13 14:02:34 -07:00
Don Jang
736fa09a9a [Static Runtime] Manage output tensors (#65515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515

This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:

- At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensor objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., a tuple).
- The 2nd+ iterations preallocate a slab of memory for all output tensors identified during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, completes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them. This mandates that SR cannot be reentered until `deallocateOutputTensors` is called by the client.
- In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camelCase for function names, so all newly introduced functions use camelCase despite inconsistencies with snake_case. We can gradually fix the inconsistencies.

This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed.
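The client-side contract described above, sketched with simplified plumbing (argument handling elided; not a verbatim API example):

```
#include <torch/csrc/jit/runtime/static/impl.h>

// Sketch: run one inference, consume the outputs, then release the managed
// output tensors. SR must not be reentered before deallocateOutputTensors().
void runOnce(torch::jit::StaticRuntime& runtime,
             std::vector<c10::IValue> args) {
  c10::IValue out = runtime(args, {});
  // ... use `out` and any managed output tensors it references ...
  runtime.deallocateOutputTensors(); // mandatory before the next call
}
```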

Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.

- Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing test cases (e.g., `StaticRuntime.EmbeddingBag`).

Reviewed By: hlu1

Differential Revision: D31049221

fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
2021-10-11 09:50:54 -07:00
Don Jang
416f593080 [Static Runtime] Group graph nodes into input aliases & output aliases (#65517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517

This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:

- input_aliases: values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are internally created and consumed within the graph.

`output_aliases` is the only new group introduced by this change, and a following diff will use it to preallocate output Tensors to improve Static Runtime's performance.

Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.

Reviewed By: hlu1

Differential Revision: D30940955

fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
2021-10-07 14:35:12 -07:00
Mike Iovine
d5f64afc38 [Static Runtime] Support aten::to.prim_dtype overload (#64928)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64928

Added support this overload of `aten::to`:
```
aten::to.prim_dtype(Tensor(a) self, int? dtype, bool non_blocking=False, bool copy=False) -> Tensor(a|b)
```

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_to`

Reviewed By: hlu1

Differential Revision: D30901398

fbshipit-source-id: 38ce807c30185e92dd472b404b362f22ac7e4efb
2021-10-07 10:22:44 -07:00
Mike Iovine
6d7fab5929 [Static Runtime][easy] Clone scripts do not use aten::add (#66161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66161

`aten::add` is not guaranteed to be bit exact with the JIT interpreter. This was causing non-deterministic test failures on master.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31406764

fbshipit-source-id: d968cb1bdb8f33934682ef3712a1341a3aacf18e
2021-10-06 12:37:39 -07:00
Alexandr Guzhva
b8e1999253 [quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183

Add a GPU benchmark for fakeQuant, similar to #65241
ghstack-source-id: 139810414

Test Plan: https://pxl.cl/1QjJM

Reviewed By: b-koopman

Differential Revision: D31288158

fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d
2021-10-06 08:07:42 -07:00
Mike Iovine
ed50fa2513 [Static Runtime] Test isOptimizableContainerType and getAlwaysAliveValues (#65849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849

Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D31282901

fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9
2021-10-04 20:46:07 -07:00
Nikita Shulga
4c4525fa5c Compile without -Wno-unused-variable (take 2) (#66041)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Do not delete `caffe2::OperatorBase::Output` calls as they have side effects

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041

Reviewed By: ngimel

Differential Revision: D31360142

Pulled By: malfet

fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
2021-10-04 20:39:39 -07:00
Don Jang
89ed9bdaee [Static Runtime] Fix bug of creating output aliases in aten::embedding_bag (#65516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65516

This change fixes a bug that Static Runtime's `aten::embedding_bag` out variant implementation creates aliases in its managed output tensors.

Managed output tensors should never alias each other, since writing to one can unintentionally overwrite another's contents. This exact problem was causing the bug in T97393697, making SR return wrong values.

This bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (3fb33b38b9)), but wasn't found until now since our testing didn't include running the model in debug mode. Fortunately this bug is not hitting production, since the aliased outputs are not used there.

This change fixes the root cause in `_embedding_bag_cpu_impl_out` by replacing alias creation with copying.
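The shape of that fix, as a hypothetical sketch (not the actual `_embedding_bag_cpu_impl_out` code):

```
#include <ATen/ATen.h>

// Hypothetical sketch: instead of letting a managed output become a view
// of another tensor (an alias), materialize an independent copy into it.
void writeManagedOutput(at::Tensor& managed_out, const at::Tensor& src) {
  // Buggy pattern: managed_out = src;  // managed_out now aliases src
  managed_out.resize_(src.sizes());
  managed_out.copy_(src); // independent storage, safe to manage
}
```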

Note that this change also includes a fundamental change in Static Runtime's unit testing: `testStaticRuntime` exercises the given graph 3 times:
 1. profile run
 2. run using the profile to allocate managed tensors
 3. reuse the managed tensors -- newly added

Adding 3 reveals this bug with a new unittest `EmbeddingBagWithManagedOutput`.

Test Plan:
- Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248).

- Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future.

Reviewed By: hlu1

Differential Revision: D31104345

fbshipit-source-id: 7bddf9cd82b400d18d8ce1bf15e29b815ef9ba8f
2021-10-03 15:10:58 -07:00
Nikita Shulga
e4ee5ca698 Revert D31326599: [pytorch][PR] Compile without -Wno-unused-variable
Test Plan: revert-hammer

Differential Revision:
D31326599 (a6280ab653)

Original commit changeset: 924155f1257a

fbshipit-source-id: b8ee5bc0298637443232f5ee9ec79e51ed256faf
2021-10-01 20:40:47 -07:00
Nikita Shulga
a6280ab653 Compile without -Wno-unused-variable (#65954)
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`

Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954

Reviewed By: ngimel

Differential Revision: D31326599

Pulled By: malfet

fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
2021-10-01 17:40:47 -07:00
Scott Wolchok
ffede499b2 [PyTorch][Static Runtime] Fast path for contiguous to_copy (#65499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499

When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc.
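A sketch of what such a fast path looks like (simplified; the real implementation handles more cases):

```
#include <ATen/ATen.h>
#include <cstring>

// Sketch: when both tensors are contiguous with matching dtype and element
// count, a raw memcpy replaces the dispatched, TensorIterator-based copy_.
void fastToCopy(at::Tensor& dst, const at::Tensor& src) {
  if (src.is_contiguous() && dst.is_contiguous() &&
      src.dtype() == dst.dtype() && src.numel() == dst.numel()) {
    std::memcpy(dst.data_ptr(), src.data_ptr(), src.nbytes());
  } else {
    dst.copy_(src); // general slow path
  }
}
```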
ghstack-source-id: 139549027

Test Plan:
Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options).

Before:

I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252
I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871
I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916
I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093
I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971

mean: 7.144172, stddev: 0.1283

After:

I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122
I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579
I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033
I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833
I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85

mean: 6.800632, stddev: 0.013227

Looks like about a 5.3% improvement.

Reviewed By: hlu1

Differential Revision: D31125492

fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823
2021-10-01 12:13:33 -07:00
Vasiliy Kuznetsov
e3af4be963 pytorch quantization ao migration phase 2: caffe2/benchmark (#65833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833

Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks`
folder.

```
find caffe2/benchmarks/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g"
```

Test Plan: CI

Reviewed By: z-a-f

Differential Revision: D31275963

fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116
2021-10-01 06:17:36 -07:00
Mikhail Zolotukhin
3a0165da49 [TensorExpr] Port NNC lowerings to the new registry mechanism. (#65551)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551

Previously we had a big switch on Op kind to decide how to lower a given
JIT operator to NNC. This PR changes this switch to a hash table lookup.

Why? This helps us with at least two things:
1) With this approach we can easily check if we know how to handle a
given node in advance - i.e. we can inspect the entire graph and tell
whether it's possible to compile it or not without actually trying to do
that and dying in the middle. This would allow us to, say, provide
user-friendly error messages in the AOT workflow.
2) We can switch to using schema instead of op kind to determine the correct
lowering. Unlike op schema, op kind might be ambiguous (see e.g. #64963),
and using it instead of schema can lead to bugs.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D31148926

Pulled By: ZolotukhinM

fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704
2021-09-30 22:56:18 -07:00
Raghavan Raman
8f3983254b [MicroBench] Added a micro benchmark for prefix sum (#65790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65790

Here are the results of the benchmark:

* ATen - version that calls `at::cumsum`
* NNC - a simple prefix-sum loop implemented in NNC (not vectorized)
* Local - a C++ implementation of the simple prefix-sum loop
* LocalAVX2 - a vectorized C++ implementation of prefix-sum, only using AVX2
* LocalAVX512 - a vectorized C++ implementation of prefix-sum, using AVX512.

The vectorized implementations are from the paper "Parallel Prefix Sum with SIMD" in ADMS' 20.
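For reference, the scalar "Local" baseline is essentially this loop (a sketch); the AVX2/AVX512 variants vectorize it following the paper:

```
#include <cstddef>

// Simple inclusive prefix sum: out[i] = in[0] + ... + in[i].
// The SIMD variants split the array into lanes, scan within registers,
// and propagate carries between blocks.
void prefixSum(const float* in, float* out, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    acc += in[i];
    out[i] = acc;
  }
}
```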

```
$ OMP_NUM_THREADS=1 ./buck-out/opt/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench --benchmark_filter=PrefixSumBench
Run on (36 X 1601 MHz CPU s)
2021-09-28 23:13:12
------------------------------------------------------------------------------------------
Benchmark                                   Time           CPU Iterations UserCounters...
------------------------------------------------------------------------------------------
PrefixSumBench/ATen/64                   1289 ns       1289 ns     543199 GB/s=397.069M/s
PrefixSumBench/ATen/256                  1867 ns       1867 ns     374232 GB/s=1096.8M/s
PrefixSumBench/ATen/1024                 4169 ns       4169 ns     167889 GB/s=1.9649G/s
PrefixSumBench/ATen/4096                14137 ns      14136 ns      49266 GB/s=2.31806G/s
PrefixSumBench/ATen/16384               49887 ns      49883 ns      13988 GB/s=2.6276G/s
PrefixSumBench/ATen/65536              193742 ns     193686 ns       3628 GB/s=2.7069G/s
PrefixSumBench/ATen/262144             764803 ns     764774 ns        917 GB/s=2.74219G/s
PrefixSumBench/ATen/1048576           3040653 ns    3040277 ns        231 GB/s=2.75916G/s
PrefixSumBench/Local/64                   586 ns        586 ns    1197003 GB/s=873.244M/s
PrefixSumBench/Local/256                 1077 ns       1077 ns     646265 GB/s=1.90143G/s
PrefixSumBench/Local/1024                3050 ns       3050 ns     229458 GB/s=2.68579G/s
PrefixSumBench/Local/4096               11910 ns      11910 ns      58953 GB/s=2.75132G/s
PrefixSumBench/Local/16384              43204 ns      43202 ns      16081 GB/s=3.03393G/s
PrefixSumBench/Local/65536             167966 ns     167966 ns       4154 GB/s=3.12139G/s
PrefixSumBench/Local/262144            667631 ns     667613 ns       1048 GB/s=3.14127G/s
PrefixSumBench/Local/1048576          2654785 ns    2654631 ns        264 GB/s=3.15999G/s
PrefixSumBench/NNC/64                     642 ns        642 ns    1095277 GB/s=797.442M/s
PrefixSumBench/NNC/256                   1139 ns       1138 ns     617214 GB/s=1.799G/s
PrefixSumBench/NNC/1024                  3103 ns       3103 ns     225531 GB/s=2.63979G/s
PrefixSumBench/NNC/4096                 12053 ns      12052 ns      58084 GB/s=2.71883G/s
PrefixSumBench/NNC/16384                43227 ns      43225 ns      16192 GB/s=3.03231G/s
PrefixSumBench/NNC/65536               168065 ns     168056 ns       4153 GB/s=3.11972G/s
PrefixSumBench/NNC/262144              668974 ns     668921 ns       1045 GB/s=3.13513G/s
PrefixSumBench/NNC/1048576            2657464 ns    2657341 ns        263 GB/s=3.15677G/s
PrefixSumBench/LocalAVX2/64               523 ns        523 ns    1351308 GB/s=979.537M/s
PrefixSumBench/LocalAVX2/256              755 ns        755 ns     927762 GB/s=2.71159G/s
PrefixSumBench/LocalAVX2/1024            1759 ns       1759 ns     400355 GB/s=4.65609G/s
PrefixSumBench/LocalAVX2/4096            6708 ns       6706 ns     103959 GB/s=4.88649G/s
PrefixSumBench/LocalAVX2/16384          22143 ns      22142 ns      31229 GB/s=5.91951G/s
PrefixSumBench/LocalAVX2/65536          83649 ns      83642 ns       8350 GB/s=6.26828G/s
PrefixSumBench/LocalAVX2/262144        330433 ns     330427 ns       2133 GB/s=6.34679G/s
PrefixSumBench/LocalAVX2/1048576      1302301 ns    1302179 ns        537 GB/s=6.44198G/s
PrefixSumBench/LocalAVX512/64             474 ns        474 ns    1459151 GB/s=1080.8M/s
PrefixSumBench/LocalAVX512/256            576 ns        576 ns    1217442 GB/s=3.55524G/s
PrefixSumBench/LocalAVX512/1024           994 ns        994 ns     703387 GB/s=8.24434G/s
PrefixSumBench/LocalAVX512/4096          3642 ns       3641 ns     190646 GB/s=8.99857G/s
PrefixSumBench/LocalAVX512/16384        10140 ns      10140 ns      68947 GB/s=12.9267G/s
PrefixSumBench/LocalAVX512/65536        35739 ns      35736 ns      19567 GB/s=14.6711G/s
PrefixSumBench/LocalAVX512/262144      156415 ns     156413 ns       4467 GB/s=13.4078G/s
PrefixSumBench/LocalAVX512/1048576     613952 ns     613876 ns       1144 GB/s=13.665G/s
```

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D31253849

Pulled By: navahgar

fbshipit-source-id: f33e7be787c86a09e90babddd66b16e2e0777eb4
2021-09-30 14:44:52 -07:00
Mike Iovine
5f7ab7be6f [Static Runtime] concat_add_mul_replacenan_clip retains axis arg (#65741)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65741

This op previously assumed `axis == 1`, causing graphs that would otherwise be valid to return incorrect results after fusing.

Reviewed By: hlu1

Differential Revision: D31234944

fbshipit-source-id: 89885a3b119357698ebd9fd429b009813260a2f4
2021-09-29 08:04:20 -07:00
Philip Meier
aebde1bc2b deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31141433

Pulled By: mruberry

fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732
2021-09-29 02:40:52 -07:00
Kushashwa Ravi Shrimali
4752453d27 [Structured Kernels] Port for baddbmm and bmm (#64805)
Summary:
This PR attempts to port `baddbmm` and `bmm` to structured kernels. Both are in the same PR because a lot of the code is common to both ops, including the checks and the implementation.

Issue tracker: https://github.com/pytorch/pytorch/issues/55070

cc: ysiraichi ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805

Reviewed By: gchanan

Differential Revision: D31134454

Pulled By: ezyang

fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261
2021-09-28 11:07:31 -07:00
Ben Koopman
6a6ee92e36 [quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241

Test Plan: Imported from OSS

Reviewed By: jingsh

Differential Revision: D31150087

Pulled By: b-koopman

fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727
2021-09-27 16:01:49 -07:00
Mike Iovine
ef9e560796 [Static Runtime] Add aten::remainder out variant (#64967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64967

Out variant implementation for `aten::remainder`. Added both scalar and tensor overloads.

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Remainder`

Reviewed By: d1jang

Differential Revision: D30915469

fbshipit-source-id: 9f27f18c86d66b11eac0aa4659c7062cb785b7e9
2021-09-24 07:51:39 -07:00
Raghavan Raman
31584d065e [Static Runtime] Added NNC implementation for signed log1p kernel. (#65387)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387

Added a customized NNC implementation for the signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.

Also, added a SR microbenchmark for this kernel which shows the performance improvement.

Without fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                             1953 ns       1953 ns     358746
BM_signed_log1p/64                             2049 ns       2049 ns     342145
BM_signed_log1p/512                            3291 ns       3291 ns     214342
BM_signed_log1p/4096                          15559 ns      15559 ns      44420
BM_signed_log1p/32768                        101936 ns     101935 ns       6843
BM_signed_log1p/65536                        194792 ns     194789 ns       3615
```

With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark                                         Time           CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16                              369 ns        369 ns    1896179
BM_signed_log1p/64                              497 ns        497 ns    1406995
BM_signed_log1p/512                            1618 ns       1618 ns     430209
BM_signed_log1p/4096                          11327 ns      11326 ns      61463
BM_signed_log1p/32768                         84099 ns      84086 ns       8325
BM_signed_log1p/65536                        166531 ns     166510 ns       4186
```

This clearly shows a >15% improvement in the performance of this kernel with NNC fusion.

On the inline_cvr local model, there is a small improvement in terms of profiled time spent on these ops:
  without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
  with NNC fusion: `0.55%`

Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Also did the accuracy test with inline_cvr, as described at https://fb.quip.com/qmdDAJzEmPtf, on the full-size model (285298536_1):

```
get 57220 prediction values
get 57220 prediction values
max_error:  0  total:  0
```

Reviewed By: hlu1

Differential Revision: D30609492

fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
2021-09-22 15:53:33 -07:00
Rodrigo Berriel
a0dea074b2 Remove .data from benchmarks and tensorboard (#65389)
Summary:
Related to https://github.com/pytorch/pytorch/issues/30987 and https://github.com/pytorch/pytorch/issues/33628. Fix the following tasks:

- Remove the use of `.data` in all our internal code:
  - [x] `benchmarks/`
  - [x] `torch/utils/tensorboard/`

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23 albanD gchanan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65389

Reviewed By: soulitzer

Differential Revision: D31093464

Pulled By: albanD

fbshipit-source-id: 3a9c8834fd544a59a1cc2b930ae538fd1d46b232
2021-09-22 11:16:59 -07:00
jiej
127c9402d0 Revert "Revert D30752939: [pytorch][PR] nvfuser update" (#65137)
Summary:
This reverts commit 03389dc851.

Attempt again for PR: https://github.com/pytorch/pytorch/issues/63745
Fixes the Windows build failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d
2021-09-22 04:54:51 -07:00
Hao Lu
ce101fed02 [PyPer] copy-free freeze_module (#65118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65118

Cloning the module can increase memory use. By freezing the module directly without cloning it first, we can avoid this memory usage increase.

Reviewed By: eellison, movefast1990

Differential Revision: D30955053

fbshipit-source-id: 2feb738eddcf66aa68c92bf695cc05b57bd990f0
2021-09-20 17:25:10 -07:00
Mike Iovine
99e4ab5d44 [Static Runtime] Implement and enable variadic tuple unpack (#64934)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934

Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from:
```
%0, %1 = prim::TupleUnpack(%a)
%2, %3 = prim::TupleUnpack(%b)
```
into:
```
%0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b)
```

The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about.

Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack`

Reviewed By: d1jang

Differential Revision: D30872109

fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9
2021-09-20 10:36:11 -07:00
Don Jang
ae00075ac7 [Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123

This change re-reverts D30883290 (0e11454d19). D30883290 (0e11454d19) broke the OSS build since it implicitly removed the default move constructor of `StaticRuntime`.

```
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57   return torch::jit::StaticRuntime(*smod);
Sep 15 15:39:57          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57   std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57                                  ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57       unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57       ^
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57    auto sr = getStaticRuntime();
Sep 15 15:39:57         ^    ~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57   std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57                                  ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57       unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57       ^
Sep 15 15:39:57 2 errors generated.
```

This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine).

Original Summary:

This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.

`MemoryPlanner` performs an independent sub-task: statically analyzing the graph, creating a memory plan, and allocating/deallocating managed Tensors.

This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.

Test Plan: - Confirmed that the OSS build went well (see the External Tests section).

Reviewed By: mikeiovine

Differential Revision: D30983292

fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116
2021-09-17 13:32:01 -07:00
albanD
473e55d5b2 Use classmethods for overrides (#64841)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64841

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D30991424

Pulled By: albanD

fbshipit-source-id: 551e2119768f3a4292713f3bfa83930f5506adbd
2021-09-17 08:32:49 -07:00
Don Jang
8241193d76 [Static Runtime] Introduce static_runtime::dict_unpack (#64771)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64771

Test Plan:
- Added `StaticRuntime.RemoveImmutableInputDictLookupsWithImmutableInputDict`
- Added `StaticRuntime.RemoveImmutableInputDictLookupsWithMutableInputDict`
- TBD: Perf impact measurement

Reviewed By: mikeiovine

Differential Revision: D30685083

fbshipit-source-id: 050a92ef3b3ed0fdc0ab7a13a4b5dbfede9342a9
2021-09-16 23:25:13 -07:00
Eli Uriegas
03389dc851 Revert D30752939: [pytorch][PR] nvfuser update
Test Plan: revert-hammer

Differential Revision:
D30752939 (cfaecaf40b)

Original commit changeset: ce122e80f01b

fbshipit-source-id: 57685df8f9946032a06eff1de8a3d1498500d2d2
2021-09-15 17:38:47 -07:00
jiej
cfaecaf40b nvfuser update (#63745)
Summary:
Syncing the nvfuser code base from the devel branch. Listing a few of our developments since the last sync:

- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants required by the codegen (e.g. reduction axes).

To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.

Internal updates are in files located in:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`

Updates affecting integration:

1. profile_ivalue enabled for nvfuser. Related changes are in `torch/csrc/jit/runtime/*`.
2. Exposed a few more symbols in `aten/src/ATen/core/*` used by codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745

Reviewed By: saketh-are

Differential Revision: D30752939

Pulled By: malfet

fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
2021-09-15 14:42:55 -07:00
Don Jang
3fb33b38b9 [Static Runtime] Check if outputs of a node do not overlap with each other (#63013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013

This change enhances the current memory overlap check to include outputs: the enhancement enforces a constraint that all outputs of a node should NOT overlap with each other, since the node is supposed to update all of them at the same time.

This check will detect a problem like T97393697 immediately in debug mode.
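A rough sketch of the kind of pairwise check this adds (the real `verify_no_memory_overlap` is more precise; `is_alias_of` only detects shared storage, which is conservative but safe here):

```
#include <ATen/ATen.h>
#include <vector>

// Sketch: every pair of a node's outputs must be disjoint. Sharing
// storage is treated as overlap.
bool outputsDontOverlap(const std::vector<at::Tensor>& outputs) {
  for (size_t i = 0; i < outputs.size(); ++i) {
    for (size_t j = i + 1; j < outputs.size(); ++j) {
      if (outputs[i].is_alias_of(outputs[j])) {
        return false;
      }
    }
  }
  return true;
}
```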

Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`

- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.

Reviewed By: hlu1

Differential Revision: D30211705

fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
2021-09-15 08:38:05 -07:00
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision: D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Eddie Ren
9c73a48ecf ND Embeddings benchmark - Standardize randomized inputs (#64707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707

Use torch.randn instead of torch.from_numpy to generate the tensor

Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test

Reviewed By: jingsh

Differential Revision: D30817302

fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672
2021-09-13 06:47:35 -07:00
Raghavan Raman
2cc9778495 [MicroBench] Added a log_vml version of the signed log1p kernel (#64205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64205

The log_vml version of the micro-bench is over **2x** faster than the log1p version. Here are the perf numbers:

```
---------------------------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------
SignedLog1pBench/ATen/10/1467           45915 ns        45908 ns        14506 GB/s=2.5564G/s
SignedLog1pBench/NNC/10/1467            40469 ns        40466 ns        17367 GB/s=2.9002G/s
SignedLog1pBench/NNCLogVml/10/1467      19560 ns        19559 ns        35902 GB/s=6.00016G/s
```

Thanks to bertmaher for pointing this out.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30644716

Pulled By: navahgar

fbshipit-source-id: ba2b32c79d4265cd48a2886b0c62d0e89ff69c19
2021-09-10 16:49:06 -07:00
Eddie Ren
3fbb49e75d Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647

Add support for benchmarking 8-bit quantization of N-D batched embeddings. Currently this only works for 3-dim embeddings and still requires thought on ramping up from 3 dims to N dims.

Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test```

Reviewed By: jingsh

Differential Revision: D30770085

fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e
2021-09-10 16:49:02 -07:00
Mike Iovine
616fd9219d [Static Runtime] Add sign/abs/lop1p/mul fusion pass (#64209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209

Add a new fusion pass that transforms the following pattern:
```
graph(%input):
    %0 : Tensor = aten::sign(%input)
    %1 : Tensor = aten::abs(%input)
    %2 : Tensor = aten::log1p(%1)
    %res : Tensor = aten::mul(%0, %2)
    return (%res)
```
Into a single op:
```
graph(%input):
    %res : Tensor = static_runtime::signed_log1p(%input)
    return (%res)
```
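Elementwise, the fused op computes `sign(x) * log1p(|x|)`; a scalar sketch of the math done in one pass:

```
#include <cmath>

// Scalar semantics of the fused signed_log1p op (the real kernel operates
// on whole tensors; this just shows the single-pass computation).
float signedLog1p(float x) {
  const float s = (x > 0.0f) ? 1.0f : (x < 0.0f ? -1.0f : 0.0f);
  return s * std::log1p(std::fabs(x));
}
```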

The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff **does not** enable this pass.

Followup: navahgar will add an NNC kernel which is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes.
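
As illustration, a minimal TorchScript sketch that produces the four-node pattern above (the fused `static_runtime::signed_log1p` op itself is internal to Static Runtime and not part of the public API):

```python
import torch

@torch.jit.script
def signed_log1p(x):
    # scripted form lowers to the aten::sign / aten::abs / aten::log1p / aten::mul pattern
    return torch.sign(x) * torch.log1p(torch.abs(x))

print(signed_log1p.graph)  # the four nodes the fusion pass would match
```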

Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`

Test passed with new graph pass disabled and enabled.

Reviewed By: hlu1

Differential Revision: D30559929

fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a
2021-09-02 08:31:40 -07:00
Ray Peng
09e610e36d [Static Runtime] Out version for softmax (#64243)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64243

Test Plan:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
...
V0830 16:35:22.524479 613839 impl.cpp:1410] Switch to out variant for node: %5 : Tensor = aten::softmax(%a.1, %dim.1, %dtype.1)
...
[       OK ] StaticRuntime.IndividualOps_Softmax (803 ms)
```

Reviewed By: hlu1

Differential Revision: D30656149

fbshipit-source-id: 115b7b4a75448fd6a5c526808080ca9a4251302c
2021-08-31 18:33:26 -07:00
Harut Movsisyan
3c15822f5f [Static Runtime] Implement aten::nonzero out variant (#64126)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64126
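
For context, the eager-mode analogue of this out variant is the `out=` overload of `torch.nonzero`; its output size is data dependent, so the preallocated tensor is resized in place. A minimal sketch:

```python
import torch

x = torch.tensor([0.0, 1.5, 0.0, -2.0])
out = torch.empty(0, dtype=torch.long)
torch.nonzero(x, out=out)  # out is resized to (num_nonzero, ndim) and filled in place
print(out)                 # tensor([[1], [3]])
```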

Test Plan:
Confirm out variant is called:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: mikeiovine

Differential Revision: D30617729

fbshipit-source-id: 752749638c8f467815efa57021cb3de5c728ab1b
2021-08-31 00:51:15 -07:00
Harut Movsisyan
1f16c22dc8 [Static Runtime] Implement aten::cumsum out variant (#64159)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159
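
For readers new to the pattern: an out variant writes into preallocated storage instead of allocating a fresh result tensor on every call, which is what lets Static Runtime's memory planner reuse buffers. A minimal eager-mode sketch with `torch.cumsum`:

```python
import torch

x = torch.randn(5)
out = torch.empty_like(x)
torch.cumsum(x, dim=0, out=out)  # result written into `out`, no new allocation
torch.testing.assert_close(out, torch.cumsum(x, dim=0))
```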

Test Plan:
Confirm out variant is called for both versions:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: mikeiovine

Differential Revision: D30622819

fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736
2021-08-30 16:18:22 -07:00
Harut Movsisyan
e24c3644d8 [Static Runtime] aten::cat out version when it is not being replaced by prim::VarConcat (#64157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157

UseVariadicCat optimization is not applied to aten::cat if the list input to the op cannot be moved to a position before the op (https://fburl.com/diffusion/l6kweimu). For these cases we need an out version for SR (sketched below).
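
A rough eager-mode sketch of the contract an out version of `aten::cat` satisfies (Static Runtime's actual implementation is internal):

```python
import torch

xs = [torch.randn(2, 3), torch.randn(2, 3)]
out = torch.empty(4, 3)
torch.cat(xs, dim=0, out=out)  # concatenation written into the preallocated `out`
```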

Test Plan:
Confirm out variant is called:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: d1jang

Differential Revision: D30598574

fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a
2021-08-30 09:42:38 -07:00
Raghavan Raman
dc4fd3bdda [MicroBench] Added a micro benchmark for a signed log1p kernel. (#64032)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64032

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D30579198

Pulled By: navahgar

fbshipit-source-id: a53d68225fba768b26491d14b535f8f2dcf50c0e
2021-08-30 09:27:51 -07:00
Harut Movsisyan
8af1407eab [Static Runtime] Out version for torch.linalg.norm (#64070)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070
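
Presumably the "both versions" are the full reduction and the per-dim reduction overloads; a quick eager-mode sketch:

```python
import torch

x = torch.randn(3, 4)
print(torch.linalg.norm(x))         # overall (Frobenius) norm
print(torch.linalg.norm(x, dim=1))  # reduction along a single dim
```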

Test Plan:
Confirm out variant is called for both versions:

```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```

Reviewed By: d1jang

Differential Revision: D30595816

fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563
2021-08-29 21:00:11 -07:00
Don Jang
9f1f22b9bc [Static Runtime] Add out variant of quantized::embedding_bag_byte_prepack (#64081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64081

This change adds an out variant of `quantized::embedding_bag_byte_prepack`.
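
For reference, a sketch of calling the underlying op in eager mode; my understanding (an assumption, not stated in this commit) is that each row is quantized to uint8 with a per-row fp32 scale and zero point appended:

```python
import torch

weight = torch.randn(10, 16)
packed = torch.ops.quantized.embedding_bag_byte_prepack(weight)
print(packed.shape, packed.dtype)  # expected: (10, 16 + 8) and torch.uint8
```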

Test Plan:
- Added `ShapeInferenceTest.QEmbeddingBagByteUnpack`.

- Observed

```
V0824 13:38:49.723708 1322143 impl.cpp:1394] Switch to out variant for node: %2 : Tensor = quantized::embedding_bag_byte_prepack(%input)
```

Reviewed By: hlu1

Differential Revision: D30504216

fbshipit-source-id: 1d9d428e77a15bcc7da373d65e7ffabaf9c6caf2
2021-08-27 10:53:23 -07:00
Harut Movsisyan
f2c47cf4db [Static Runtime] Out version for fmod (#64046)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64046

Test Plan:
Confirm out variant is used:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1

V0826 23:31:30.321382 193428 impl.cpp:1395] Switch to out variant for node: %4 : Tensor = aten::fmod(%a.1, %b.1)
```

Reviewed By: mikeiovine

Differential Revision: D30581228

fbshipit-source-id: dfab9a16ff8afd40b29338037769f938f154bf74
2021-08-27 03:05:06 -07:00
Don Jang
c90b3cb1da [Static Runtime] Manage temporary Tensors for aten::layer_norm (#64078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078

This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the static runtime can manage the `tmp1` and `tmp2` Tensors.

Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
    at::Tensor mean = create_empty_from({M}, *X);
    at::Tensor rstd = create_empty_from({M}, *X);
```
which the static runtime otherwise misses the opportunity to manage.

This change puts them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime currently chooses to manage only output tensors.
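
A minimal eager-mode sketch of the two intermediates being hoisted into outputs (assuming the usual layer-norm math over the last dim with eps = 1e-5; the C++ kernel's exact shapes differ):

```python
import torch

x = torch.randn(4, 8)
mean = x.mean(dim=-1, keepdim=True)                                  # first temporary
rstd = (x.var(dim=-1, unbiased=False, keepdim=True) + 1e-5).rsqrt()  # second temporary
y = (x - mean) * rstd
torch.testing.assert_close(y, torch.nn.functional.layer_norm(x, [8]))
```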

Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.

- Confirmed that the new op gets activated during testing:

```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)

```

Reviewed By: hlu1

Differential Revision: D30486475

fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
2021-08-27 02:44:43 -07:00
Don Jang
cbfec02007 [Static Runtime] Add native op for aten::expand_as (#64024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64024

`aten::expand_as` creates a view of the input tensor. This change adds its native op implementation for the static runtime.
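
A quick eager-mode illustration of the view semantics (note the zero stride: no data is copied):

```python
import torch

a = torch.randn(3, 1)
b = torch.randn(3, 2)
v = a.expand_as(b)
print(v.shape, v.stride())           # torch.Size([3, 2]) (1, 0)
print(v.data_ptr() == a.data_ptr())  # True: the view shares a's storage
```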

Test Plan: - Added `StaticRuntime.IndividualOps_ExpandAs`

Reviewed By: hlu1

Differential Revision: D30546851

fbshipit-source-id: e53483048af890bc41b6192a1ab0c5ba0ee2bdc0
2021-08-26 13:05:53 -07:00
Hao Lu
6fa646ad54 [StaticRuntime] Fix bug in HasInplaceOp (#63842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63842

Reviewed By: mikeiovine

Differential Revision: D30506914

fbshipit-source-id: b2e358cfb991dacdb295b61bbc37beb36b73b852
2021-08-24 17:07:45 -07:00
Harut Movsisyan
956c8fa01e Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654

Test Plan:
```
> buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 27.970

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 41.830

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 499.114

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 6.268

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 12.676

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 438.219

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 7.657

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 18.523

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 55.103

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 2.501

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 10.589

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 50.102
```
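
For reference, a quick eager-mode check that the einsum_bmm and bmm benchmarks compute the same thing (shapes borrowed from the B32 case):

```python
import torch

a = torch.randn(32, 25, 30)
b = torch.randn(32, 30, 20)
torch.testing.assert_close(torch.einsum('bij,bjk->bik', a, b), torch.bmm(a, b))
```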

Reviewed By: ajyu

Differential Revision: D30455179

fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b
2021-08-24 16:26:26 -07:00
Mike Iovine
7774a4e95b [Static Runtime] Implement prim::VarStack out variant (#63579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579

Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`

Reviewed By: navahgar

Differential Revision: D30410525

fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
2021-08-24 09:44:29 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that there are no classes using KernelArena for memory management, we can
remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another commit in transition from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.

Differential Revision: D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Mikhail Zolotukhin
dd96c26066 [TensorExpr] More NFC changes like Expr* -> ExprPtr. (#63778)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778

This is preparation for a switch from raw pointers to shared pointers
as a memory model for TE expressions and statements.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30487425

Pulled By: ZolotukhinM

fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
2021-08-24 00:30:49 -07:00
Don Jang
84890aae35 [Static Runtime] Add an out variant op for aten::abs (#63675)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63675

This change adds an out variant implementation for `aten::abs`.

Test Plan:
- Observed `V0820 14:14:08.880342 101788 impl.cpp:1394] Switch to out variant for node: %3 : Tensor = aten::abs(%a.1)`

- Perf impact: TBD

Reviewed By: hlu1

Differential Revision: D30461317

fbshipit-source-id: 0c0230bd40afe463ae1ccb222c2a1207ebcf4191
2021-08-23 16:25:10 -07:00
Hao Lu
b2a601ffe5 [Static Runtime] Implement out variant for fb::quantized_linear (#63635)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63635

Reviewed By: ajyu

Differential Revision: D30446234

fbshipit-source-id: 1ef014186ff725930a97d0159626f9233ee74030
2021-08-20 21:42:22 -07:00
Don Jang
913c1f83f4 [Static Runtime] Add native op for aten::detach (#63625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63625

This change adds a static runtime native op implementation for the `aten::detach` op.

See the standard  `aten::detach`'s implementation (https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp.html#_ZN2at6native6detachERKNS_6TensorE ) for comparison.
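
A quick eager-mode illustration of the semantics the native op has to preserve:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.detach()                       # same storage, detached from the autograd graph
print(y.data_ptr() == x.data_ptr())  # True
print(y.requires_grad)               # False
```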

Test Plan:
- Added `StaticRuntime.IndividualOps_Detach`.

- Observed

```
V0819 18:55:33.181188 3092034 impl.cpp:1398] Switch to native impl for node: %a.1 : Tensor = aten::detach(%input.1)
```

Reviewed By: hlu1

Differential Revision: D30443187

fbshipit-source-id: d6e0eadb1b817e0a126c4fc97526abc276ee8a17
2021-08-20 00:46:27 -07:00
Philip Meier
99203580a9 Updates internal assert_allclose callsites in favor of assert_close (#61841)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61841

Redo of #60863.
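
For reference, the replacement API in one line (default float32 tolerances are rtol=1.3e-6, atol=1e-5 per the torch.testing docs):

```python
import torch

torch.testing.assert_close(torch.tensor(1.0), torch.tensor(1.0 + 1e-7))  # passes: within default tolerances
```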

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30408145

Pulled By: mruberry

fbshipit-source-id: 0b34ebc7f23ba38ecd89640b61d8aca59b7eab58
2021-08-19 12:50:41 -07:00
Mike Iovine
47a9e8ff32 [Static Runtime] Support __getitem__ for lists (#63398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63398

This change provides a native `__getitem__` implementation for lists to avoid overhead associated with falling back to the JIT interpreter.
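
A minimal TorchScript sketch (the `pick` helper is hypothetical) of the kind of list indexing that now stays inside Static Runtime instead of falling back to the interpreter:

```python
import torch
from typing import List

@torch.jit.script
def pick(xs: List[torch.Tensor], i: int) -> torch.Tensor:
    return xs[i]  # lowers to aten::__getitem__ on a List[Tensor]

print(pick.graph)
```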

Test Plan: Unit tests: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`

Reviewed By: hlu1

Differential Revision: D30368464

fbshipit-source-id: e0e0971508cd5d9bcf6025606993dc24ecbf6764
2021-08-19 06:38:51 -07:00