pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
mikeiovine	56c23f5633	[SR] Out variant for embedding_bag_byte_unpack Pull Request resolved: https://github.com/pytorch/pytorch/pull/77661 Add an out variant and wrapper in static runtime. I just added the declaration with the others in `qembeddingbag.h` for now (rather than properly adding the out variant to the torch library). This can be fixed in a followup. Differential Revision: [D36449840](https://our.internmc.facebook.com/intern/diff/D36449840/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36449840/)! Approved by: https://github.com/tenpercent	2022-05-25 23:24:11 +00:00
Natalia Gimelshein	225b037df8	port clamp.Tensor to structured (#77149 ) Per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/77149 Approved by: https://github.com/ezyang	2022-05-11 21:00:02 +00:00
Max Podkorytov	a41d4f27d7	[static-runtime] refactor out variant for `aten::embedding_bag` (#76207 ) Differential Revision: D35767504 Pull Request resolved: https://github.com/pytorch/pytorch/pull/76207 Approved by: https://github.com/mikeiovine	2022-05-11 18:29:18 +00:00
mikeiovine	02713221e3	[SR] Fuse clamp/nan_to_num Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094 Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor. Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)! Approved by: https://github.com/navahgar	2022-05-10 23:33:59 +00:00
Mike Iovine	9e32cdeda6	[SR] Use at::DimVector in reshape_copy_out (#76473 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76473 Avoid some extra heap allocations by using DimVector ghstack-source-id: 155569314 Test Plan: Existing unit tests Reviewed By: navahgar, huiguoo Differential Revision: D35972439 fbshipit-source-id: 971998d6bcaaf9bb598772f1e2ca6b13f29f92a4 (cherry picked from commit f2b70c38fffe6355cd8b2f0eb36f299c0d50e5d8)	2022-05-05 17:31:54 +00:00
Natalia Gimelshein	122798916f	Port clamp_min and clamp_max to structured per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/76874 Approved by: https://github.com/bdhirsh	2022-05-05 15:52:20 +00:00
Mike Iovine	cac2733af1	[SR] Codegen for aten::clamp (#76340 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76340 NNC kernel for `clamp` scalar case ghstack-source-id: 155466507 Reviewed By: navahgar, huiguoo Differential Revision: D35904019 fbshipit-source-id: e4115757f7e2cbdf364b88be3f599dfc3028750f (cherry picked from commit bdc4b918bc5a14490f46c79793f764b28c18388f)	2022-05-04 23:08:49 +00:00
Ansha Yu	ee636e2fd1	[sr] remove max_indices argument of embedding_bag when unnecessary (#75993 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993 Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0 {F723683014} More details in https://fb.quip.com/MKumAjz1YD4 (`1f47a80e88`)a#temp:C:FPD3 (`ecd5567980`)e5a0871ae5d481286b511ef7 The last 3 outputs of embedding_bag are unused in the graph: P495814049. * max_indices output isn't necessary for the main output, so remove it when it's not used in the graph. * offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph. * bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now. Test Plan: `./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only` Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604` Before: I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78 0.11156 ms. 99.0457%. aten::embedding_bag (52 nodes, out variant) After: I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2 0.0789221 ms. 98.7096%. 
static_runtime::embedding_bag (52 nodes, out variant) * Ads prod canary: https://www.internalfb.com/intern/ads/canary/443002539593035806/ * 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594` https://www.internalfb.com/intern/servicelab/602875732/ * 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594` https://www.internalfb.com/intern/servicelab/1002874745/ Reviewed By: mikeiovine Differential Revision: D35726594 fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a (cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)	2022-04-22 15:36:35 +00:00
Yukio Siraichi	22a10ce513	Port `cat` kernel to structured kernels. Tracking issue: #55070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/68640 Approved by: https://github.com/ezyang	2022-04-14 17:49:43 +00:00
Don Jang	85e163c56b	[Static Runtime] Fix a bug that `aten::full_like` reuses a tensor that does not match arguments (#74255 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255 This change fixes a bug in which `aten::full_like` reuses a previously allocated tensor that does not match the requested one when the arguments to `aten::full_like` change dynamically. Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path. Reviewed By: mikeiovine Differential Revision: D34863639 fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608 (cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)	2022-04-05 22:30:41 +00:00
Raghavan Raman	60bda4d06b	[Static Runtime] Fix handling relu in quantized linear relu dynamic op Summary: The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter. At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear if that semantics is being followed in this code. So, we are handling this in SR's out-variant implementation, until the owner fixes that issue. This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model. Test Plan: ``` buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16 ``` Reviewed By: mikeiovine Differential Revision: D35366309 fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9 (cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)	2022-04-04 22:16:22 +00:00
Mike Iovine	688039859f	[PyTorch][Static Runtime] out variant for where.self (#73438 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73438 Add out variant for `where.self`; requires PyTorch core changes as no out variant existed previously ghstack-source-id: 151505601 Test Plan: * Existing `where` tests in static runtime pass * CI for core `where` tests Reviewed By: hlu1 Differential Revision: D34469785 fbshipit-source-id: 8a4ebbf38b2364534fbf43812bfcfdf69ea174b3 (cherry picked from commit d3faf61f408a385d67b5b821dfaf084a8e713f30)	2022-03-17 00:14:11 +00:00
Don Jang	6294a2eb7f	[Static Runtime] Add out variant wrapper for aten::index_select (#74321 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74321 This change adds out variant wrapper for aten::index_select. Test Plan: Added a unittest Reviewed By: mikeiovine Differential Revision: D34928012 fbshipit-source-id: d808363d740d79fa25abee4dd33920fbb6ec7283 (cherry picked from commit ba9b3c0cd4ba240c4a2174f3376580a1880b2b4a)	2022-03-16 23:43:21 +00:00
Mike Iovine	f14a0be302	[SR] Avoid allocating rstd/mean in layer_norm (#73606 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606 The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely. ghstack-source-id: 151394116 Test Plan: Existing unit tests Reviewed By: hlu1 Differential Revision: D34562131 fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de (cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)	2022-03-15 22:07:11 +00:00
Don Jang	381c0c080f	[Static Runtime] Fix a bug that `aten::full` reuses a tensor that does not match requested one (#73990 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990 This change fixes a bug in which `aten::full` reuses a previously allocated tensor that does not match the requested one when the arguments to `aten::full` change dynamically. The same fix applies to multiple other out variant wrappers added to Static Runtime; those fixes will follow. Test Plan: - Added a unittest. Reviewed By: mikeiovine Differential Revision: D34768718 fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9 (cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)	2022-03-15 16:13:52 +00:00
Don Jang	1b80f609b0	[Static Runtime] Add out variant wrapper for aten::ones_like (#73945 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945 This change adds an out variant wrapper for aten::ones_like. Test Plan: - Added a unittest. - Checked that the op execution got switched to its added out variant (P485330978). Reviewed By: hlu1 Differential Revision: D34727057 fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5 (cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)	2022-03-14 20:29:58 +00:00
Don Jang	60f22a40ef	[Static Runtime] Add out variant wrapper for aten::zeros (#73946 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73946 This change adds an out variant wrapper for aten::zeros. Test Plan: - Added a unittest. - Confirmed that the added out variant gets executed by the unittest (P485324923). Reviewed By: mikeiovine Differential Revision: D34725843 fbshipit-source-id: 3ac02ba1914c4a51969381e610d4243df65071ed (cherry picked from commit 368836d51709b7f96c79114984a95606b29766b1)	2022-03-11 00:52:30 +00:00
Mike Iovine	97b20b9b50	[SR][easy] Stack/concat out variants do not segfault on empty inputs (#73704 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73704 Empty inputs are invalid for these ops. But while looking for optimizations, I noticed that these ops just segfault when that happens, which is not helpful for users. Added a check/error message. ghstack-source-id: 150812721 Test Plan: New unit tests Reviewed By: hlu1 Differential Revision: D34596954 fbshipit-source-id: 6b22a3a255273920210dcd41f54a9d238bbbcc14 (cherry picked from commit 9e950bfffef36c320638662bdb72f19eb805a228)	2022-03-09 00:55:57 +00:00
Don Jang	71961d37bb	[Static Runtime] Add out variant wrapper for aten::ones (#73851 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73851 This change adds an out variant wrapper for aten::ones Test Plan: Added a unittest Reviewed By: mikeiovine Differential Revision: D34557095 fbshipit-source-id: 0d2ac8d0ad6f73067e28c2cebd3b4a018a9b17ae (cherry picked from commit cc1dda957b8c3acd71de3aa6054c11a9aab5cfa6)	2022-03-07 20:33:22 +00:00
Don Jang	c62de0ac15	[Static Runtime] [Code Cleanup] Use `SROperator` for operators' function type (#73450 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450 This change uses `SROperator` for operators' function type Test Plan: N/A Reviewed By: mikeiovine Differential Revision: D34483246 fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f (cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)	2022-03-01 02:30:48 +00:00
Mike Iovine	d398d4d32c	[SR] Disable aten::where out variant (#73367 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73367 The op is currently bugged w.r.t. a `condition` input that is not the same shape as the others: ``` def forward(self, cond_1d, x, y): shape = [-1] + [1] * (x.dim() - 1) cond = cond_1d.view(shape) return torch.where(cond, x, y).clone() Condition: 01 00 [ CPUBoolType{2} ] A: 06 -9 08 -8 [ CPULongType{2,2} ] B: -4 05 -5 -2 [ CPULongType{2,2} ] Actual: 06 05 -5 -2 [ CPULongType{2,2} ] Expected: 06 -9 -5 -2 [ CPULongType{2,2} ] ``` ghstack-source-id: 149963254 Test Plan: Unit tests exercise broadcasting Reviewed By: d1jang Differential Revision: D34454770 fbshipit-source-id: 6ad4c4ca6893d2b87852a17d437437d99ca94ab4 (cherry picked from commit 7135bc40e9fd930c08f5291b7d6b4902ec30005b)	2022-02-26 01:08:45 +00:00
Don Jang	5772b1afbc	[Static Runtime] Avoid checks during op execution for TupleConstruct & ListConstruct (#69029 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69029 This change optimizes the execution of `prim::TupleConstruct` & `prim::ListConstruct` by performing case analysis at op loading time, not op execution time. Test Plan: - Existing unittests - Ran inline_cvr nets via ptvsc2_predictor_bench with compare_result=1 Reviewed By: swolchok Differential Revision: D32518670 fbshipit-source-id: 575b29b06eadf77ba9f1be306119fa194d4f21bf (cherry picked from commit 88cc2253b927267cad33063284e9cc66e0d31e2f)	2022-02-24 16:38:55 +00:00
Raghavan Raman	a7f9e610af	[Static Runtime] Adding out-variant support for quantized::linear_relu_dynamic_fp16 (#73238 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73238 ghstack-source-id: 149706142 Test Plan: Tested on video model. Profile before: ``` 4.36922 ms. 17.1047%. quantized::linear_relu_dynamic_fp16 (14 nodes) ``` Profile after: ``` 3.80852 ms. 17.1074%. quantized::linear_relu_dynamic_fp16 (14 nodes, out variant) ``` Reviewed By: mikeiovine Differential Revision: D34287961 fbshipit-source-id: b88e2f3432215eac14fd36f945a4810d29ba1051 (cherry picked from commit 076a766ab471c362af2f2ee3b55fe75829f5e955)	2022-02-23 18:33:46 +00:00
Mike Iovine	6f84c5f0b9	[SR] Generalize VarStackNodeWrapper (#71573 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71573 Many ops (`gather_ranges_to_dense`, `sigrid_transforms`, etc) are implemented like this: ``` void op_out_(std::vector<Tensor>& output) { // actual op implementation } std::vector<Tensor> op() { std::vector<Tensor> outputs; // populate outputs with empty tensors op_out_(outputs) return outputs; } ``` This pattern is not ideal for ops that are fused with `ListUnpack` - it would be better if we wrote to the outputs directly. This diff extends the ideas from `VarStackNodeWrapper` to allow for this. The changes are: * `s/VarStackNodeWrapper/ProcessedNodeInputWrapper`. The old name was bad because the class is more general than the `VarStack` use case. Also moved the class to `processed_node_wrapper.h` * Add a `ProcessedNodeOutputWrapper`; it's essentially the same idea as `ProcessedNodeInputWrapper`, but it allows non-const access to the underlying tensors. * These classes are very similar, so CRTP is used to facilitate code re-use ghstack-source-id: 148825800 Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack` Reviewed By: swolchok Differential Revision: D33687965 fbshipit-source-id: 5fa0107211116867bb2b63968c126550d32fbea6 (cherry picked from commit `75c263d960`)	2022-02-10 19:43:47 +00:00
Scott Wolchok	958f9cf5ff	[PyTorch][Static Runtime] Fix extra refcount bumps in layer_norm (#71237 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71237 Noticed these on inspection. ghstack-source-id: 147171799 Test Plan: CI Reviewed By: mikeiovine Differential Revision: D33519799 fbshipit-source-id: 167c63323b345a5822303cecdbbbbb959f66f6e4 (cherry picked from commit `57e8da2d35`)	2022-01-20 00:16:17 +00:00
Scott Wolchok	bf82d2012e	[PyTorch] Add IValue::toDimVector & mostly replace toIntVector with it (#71247 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247 Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it. ghstack-source-id: 146933314 Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good? Reviewed By: mikeiovine Differential Revision: D33556198 fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b	2022-01-14 14:32:40 -08:00
Elias Ellison	c8332256ee	[JIT] Refactor SR invocation of fusion (#70508 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70508 We can create the code object at compile time instead of at runtime to speed it up. This also makes the compilation cache unnecessary. TODO: figure out if there's a way to cache the InterpreterState object Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D33458648 Pulled By: eellison fbshipit-source-id: 710389741e7c6210528f2f96ab496fcd533d942a	2022-01-10 12:16:35 -08:00
Mike Iovine	9ad21091dd	[SR] Give VarStackNodeWrapper an iterator (#69922 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69922 D32596934 (`65f54bc000`) made the serial stack implementation a bit brittle. It introduced a new container type: `VarStackNodeWrapper`. This type was used as a template parameter in the serial stack implementation. The other type used in the serial stack implementation is `at::ArrayRef<at::Tensor>`. Ideally, the interface of `VarStackNodeWrapper` should be as close as possible to this other type. However, because the new container type did not have an iterator, expressions like this would fail to compile: ``` for (const auto& tensor : tensors) { // do something } ``` Introducing this iterator will make the code easier to maintain going forward. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack` I consider this a `VarStack` implementation detail, so I'd prefer not to test it directly. We can test it implicitly by adding some code to the serial stack implementation that uses the iterator. Reviewed By: swolchok Differential Revision: D33101489 fbshipit-source-id: 7cf44c072d230c41bd9113cf2393bc6a6645a5b5	2022-01-07 07:24:47 -08:00
Scott Wolchok	4d8fc8693c	[PyTorch][Static Runtime] Support memory planning for torch.to() w/o requiring copying (#67223 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67223 ghstack-source-id: 146482215 Test Plan: See perf measurements on ctr_mobile_feed local_ro net for this stack: P467203421 (local is neutral: P467267554) Reviewed By: hlu1 Differential Revision: D31776259 fbshipit-source-id: f84fcaa05029577213f3bf2ae9d4b987b68480b3	2022-01-04 22:36:10 -08:00
Scott Wolchok	99a10c371f	[PyTorch][Static Runtime] Fix dtype changing between iterations for to() (#67394 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67394 ghstack-source-id: 146464294 Test Plan: Added new test, which failed but now passes. Checked perf on ctr_mobile_feed local net (still not on recordio inputs yet), looks neutral ``` Stable, local ======================================== I1027 13:40:23.411118 2156917 PyTorchPredictorBenchLib.cpp:131] PyTorch predictor: number of prediction threads 1 I1027 13:40:48.708222 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16975. Iters per second: 162.081 I1027 13:41:13.915948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.1487. Iters per second: 162.636 I1027 13:41:38.984462 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11408. Iters per second: 163.557 I1027 13:42:04.138948 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.13566. Iters per second: 162.982 I1027 13:42:29.342630 2156917 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.14269. Iters per second: 162.795 I1027 13:42:29.342669 2156917 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14218, standard deviation: 0.0202164 0 FixToDtypeChanges, local ======================================== I1027 13:44:59.632668 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.11023. Iters per second: 163.66 I1027 13:45:24.894635 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.16308. Iters per second: 162.257 I1027 13:45:50.275280 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.17868. Iters per second: 161.847 I1027 13:46:15.637431 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. 
Milliseconds per iter: 6.18688. Iters per second: 161.632 I1027 13:46:40.670816 2176333 PyTorchPredictorBenchLib.cpp:249] PyTorch run finished. Milliseconds per iter: 6.10549. Iters per second: 163.787 I1027 13:46:40.670863 2176333 PyTorchPredictorBenchLib.cpp:264] Mean milliseconds per iter: 6.14887, standard deviation: 0.03843706 ``` Reviewed By: hlu1 Differential Revision: D31972722 fbshipit-source-id: 7a445b325a29020b31dd2bd61e4171ecc2793b15	2022-01-04 22:34:49 -08:00
Mike Iovine	6a84449290	[SR] Fast path for VarStack on scalars (#70210 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70210 Add a fast-path for `VarStack` nodes for when the inputs are scalars. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarStack` Reviewed By: hlu1 Differential Revision: D33177498 fbshipit-source-id: 922ab76a6808fbfdb8eb6091163a380344e38de6	2021-12-23 10:31:17 -08:00
Raghavan Raman	633f770c3c	[StaticRuntime] Add out-variant support for TensorExprDynamicGroup op (#69479 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69479 This diff adds support for out-variant optimization for `TensorExprDynamicGroup` op, which will be used for TensorExpr based fusion in Static Runtime. ghstack-source-id: 146107008 Test Plan: ``` buck run mode/opt //caffe2/caffe2/fb/predictor:pytorch_predictor_test ``` Completed accuracy test on inline_cvr model 294738512 v0. Results: ``` get 1012 prediction values get 1012 prediction values pyper_inference_e2e_local_replayer_test.out.132ea03c2 pyper_inference_e2e_local_replayer_test.out.1858bbeb0 max_error: 0 % total: 0 ``` Reviewed By: d1jang, mikeiovine Differential Revision: D32768463 fbshipit-source-id: a3e6c1ea9ff5f3b57eb89095aa79a6d426fbb52a	2021-12-22 00:30:22 -08:00
Mike Iovine	65f54bc000	[SR] Optimize VarStack (#68750 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68750 There was some room for optimization in static runtime's `prim::VarStack`: * Avoid refcount bumps - constructing the `std::vector<at::Tensor>` can be avoided by writing a custom version of `stack_out` that takes a `std::vector<at::Tensor>` * Skip the memory overlap check * Avoid device dispatcher overhead in a few places (e.g. `tensor.unsqueeze -> at::native::unsqueeze`) Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack` Reviewed By: swolchok Differential Revision: D32596934 fbshipit-source-id: e8f0ccea37c48924cb4fccbfdac4e1e11da95ee0	2021-12-20 11:46:11 -08:00
Scott Wolchok	66406ee0f7	[PyTorch][Static Runtime] Fix to() w/dtype bool (#69935 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69935 Didn't realize that `AT_DISPATCH_ALL_TYPES` should really be called `AT_DISPATCH_MOST_TYPES`. ghstack-source-id: 145661358 Test Plan: Added test for dtype bool. Ran CMF local_ro net: before: ``` I1215 12:33:49.300174 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.966491. Iters per second: 1034.67 I1215 12:33:49.825570 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.94867. Iters per second: 1054.11 I1215 12:33:50.349246 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947926. Iters per second: 1054.93 I1215 12:33:50.870433 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.943779. Iters per second: 1059.57 I1215 12:33:51.393702 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.947185. Iters per second: 1055.76 I1215 12:33:51.915666 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.945672. Iters per second: 1057.45 I1215 12:33:52.438475 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948407. Iters per second: 1054.4 I1215 12:33:52.965337 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95472. Iters per second: 1047.43 I1215 12:33:53.494563 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.967083. Iters per second: 1034.04 I1215 12:33:54.017879 1606538 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.948945. 
Iters per second: 1053.8 I1215 12:33:54.017930 1606538 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.951888, standard deviation: 0.0083367 ``` after: ``` I1215 12:32:35.820874 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.999845. Iters per second: 1000.15 I1215 12:32:36.343147 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944363. Iters per second: 1058.91 I1215 12:32:36.863806 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.942542. Iters per second: 1060.96 I1215 12:32:37.385459 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.944677. Iters per second: 1058.56 I1215 12:32:37.905436 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941135. Iters per second: 1062.55 I1215 12:32:38.424907 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.939748. Iters per second: 1064.11 I1215 12:32:38.944643 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.941764. Iters per second: 1061.84 I1215 12:32:39.463791 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.938946. Iters per second: 1065.02 I1215 12:32:39.987567 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.95437. Iters per second: 1047.81 I1215 12:32:40.511204 1594955 PyTorchPredictorBenchLib.cpp:279] PyTorch run finished. Milliseconds per iter: 0.959139. Iters per second: 1042.6 I1215 12:32:40.511242 1594955 PyTorchPredictorBenchLib.cpp:290] Mean milliseconds per iter: 0.950653, standard deviation: 0.0184761 ``` Reviewed By: hlu1 Differential Revision: D33106675 fbshipit-source-id: 5bb581f8d0ed22ef08df1936dc8d67045e44e862	2021-12-15 15:26:56 -08:00
Mike Iovine	102684b252	[SR] Fix stack/concat bug (#68777 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68777 Fixed some cases where negative dimensions were not handled correctly * `_stack_cpu` calls `maybe_wrap_dim`, but `_stack_cpu_out` does not. This is only problematic when `_stack_cpu_out` forwards to the serial kernel: [ref](https://www.internalfb.com/code/fbsource/[1b5af978b48f2e5d308d42b588bde3275869a57b]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=1541-1547). * concat also needs to wrap its dim Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Added new tests to cover this case Reviewed By: hlu1 Differential Revision: D32604623 fbshipit-source-id: 00aaa42817cd2d3e7606ce75ab5a9744645118cf	2021-12-14 16:26:27 -08:00
Don Jang	c97dc9286d	Revert D32780415: [Static Runtime] Move implementation details from impl.h into internal.h Test Plan: revert-hammer Differential Revision: D32780415 (`999e93e6a8`) Original commit changeset: 119b7aedbf56 fbshipit-source-id: 1aa777e8c1854ab27e86bc625188f7170097fac8	2021-12-04 19:44:07 -08:00
Don Jang	999e93e6a8	[Static Runtime] Move implementation details from impl.h into internal.h (#69274 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69274 `impl.h` is the main header file that defines the interface of Static Runtime to its clients. However, it is currently filled with implementation details that should not be leaked to our clients. This has two drawbacks: 1) it unnecessarily leaks our internals to our clients, which can make them hard to change later, and 2) it causes unnecessary merge conflicts when multiple people touch this enormous impl.cpp file. To alleviate the situation, this change moves the implementation details from impl.h into a new file, internal.h, which is kept internal so that the details are not leaked to our clients. This change will be followed by another change to rename `impl.h` to `runtime.h` or something better, since `impl.h` is currently not about implementation but about SR's interface. Note that this change is NOT complete since the remaining declarations in impl.h still contain a lot of implementation details. Therefore, we should keep working on minimizing the interface to prevent our API from being bloated unnecessarily. We also need to work on modularizing our implementations into separate pieces organized in separate files in the near future. Test Plan: Existing unittests Reviewed By: donaldong Differential Revision: D32780415 fbshipit-source-id: 119b7aedbf563b195641c5674572a9348732145f	2021-12-04 14:48:28 -08:00
Scott Wolchok	14ed4df305	[PyTorch][Static Runtime][easy] give to_copy_functor a name (#67701 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67701 I split this out to ease rebasing and review. ghstack-source-id: 144507288 Test Plan: CI Reviewed By: hlu1 Differential Revision: D32112523 fbshipit-source-id: dba14e6ada33df02dbcd7025b090a8a18cf438ae	2021-12-02 10:36:26 -08:00
Don Jang	cd3e37cbe4	[Static Runtime] [Code Cleanup] Reduce indentation depth in ops.cpp (#69028 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69028 This change converts ``` if (..) { ... } else { ... } # end of function ``` into ``` if(...) { ... return; } ... ``` in ops.cpp to remove the else branch to reduce the indentation depth by 1 for better readability. Test Plan: N/A Reviewed By: hlu1 Differential Revision: D32506235 fbshipit-source-id: a4fd5188bd680dba5dcad2b6e873735a54497664	2021-11-30 09:41:46 -08:00
Ansha Yu	7342b654a1	[static runtime] dequantize out variant (#68664 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68664 Reland D32187063 (`f120335643`), fixing lint Add out variant for aten::dequantize Test Plan: Test on inline_cvr model ``` MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/294738512/294738512_0.predictor.disagg.local --recordio_inputs=/data/users/ansha/tmp/adfinder/294738512/294738512_0_local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=1 --iters=5 --warmup_iters=5 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1 ``` Before: 0.047472 ms. 0.409729%. aten::dequantize (9 nodes) After 0.0307179 ms. 0.267204%. static_runtime::dequantize_copy (9 nodes, out variant) Test on ctr_mbl_feed model 307210374 on 696 inputs Before: 0.0569016 ms. 0.296647%. aten::dequantize (10 nodes) After: 0.0423128 ms. 0.220481%. static_runtime::dequantize_copy (10 nodes, out variant) Reviewed By: mikeiovine Differential Revision: D32566429 fbshipit-source-id: b95dfc4c5e4115e083794093bc1571c7b1d72f5b	2021-11-30 09:03:26 -08:00
Alban Desmaison	748d9d2494	Revert D32187063: [static runtime] dequantize out variant Test Plan: revert-hammer Differential Revision: D32187063 (`f120335643`) Original commit changeset: 1fec6b74c7d3 fbshipit-source-id: 9770f8379e9ddba9e537fef0e66cc93c2caaf860	2021-11-18 10:12:31 -08:00
Ansha Yu	f120335643	[static runtime] dequantize out variant (#67873 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67873 Add out variant for aten::dequantize Test Plan: Test on inline_cvr model ``` MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/294738512/294738512_0.predictor.disagg.local --recordio_inputs=/data/users/ansha/tmp/adfinder/294738512/294738512_0_local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=1 --iters=5 --warmup_iters=5 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1 ``` Before: 0.047472 ms. 0.409729%. aten::dequantize (9 nodes) After 0.0307179 ms. 0.267204%. static_runtime::dequantize_copy (9 nodes, out variant) Reviewed By: hlu1 Differential Revision: D32187063 fbshipit-source-id: 1fec6b74c7d3f25d0f445775c4558d30c55dcece	2021-11-18 09:31:27 -08:00
Scott Wolchok	6acde23bec	[PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934 This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode. ghstack-source-id: 143429113 Test Plan: Patched d1jang's diff to measure memory turnover around SR startup. Previous diff, CMF local: ``` I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120 ``` This diff, CMF local: ``` I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208 72912 bytes (17%) savings ``` Perf looks neutral; see next diff (D32216573) test plan for details. Reviewed By: hlu1 Differential Revision: D32190751 fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc	2021-11-16 10:19:50 -08:00
David Berard	b546cdf401	[SR] Out variant for prim::NumToTensor (#67856 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67856 Returns a tensor constructed from scalar input Test Plan: ``` buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest ``` Ran ``` buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=NumToTensorScalar --v=1 ``` and the output contains `Switch to out variant for node: %2 : Tensor = prim::NumToTensor(%0)`. Reviewed By: mikeiovine Differential Revision: D32014194 fbshipit-source-id: e7df65ea1bf05d59c1fc99b721aee420e484f542	2021-11-08 09:02:58 -08:00
Bin Wen	1baed45c6b	[fbcode][static runtime] out-variant for quantized::linear_dynamic_fp16 (#67663) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67663 Mostly follows the example of quantized::linear (D28428734 (`4d7abdbdad`)) to enable an out variant for quantized::linear_dynamic_fp16. The reason: during the MP tab ctr PyTorch model migration, we observed that the quantized::linear_dynamic_fp16 operator has the highest cost but does not have an out variant enabled yet https://fburl.com/phabricator/b5juus2d Test Plan: buck build mode/opt caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench sudo watch -n 20 /usr/local/fbprojects/dynamoserver/bin/turboDriver disable MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- --scripted_model=/home/bwen/models/991103061_4/991103061_4.predictor --pt_inputs=/home/bwen/models/991103061_4/pt_inputs --method_name=forward --pt_cleanup_activations=1 --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=1000 --warmup_iters=1000 --num_threads=1 --repetitions=3 --do_profile=1 --do_benchmark=1 --set_compatibility=1 --compare_results=1 --pt_enable_static_runtime 2>&1 | pastry before: P465201159 0.929067 ms. 31.808%. quantized::linear_dynamic_fp16 (16 nodes) 0.921679 ms. 31.7324%. quantized::linear_dynamic_fp16 (16 nodes) 0.919127 ms. 31.7404%. quantized::linear_dynamic_fp16 (16 nodes) after: P465203015 0.90898 ms. 31.0205%. quantized::linear_dynamic_fp16 (16 nodes, out variant) 0.9127 ms. 30.62%. quantized::linear_dynamic_fp16 (16 nodes, out variant) 0.879148 ms. 31.0161%. quantized::linear_dynamic_fp16 (16 nodes, out variant) The unit test logic follows https://fburl.com/code/vv0rry13 buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest Reviewed By: hlu1 Differential Revision: D32001168 fbshipit-source-id: 873d9f77434b9c4bafb298c871173f9a560dd2a3	2021-11-03 22:39:04 -07:00
Scott Wolchok	7cd62621fb	[PyTorch] Adopt faster Tuple::create (#65381 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65381 The previous diff adds a way to make Tuples of size 3 or less more efficiently. This diff makes it easier to hit that path and updates a bunch of callsites to hit it. ghstack-source-id: 142065832 Test Plan: CI Reviewed By: ezyang Differential Revision: D31069538 fbshipit-source-id: d04da3709594ed68ab1c0a1471f8cffd8d001628	2021-11-02 10:10:31 -07:00
Mike Iovine	7da9c4ed2e	[SR] NNC out variant for aten::where (#67255 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67255 Add an out variant for `aten::where`. Since this op can be implemented quite trivially in NNC with `ifThenElse`, I added an NNC kernel as well. Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: navahgar Differential Revision: D31923886 fbshipit-source-id: b4379ee3aaf31a000e626b4caeafd3e3f3d60837	2021-10-28 06:48:22 -07:00
Hao Lu	9ebc6357b3	[SR] Vectorize int version of fmod (#67313 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67313 Reviewed By: swolchok Differential Revision: D31889868 fbshipit-source-id: a0af399431a0d672fa56cf2f2ba6d548c47bcedd	2021-10-27 17:02:53 -07:00
Mike Iovine	d5f64afc38	[Static Runtime] Support aten::to.prim_dtype overload (#64928) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64928 Added support for this overload of `aten::to`: ``` aten::to.prim_dtype(Tensor(a) self, int? dtype, bool non_blocking=False, bool copy=False) -> Tensor(a|b) ``` Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_to` Reviewed By: hlu1 Differential Revision: D30901398 fbshipit-source-id: 38ce807c30185e92dd472b404b362f22ac7e4efb	2021-10-07 10:22:44 -07:00
Scott Wolchok	ffede499b2	[PyTorch][Static Runtime] Fast path for contiguous to_copy (#65499) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499 When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc. ghstack-source-id: 139549027 Test Plan: Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options). Before: I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252 I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871 I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916 I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093 I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971 mean: 7.144172, stddev: 0.1283 After: I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122 I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579 I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033 I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833 I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85 mean: 6.800632, stddev: 0.013227 Looks like about a 5.3% improvement. Reviewed By: hlu1 Differential Revision: D31125492 fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823	2021-10-01 12:13:33 -07:00

211 Commits