Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825
The comment explains how it works.
Test Plan:
There is a small regression on local and local_ro if we only enable it for fallback ops.
```
## local_ro
# before
I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247
I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319
I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759
I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619
# after
# # only enable for fallback ops: 0.7%
I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481
I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262
I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51
I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424
# enable for all (no clone): 4.7%
I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005
I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14
I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. Iters per second: 781.506
I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765
# local
# before
I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233
I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641
I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735
I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206
# after
# # only enable for fallback ops: 0.1%
I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349
I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601
I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659
I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658
# enable for all (no clone): 0.9%
I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783
I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611
I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. Iters per second: 50.3037
I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498
```
Reviewed By: d1jang
Differential Revision: D32124812
fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67911
If we can remove `self` from the graph inputs, there is no need for `StaticModule` to hold onto its `Module` reference anymore.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32190755
fbshipit-source-id: 9c4649a63b6e68c7d2e47395a23572985d2babb1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437
Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this.
Test Plan: `buck test caffe2/test:jit -- test_freezing`
Reviewed By: hlu1
Differential Revision: D31980843
fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66130
We're reusing backing storage for these tensors, which is only safe because they have non-overlapping lifetimes. Accordingly, it seems that they can also share their StorageImpl.
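The lifetime-based sharing above can be sketched with a toy arena; `Arena` and `Slot` are illustrative stand-ins, not the real `StorageImpl`/Static Runtime types:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a StorageImpl: a pointer into a preallocated arena.
struct Slot {
  void* data = nullptr;
  size_t nbytes = 0;
};

struct Arena {
  std::vector<char> buf;
  explicit Arena(size_t n) : buf(n) {}
  // Two values with non-overlapping lifetimes may be handed the same slot:
  // the second user simply overwrites bytes the first no longer needs.
  Slot assign(size_t offset, size_t nbytes) {
    return Slot{buf.data() + offset, nbytes};
  }
};

bool same_backing(const Slot& a, const Slot& b) { return a.data == b.data; }
```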
ghstack-source-id: 142427752
Test Plan:
benchmarked ctr_mobile_feed local and local_ro:
Using recordio inputs for model 302008423_0
```
swolchok@devbig032 ~/f/fbcode> sudo ~/fbsource2/fbcode/scripts/bertrand/noise/denoise-env.sh \
/tmp/ptvsc2_predictor_benchNov1ArenaAllocateStorageImpls \
--scripted_model=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.predictor.disagg.local \
--method_name=local.forward --pt_cleanup_activations=1 \
--pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=2 --warmup_iters=2 \
--num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 --repetitions=5 --recordio_use_ivalue_format=1 --recordio_inputs=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.local.inputs.recordio
Stable
========================================
I1101 14:19:16.473964 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0131. Iters per second: 49.9673
I1101 14:20:12.193130 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0155. Iters per second: 49.9612
I1101 14:21:07.761898 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9751. Iters per second: 50.0624
I1101 14:22:03.218066 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9104. Iters per second: 50.2249
I1101 14:22:58.723256 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.956. Iters per second: 50.1102
I1101 14:22:58.723306 2748837 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.974, standard deviation: 0.043643
ArenaAllocateStorageImpls
========================================
I1101 14:08:57.070914 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9771. Iters per second: 50.0572
I1101 14:09:52.605121 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.924. Iters per second: 50.1907
I1101 14:10:48.098287 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9353. Iters per second: 50.1624
I1101 14:11:43.645395 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9723. Iters per second: 50.0694
I1101 14:12:39.171636 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9673. Iters per second: 50.0819
I1101 14:12:39.171685 2695478 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.9552, standard deviation: 0.0239318
difference: 0.0188 ms/iter (0.09%), which is less than 1 standard deviation
Stable, local_ro
========================================
I1101 14:26:10.796161 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25991. Iters per second: 793.708
I1101 14:26:12.194727 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26862. Iters per second: 788.26
I1101 14:26:13.591312 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26549. Iters per second: 790.207
I1101 14:26:14.982439 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25943. Iters per second: 794.01
I1101 14:26:16.377033 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25995. Iters per second: 793.68
I1101 14:26:16.377094 2787930 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.26268, standard deviation: 0.00414788
ArenaAllocateStorageImpls, local_ro
========================================
I1101 14:26:45.875073 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20987. Iters per second: 826.536
I1101 14:26:47.207271 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20827. Iters per second: 827.633
I1101 14:26:48.533766 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20023. Iters per second: 833.174
I1101 14:26:49.850610 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19206. Iters per second: 838.884
I1101 14:26:51.172356 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19958. Iters per second: 833.622
I1101 14:26:51.172411 2790009 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.202, standard deviation: 0.00722754
Difference: 0.06 ms/iter (4.8%), which is much more than 1 standard deviation
```
we can see that this is a large relative improvement on local_ro, but no effect on local.
Reviewed By: hlu1
Differential Revision: D31357486
fbshipit-source-id: 229c003677da76e89c659d0e0639002accced76e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67166
This optimization is not really the same thing as `FuseListUnpack`, and mixing the logic in that pass is confusing and error-prone. It should really be its own pass.
It's slower since we have to do another pass over the graph, but this is not perf critical code; readability is more important.
Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D31887458
fbshipit-source-id: 289e281d512435861fccfe19f017751ad015688c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66637
By separating the DCHECKs, we can give more information when `verify_no_memory_overlap` fails.
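The idea of splitting a compound check so a failure pinpoints the offending condition can be illustrated generically (the `diagnose` helper is hypothetical, not the actual SR code):

```cpp
#include <cassert>
#include <string>

// Before: one compound check -- on failure you only learn "something overlapped":
//   DCHECK(!overlaps_inputs && !overlaps_outputs);
// After: separate checks, each with its own message, pinpoint the culprit.
std::string diagnose(bool overlaps_inputs, bool overlaps_outputs) {
  if (overlaps_inputs) return "output aliases an input";
  if (overlaps_outputs) return "outputs alias each other";
  return "ok";
}
```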
ghstack-source-id: 142226105
Test Plan: fitsships
Reviewed By: d1jang
Differential Revision: D31517151
fbshipit-source-id: 8cbc324c27f6b4db4489d1bd469d37b1d8ae6ce1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65598
This change adds `PyTorchPredictor::predict_managed_result` to enable Static Runtime to return managed output tensors, allocated and owned by Static Runtime to accelerate inference workloads.
- `PyTorchPredictor::predict_managed_result` only does meaningful work for the overriding `PyTorchStaticRuntimePredictor::predict_managed_result`. For other subclasses, it returns a simple object that just wraps the returned `IValue`.
- When `manage_output_tensors` is enabled, a `StaticRuntime` cannot be reentered until its return value gets deallocated by calling `StaticRuntime::deallocateOutputTensors`. Currently an instance of `StaticRuntime` gets immediately pushed back to `static_runtime_pool` to be reentered again, and this cannot be done when `manage_output_tensors` is enabled. `PyTorchStaticRuntimePredictorManagedResult` therefore delays pushing a `StaticRuntime` instance back to the pool until `StaticRuntime::deallocateOutputTensors` has been called on the runtime instance.
- When `manage_output_tensors` is enabled, `PyTorchStaticRuntimePredictor::predict_managed_result` returns the prediction result, whose backing memory is managed by an instance of `StaticRuntime`. The lifetime of any value reachable from `PyTorchStaticRuntimePredictorManagedResult.get()` is expected to end before `PyTorchStaticRuntimePredictorManagedResult` gets destructed. As explained above, `PyTorchPredictorManagedResult`'s destruction pushes the runtime instance that returned the result back to `static_runtime_pool` to be reused again.
- The current API design of adding `predict_managed_result` instead of forcing `operator()` to return `PyTorchPredictorManagedResult` was motivated by the fact that `manage_output_tensors` will be selectively enabled just for a few models. In case `manage_output_tensors` becomes a commonly used feature we should revisit this API design to merge them together.
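The delayed pool-return contract described in the bullets above can be sketched with toy types (`Runtime`, `Pool`, and `ManagedResult` are illustrative stand-ins, not the real predictor classes):

```cpp
#include <cassert>
#include <deque>
#include <memory>

struct Runtime {
  bool outputs_live = false;
};

struct Pool {
  std::deque<std::unique_ptr<Runtime>> free;
  std::unique_ptr<Runtime> acquire() {
    auto r = std::move(free.front());
    free.pop_front();
    return r;
  }
  void release(std::unique_ptr<Runtime> r) { free.push_back(std::move(r)); }
};

// The result owns the runtime; only its destructor (run after the client is
// done with the managed output tensors) pushes the runtime back to the pool.
class ManagedResult {
 public:
  ManagedResult(std::unique_ptr<Runtime> rt, Pool& pool)
      : rt_(std::move(rt)), pool_(pool) {
    rt_->outputs_live = true;
  }
  ~ManagedResult() {
    rt_->outputs_live = false;  // analogue of deallocateOutputTensors()
    pool_.release(std::move(rt_));
  }

 private:
  std::unique_ptr<Runtime> rt_;
  Pool& pool_;
};
```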
Reviewed By: hlu1
Differential Revision: D31149323
fbshipit-source-id: 5ca026188077232d6a49a46759124a978439d7b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67164
Migrated both the variadic and non-variadic versions.
This diff is part of the effort to migrate all ops used in `FuseListUnpack` to `FuseListUnpackV2`. The original version of `FuseListUnpack` is problematic for a few reasons:
* You have to complicate the op implementation with an `is_fused` check, resulting in messier code. It is easier to reason about two ops, fused (out variant) and unfused (native).
* The original version of `FuseListUnpack` is buggy. It assumes that the `ListUnpack` node occurs immediately after the fusion candidate, which is not necessarily true.
This diff finishes the migration, so the original version of `FuseListUnpack` is removed.
Test Plan:
Unit tests: `buck test caffe2/benchmarks/static_runtime/...`
**Accuracy Test**
Done at the top of this diff stack.
Reviewed By: hlu1
Differential Revision: D31887386
fbshipit-source-id: 9d44c813667a75bce13dce62bf98e6109edea6ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65381
The previous diff adds a way to create Tuples of size 3 or less
more efficiently. This diff makes it easier to hit that path and
updates a bunch of call sites to hit it.
ghstack-source-id: 142065832
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D31069538
fbshipit-source-id: d04da3709594ed68ab1c0a1471f8cffd8d001628
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67550
`aten::__is__` and `aten::__isnot__` are extremely problematic for a large number of SR graph optimizations.
Some examples:
- Removing ops that are no-ops in the forward pass like `aten::detach`. This would normally be trivial, but `is` introduces corner cases like this:
```
def forward(x):
y = x.detach()
return x is y
```
We get `False` before optimizations. But after optimizations, the test becomes `x is x`, and we get `True`.
- `ReplaceWithCopy`: the pass that replaces ops like `aten::to` with an out variant that copies its input. The following graph returns `True` before optimizations, but `False` afterwards
```
def forward(x):
y = x.to(x.dtype)
return x is y
```
- And many more; `FuseListUnpack` can break too.
Since 99.99% of users never use these ops, rejecting them outright so we don't have to reason about these corner cases is not a big deal.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D32022584
fbshipit-source-id: d135938edb2299c9b8f9511afac2bf568578879e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67530
Currently `ptvsc2_predictor_bench` only uses the first input of a given recordio file, even when the recordio file contains many inputs.
This change extends `StaticRuntime::benchmark` to accept multiple input entries so that we can benchmark more extensively and realistically using all the inputs in the recordio file.
Test Plan:
Tested `ptvsc2_predictor_bench` with / without this change executing the following command:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/home/djang/ads/adfinder/ctr_mobilefeed/302008423/302008423_0.predictor.disagg.local --recordio_inputs=/home/djang/ads/adfinder/ctr_mobilefeed/302008423/302008423.local.inputs.recordio --pt_enable_static_runtime=1 --compare_results=0 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=1 --method_name=local.forward --set_compatibility --do_benchmark=1 --recordio_use_ivalue_format=1
```
Reviewed By: hlu1
Differential Revision: D31947382
fbshipit-source-id: 4188271613aad201f8cad5f566e0dfed26680968
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67436
This information is useful for comparing static runtime to c2.
Reviewed By: d1jang
Differential Revision: D31991571
fbshipit-source-id: eb83bc4564b05d56fb9a550863eea3f6312f3f6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66289
Add a variadic version of `grouped_accessor_op` to eliminate list construction overhead and associated refcount bumps in static runtime.
Test Plan:
Accuracy test with model 294738512_40: passes with 0 errors.
Accuracy test with model 296213501_65 (has V2 op): passes with 0 errors.
**Perf impact**
TW replayer test w/ 800 QPS (stacked with D31620408) shows ~5% CPU decrease for storage tier.
Results:
{F673610665}
Reviewed By: hlu1
Differential Revision: D31482816
fbshipit-source-id: 14393da122cefd094c3e4f423beb897c1d17b32c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66996
We do this conversion a few times, and further diffs (which I'm trying to keep as small as possible) will do it more.
ghstack-source-id: 141496817
Test Plan: CI
Reviewed By: mikeiovine
Differential Revision: D31821037
fbshipit-source-id: 1d3b54cadaedd53189aec6a35ed1a126c6fe4824
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67162
It's a bit annoying/ugly to type `c10::Symbol::fromQualString` everywhere, and we can't do `using c10::Symbol::fromQualString` since it's a static class function.
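One workaround is a thin free-function wrapper, since `using` can't name a static member function (the toy `Symbol` below is illustrative, not the real `c10::Symbol`):

```cpp
#include <cassert>
#include <string>

struct Symbol {
  std::string qual;
  static Symbol fromQualString(const std::string& s) { return Symbol{s}; }
};

// A free function delegating to the static member gives the same brevity
// that `using c10::Symbol::fromQualString` would, if it were legal.
inline Symbol fromQualString(const std::string& s) {
  return Symbol::fromQualString(s);
}
```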
Test Plan: CI
Reviewed By: d1jang
Differential Revision: D31887042
fbshipit-source-id: 073a56c72281c20284a9feef741aed96b58a921d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67206
The memory overlap check is still performed for alias ops; it is only skipped for in-place ops. This needs to be fixed if we want to use the memory overlap check in prod.
This diff only adds more debug info. It doesn't fix the aforementioned problem.
Reviewed By: d1jang
Differential Revision: D31889866
fbshipit-source-id: 05a80ace3d404f66f21a8bbdc9678485ff76c8d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001
The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.
This new overload will probably take a forwarding reference, so to avoid problems with overloading on forwarding reference and simplify the interface, it's best to remove this unused one.
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
`buck test caffe2/test:static_runtime`
Reviewed By: hlu1
Differential Revision: D31821990
fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67125
Using explicit template instantiations in D31659973 (f2582a59d0) was a bad idea. The problem is that the lvalue instantiation was for a `const` vector of `IValue`, meaning that if you tried to pass SR a non-const vector of arguments, the linker would fail to find the symbol.
The reason we didn't catch this in D31659973 (f2582a59d0) was because predictor always passes a `const` reference anyways. But we should fix this to prevent unexpected problems in the future.
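A minimal illustration of why the `const` instantiation didn't cover the non-const case: with a forwarding reference, the deduced specialization differs by const-ness, so an explicit instantiation for `const std::vector<int>&` leaves the non-const lvalue case as a separate, missing specialization (toy example, not the SR code):

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Forwarding-reference deduction: a const lvalue deduces
// T = const std::vector<int>&, a non-const lvalue deduces
// T = std::vector<int>& -- two distinct specializations.
template <class T>
bool deduces_const_ref(T&&) {
  return std::is_same_v<T, const std::vector<int>&>;
}
```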
Test Plan: `buck test caffe2/benchmarks/static_runtime/...`
Reviewed By: hlu1
Differential Revision: D31873406
fbshipit-source-id: 5ab5a03334bed925cec11facadcedf9bec9b90ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648
Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:
1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```
The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
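A minimal sketch of the forwarding scheme, with a toy `Runtime` standing in for the real classes (names are illustrative):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct Runtime {
  std::vector<int> inputs;
  std::string last_overload;
  void set_inputs(const std::vector<int>& args) {  // copies the inputs
    inputs = args;
    last_overload = "copy";
  }
  void set_inputs(std::vector<int>&& args) {  // steals the inputs
    inputs = std::move(args);
    last_overload = "move";
  }
  template <class IValueList>
  void operator()(IValueList&& args) {
    // Perfect forwarding preserves the value category of the caller's
    // argument, so the right set_inputs overload is selected.
    set_inputs(std::forward<IValueList>(args));
  }
};
```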
Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.
Reviewed By: hlu1
Differential Revision: D31659973
fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66288
This change makes it so `UseVariadicOp` can transform ops with many Tensor list inputs.
Input pattern:
```
%output : Type = op(%list_1, %arg_1, %list_2, %list_3)
```
Output pattern:
```
%output : Type = variadic_op(%list_11, ..., %list_1N, %arg_1, %list_21, ..., %list_2M, %list_31, ..., %list_3K, N, M, K)
```
The length of each list is passed at the end of the variadic op so that the op implementation can process the inputs appropriately. This also frees us from needing to update `hasVarArgs` in static runtime each time we add a variadic op.
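The argument layout can be sketched with plain ints standing in for Tensor lists (the helper below is hypothetical, not the actual pass):

```cpp
#include <cassert>
#include <vector>

// Flatten several "Tensor lists" into one variadic argument vector,
// appending each list's length at the end (N, M, K, ...) so the op
// implementation can re-segment the inputs.
std::vector<int> make_variadic_args(const std::vector<std::vector<int>>& lists) {
  std::vector<int> args;
  for (const auto& l : lists) args.insert(args.end(), l.begin(), l.end());
  for (const auto& l : lists) args.push_back(static_cast<int>(l.size()));
  return args;
}
```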
This diff also makes `UseVariadicOp` more robust. Before, `list_idx` was passed as an argument. Now, `VariadicUpdater` determines `list_idx` from the node's schema.
Test Plan:
Existing variadic ops do not break:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31450811
fbshipit-source-id: 808fcc3ae8940b9e602586f38f8cf9154c9a6462
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66974
`D31591785 (67e003f09b)` started carrying a function object to be executed and `FunctionKind` for the type of the function *separately*, and this caused a bug fixed by D31783028 (79803b199f).
This change bundles them back together, as swolchok originally did, to reduce the chance of such a mistake in the future. They must always be carried together, since `FunctionKind` identifies the type of the function object.
Note that `struct Function` is a POD type, so accessing its fields (`first`, `second`) shouldn't cause any extra overhead in `ProcessedNode::run()`.
Test Plan:
Confirmed that the managed memory metrics remain the same before/after this diff on inline_cvr:
```
#AFTER
# inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
# inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)
# inline_cvr/remote_ro
First iter time: 12.0344 ms
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```
```
#BEFORE
# inline_cvr/local
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
#inline_cvr/local_ro
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2679
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1939 (99.4327%)
#inline_cvr_remote_ro
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
```
Reviewed By: mikeiovine
Differential Revision: D31798419
fbshipit-source-id: fd4301b6731e402be0820729654735c791511aba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66509
Like `FuseListUnpack`, but instead of adding arguments to the fused node's outputs, inserts a new fused op.
By using a new fused op, we can avoid runtime `is_fused` checks. This will make the op implementations significantly cleaner. Eventually, we will migrate all ops to `V2` and delete to old pass.
`FuseListUnpackV2` also fixes the bug described in T103159043.
Test Plan: I've made some changes to D31550307 locally and verified that everything works.
Reviewed By: hlu1
Differential Revision: D31492017
fbshipit-source-id: 4f90fcbc17e4c70a3d65985bee836fabf868a22c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66692
Currently `ProcessedNode::run()` performs 2 dynamic dispatches to decide which function implementation to execute, depending on whether the function is an out variant, a native function, or an interpreter fallback. Note that this happens every time an operation is executed by Static Runtime.
This change makes *that* same decision once, at module loading time, so that we can remove 1 dynamic dispatch cost at runtime.
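The load-time binding idea can be sketched generically (toy `Kind` and functions, not the real `ProcessedNode` machinery):

```cpp
#include <cassert>
#include <functional>
#include <string>

enum class Kind { OutVariant, Native, Fallback };  // illustrative

std::string run_out(int) { return "out"; }
std::string run_native(int) { return "native"; }
std::string run_fallback(int) { return "fallback"; }

// Decide once, at "load" time. Each later run() is then a single indirect
// call with no per-iteration branch on Kind.
std::function<std::string(int)> bind_at_load(Kind k) {
  switch (k) {
    case Kind::OutVariant:
      return run_out;
    case Kind::Native:
      return run_native;
    default:
      return run_fallback;
  }
}
```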
**size reduction**
Saving 4 bytes per `ProcessedNode`.
- Before: sizeof(c10::variant<OutVariant, NativeFunction, Operation>):40
- After: sizeof(std::function<void(ProcessedNode*)>): 32 + sizeof(FunctionKind):4 = 36
**latency optimization**
Expected to remove 2 memory loads & 1 conditional jump per `ProcessedNode::run()` execution (needs to be confirmed from compiled binary code).
Ran `ptvsc2_predictor_bench` with `inline_cvr` with 1000 iterations:
- local : 7.56026 -> 7.24794
- local_ro: 1.5799 -> 1.55504
- remote_ro: 10.6464 -> 10.3017
Test Plan: Ran existing unittests
Reviewed By: swolchok
Differential Revision: D31591785
fbshipit-source-id: 5de83ca386af509381e08ecedf071ee4e9f0f0b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64066
I noticed a bunch of time being spent heap-allocating Tuples
in the unpickler. 1-, 2-, and 3-element Tuples are apparently common
enough that they get their own bytecode instructions, so I decided to
try also giving them their own representation. We store up to 3
IValues inline in `Tuple` rather than doing a second heap allocation
for a `std::vector<IValue>`.
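A toy version of the inline-storage idea (not the real `TupleElements` layout): small tuples live entirely inside the object, larger ones spill to a heap vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stores up to 3 ints inline; larger tuples fall back to a heap vector,
// so small tuples need no second heap allocation.
class SmallTuple {
 public:
  explicit SmallTuple(std::vector<int> v) {
    if (v.size() <= 3) {
      size_ = v.size();
      for (size_t i = 0; i < size_; ++i) inline_[i] = v[i];
    } else {
      heap_ = std::move(v);
      size_ = heap_.size();
    }
  }
  bool is_inline() const { return heap_.empty(); }
  int operator[](size_t i) const { return is_inline() ? inline_[i] : heap_[i]; }
  size_t size() const { return size_; }

 private:
  std::array<int, 3> inline_{};
  std::vector<int> heap_;
  size_t size_ = 0;
};
```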
ghstack-source-id: 140695395
Test Plan:
Added automated tests for TupleElements.
Pixel 3 before: https://www.internalfb.com/intern/aibench/details/761596366576284
Pixel 3 after: https://www.internalfb.com/intern/aibench/details/591414145082422
We went from 347 ms to 302 ms.
Reviewed By: dhruvbird
Differential Revision: D30592622
fbshipit-source-id: 93625c54c9dca5f765ef6d5c191944179cb281a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66606
- Remove dead code (see comment for where)
- Add debug prints
- Small reorganization of the code to improve readability
Reviewed By: d1jang
Differential Revision: D31568219
fbshipit-source-id: 50240c325bf4fd012e1947ac931bb67c6f5dfafb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429
The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
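A sketch of the pointer-plus-length representation the diff describes; on common 64-bit ABIs this drops `std::vector`'s capacity pointer (illustrative type, not the actual SR container):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// A fixed-size array: just a pointer and a length, no capacity field,
// since the size can never change after construction.
template <class T>
struct FixedArray {
  std::unique_ptr<T[]> data;
  size_t size = 0;
  explicit FixedArray(size_t n) : data(new T[n]()), size(n) {}
};
```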
ghstack-source-id: 140532722
Test Plan:
CI
I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.
Reviewed By: hlu1
Differential Revision: D31096042
fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515
This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:
- At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensor objects from the aforementioned candidates, excluding the direct output value of the graph to simplify `IValue` ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple).
- The 2nd+ iterations preallocate a memory slab and all output tensors identified during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them. This means that SR cannot be reentered until `deallocateOutputTensors` is called by the client.
- In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camelCase for function names, so all newly introduced functions use camelCase despite inconsistencies with snake_case. We can gradually fix the inconsistencies.
This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed.
Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.
- Enhanced `testStaticRuntime` to exercise each unit test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing test cases (e.g., `StaticRuntime.EmbeddingBag`).
Reviewed By: hlu1
Differential Revision: D31049221
fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65343
No reason not to save a bit on re-hashing.
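The saving comes from `std::unordered_map::reserve`, which guarantees no rehash until the element count exceeds the reserved size; a minimal sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// reserve(n) sizes the table up front so n elements can be inserted
// without any rehash, avoiding repeated re-hashing during bulk population.
std::unordered_map<int, int> build(size_t n) {
  std::unordered_map<int, int> m;
  m.reserve(n);
  for (size_t i = 0; i < n; ++i) m.emplace(static_cast<int>(i), 0);
  return m;
}
```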
ghstack-source-id: 140052518
Test Plan:
CI
Static runtime startup seems to go from 5.9-6.0s to 5.8s-6.0s, perf shows less time spent rehashing
Reviewed By: mikeiovine
Differential Revision: D31027362
fbshipit-source-id: 39dd53ecd462693b518535856ddd92df78a4977b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517
This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:
- input_aliases: values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are internally created and consumed within the graph.
`output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance.
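The grouping rules above can be sketched as a simple classification (toy types, not the real `ValueGroup`): input aliases win, then output aliases, and everything else is internal.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class Group { InputAlias, OutputAlias, Internal };

// Classify a value per the rules above: membership in input_aliases takes
// precedence, then output_aliases; the rest are internal to the graph.
Group classify(const std::string& v,
               const std::set<std::string>& input_aliases,
               const std::set<std::string>& output_aliases) {
  if (input_aliases.count(v)) return Group::InputAlias;
  if (output_aliases.count(v)) return Group::OutputAlias;
  return Group::Internal;
}
```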
Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.
Reviewed By: hlu1
Differential Revision: D30940955
fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66221
JIT doesn't have an implementation for this op, so we can only use it when out variants are enabled.
Reviewed By: hlu1
Differential Revision: D31445887
fbshipit-source-id: 4565ac4df751d8ee4052647574c43efa05ea1452
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65882
`torch::jit::Module` is refcounted. There is no need to wrap it in a `shared_ptr`.
Test Plan:
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: mikeiovine
Differential Revision: D31012222
fbshipit-source-id: 74d234bd85423e5ba0e396f24899631354a2c74b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65509
With this change, we can get dumps of the model graphs by setting the env variable `PYTORCH_JIT_LOG_LEVEL=">>impl"` while running the model.
Test Plan: buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: mikeiovine
Differential Revision: D31125797
fbshipit-source-id: d8979a4e138047518140e0eaecb46e012891b17c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387
Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.
Also, added a SR microbenchmark for this kernel which shows the performance improvement.
Without fusion:
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16 1953 ns 1953 ns 358746
BM_signed_log1p/64 2049 ns 2049 ns 342145
BM_signed_log1p/512 3291 ns 3291 ns 214342
BM_signed_log1p/4096 15559 ns 15559 ns 44420
BM_signed_log1p/32768 101936 ns 101935 ns 6843
BM_signed_log1p/65536 194792 ns 194789 ns 3615
```
With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16 369 ns 369 ns 1896179
BM_signed_log1p/64 497 ns 497 ns 1406995
BM_signed_log1p/512 1618 ns 1618 ns 430209
BM_signed_log1p/4096 11327 ns 11326 ns 61463
BM_signed_log1p/32768 84099 ns 84086 ns 8325
BM_signed_log1p/65536 166531 ns 166510 ns 4186
```
This shows a >15% performance improvement for this kernel with NNC fusion, even at the largest tensor sizes.
On the inline_cvr local model, there is a small improvement in profiled time spent on these ops:
without fusion: `0.9%` (computed by summing the % spent on all 4 ops involved)
with NNC fusion: `0.55%`
Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`
Also ran the accuracy test with inline_cvr, as described at https://fb.quip.com/qmdDAJzEmPtf, on the full-size model (285298536_1):
```
get 57220 prediction values
get 57220 prediction values
max_error: 0 total: 0
```
Reviewed By: hlu1
Differential Revision: D30609492
fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65175
More efficient use of the map API, and a more efficient way to insert all pairs of inputs/outputs into the liveness map.
ghstack-source-id: 138547815
Test Plan: Time to enable static runtime down from ~8.7s to ~8.4s
Reviewed By: mikeiovine
Differential Revision: D30983897
fbshipit-source-id: fa6000bfd0fa0adfcd7c5922199ee32ada8c430e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934
Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from:
```
%0, %1 = prim::TupleUnpack(%a)
%2, %3 = prim::TupleUnpack(%b)
```
into:
```
%0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b)
```
The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about.
Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack`
Reviewed By: d1jang
Differential Revision: D30872109
fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64992
This change lets Static Runtime print out number of managed tensors & unmanaged values as performance metrics during profile runs.
We will use and enhance these metrics to guide the effort of managing output tensors.
Test Plan:
Confirmed that a profile run prints out the added metric values on inline_cvr nets:
```
(inline_cvr/local)
...
Total number of managed tensors: 2754
Total number of unmanaged values: 3240
...
(inline_cvr/local_ro)
Total number of managed tensors: 1554
Total number of unmanaged values: 2966
...
(inline_cvr/remote_ro)
Total number of managed tensors: 1439
Total number of unmanaged values: 28
...
```
Reviewed By: hlu1
Differential Revision: D30926617
fbshipit-source-id: b86e071003ac941b9663db103eaa7c614466b4e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123
This change re-reverts D30883290 (0e11454d19). D30883290 (0e11454d19) broke the OSS build because it implicitly removed the default move constructor of `StaticRuntime`.
```
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57 return torch::jit::StaticRuntime(*smod);
Sep 15 15:39:57 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57 auto sr = getStaticRuntime();
Sep 15 15:39:57 ^ ~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57 ^
Sep 15 15:39:57 2 errors generated.
```
This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine).
Original Summary:
This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.
`MemoryPlanner` performs an independent sub-task: statically analyzing a graph, creating a memory plan, and allocating/deallocating managed Tensors.
This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.
Test Plan: - Confirm that OSS build went well (See External Tests section).
Reviewed By: mikeiovine
Differential Revision: D30983292
fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65114
There doesn't seem to be any reason to use std::set for sets of pointers, right?
ghstack-source-id: 138198504
Reviewed By: hlu1
Differential Revision: D30978450
fbshipit-source-id: 4599c6249fda3a89959f839d3bf6400c5891f82c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65011
This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.
`MemoryPlanner` performs an independent sub-task: statically analyzing a graph, creating a memory plan, and allocating/deallocating managed Tensors.
This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.
Test Plan: N/A
Reviewed By: mikeiovine
Differential Revision: D30883290
fbshipit-source-id: a37570f8d9430224a6987d2190bcf81cf875043d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013
This change enhances the current memory overlapping check to include outputs: the enhancement enforces the constraint that the outputs of a node must NOT overlap with each other, since the node writes all of its outputs at the same time.
This check will detect a problem like T97393697 immediately in debug mode.
Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`
- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.
Reviewed By: hlu1
Differential Revision: D30211705
fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64457
The first iteration is special since it initializes the memory planner. This change logs and reports first iteration time during benchmarking. It also generates a FAI-PEP output when `generate_ai_pep_output` is set.
Test Plan:
Run any benchmark, and observe:
```
I0902 15:19:32.528977 2492358 impl.cpp:948] PyTorchObserver {"value":6.415958881378174,"unit":"ms","metric":"latency","type":"static_runtime_first_iter"}
...
First iter time: 6.41596 ms
```
Note that this metric is likely to have significantly more noise than the others since we don't have as many data points.
Unit tests: `buck test //caffe2/test:static_runtime`
Reviewed By: d1jang
Differential Revision: D30740619
fbshipit-source-id: 4dcfccd5629f4fa34254fd355073ef19e151245a