pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Mike Iovine	a0495b3cdb	[SR] Remove unused operator() overload (#67001 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001 The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere. This new overload will probably take a forwarding reference, so to avoid problems with overloading on forwarding reference and simplify the interface, it's best to remove this unused one. Test Plan: `buck test caffe2/benchmarks/static_runtime/...` `buck test caffe2/test:static_runtime` Reviewed By: hlu1 Differential Revision: D31821990 fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a	2021-10-25 08:18:58 -07:00
Mike Iovine	f2582a59d0	[SR] Add rvalue overload for operator() (#66648 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648 Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes: 1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`. 2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`. Old: ``` operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs) ``` New: ``` template <class IValueList> operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs) ``` The implementations use perfect forwarding to invoke the correct overload of `set_inputs`. Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass. Reviewed By: hlu1 Differential Revision: D31659973 fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed	2021-10-22 10:51:47 -07:00
Aditya Pillai	40a8a50913	Add static_runtime::fused_equally_split (#2 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch-canary/pull/2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/66881 Adds `static_runtime::fused_equally_split` operator and removes `is_fused` logic from original operator. Modifies `FuseUnpackListV2` to map `fb::equally_split` to this new operator. Test Plan: ``` adityapillai@5960 /data/sandcastle/boxes/fbsource/fbcode 1m 13s ❯ buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators ``` and sandcastle strange_what_could_go_wrong Reviewed By: mikeiovine Differential Revision: D31742293 fbshipit-source-id: 60b35589c8817719b005d49811f575b6590d1c39	2021-10-22 10:26:49 -07:00
Don Jang	18bbc4c2b7	[Static Runtime] Fix a bug in aten::index (#66940) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66940 `aten::index`'s schema is as follows: ``` aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor ``` The current implementation assumes that all elements of `indices` are tensors by calling `elem.toTensor`, which is incorrect. This change creates an empty optional value if an element of `indices` is not a tensor. Test Plan: Fixed `StaticRuntime, IndividualOps_Index` to correctly test `aten::index` with `indices` that contains `None`. Reviewed By: hlu1 Differential Revision: D31712145 fbshipit-source-id: be1c29674bcd55b67b0dcc2a988bc37fd43745f3	2021-10-20 15:51:21 -07:00
Hao Lu	6634570aef	[SR] Fix bug in ValueGroup (#66470 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66470 Reviewed By: d1jang Differential Revision: D31566348 fbshipit-source-id: e0f634af77d893bbc8d66f214b2b8bdd6ab58cc3	2021-10-13 19:26:38 -07:00
Scott Wolchok	d30397d42a	[PyTorch][Static Runtime] Don't use vector in ProcessedNode (#65429 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429 The sizes of these arrays can't change, so there's no need to waste an extra pointer on them. ghstack-source-id: 140532722 Test Plan: CI I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%. Reviewed By: hlu1 Differential Revision: D31096042 fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395	2021-10-13 19:13:20 -07:00
Mike Iovine	37db650c9c	[Static Runtime] Clone test does not use uninitialized memory (#66557 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66557 The test was previously using `at::empty_strided` to initialize one of its inputs. The contents of the tensor returned by this function are random, uninitialized memory. If we happened to get a NaN, this test would fail since `use_equalnan` was not set. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31611961 fbshipit-source-id: 79a9476d0d6ce7a9f1412eefcef19bc2618c54b8	2021-10-13 14:02:34 -07:00
Don Jang	736fa09a9a	[Static Runtime] Manage output tensors (#65515) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515 This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows: - At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff. - At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor` to manage. This set consists of tensor objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple). - The 2nd+ iterations preallocate a slab of memory and all output tensors identified during the 1st iteration. Note that these preallocated tensors are NOT deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them. This mandates that SR cannot be reentered until `deallocateOutputTensors` is called by the client. - In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory. - Nit: I plan to use camelCase for function names, and so all newly introduced functions use camelCase despite inconsistencies with snake_case. We can gradually fix the inconsistencies. This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed. Test Plan: - Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths. - Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing testcases (e.g., `StaticRuntime.EmbeddingBag`). Reviewed By: hlu1 Differential Revision: D31049221 fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383	2021-10-11 09:50:54 -07:00
Don Jang	416f593080	[Static Runtime] Group graph nodes into input aliases & output aliases (#65517) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517 This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows: - input_aliases: values that are either inputs or contain aliases of inputs or constants. - output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases. - Values that don't show up in input_aliases or output_aliases are created and consumed internally within the graph. `output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance. Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before. Reviewed By: hlu1 Differential Revision: D30940955 fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc	2021-10-07 14:35:12 -07:00
Mike Iovine	d5f64afc38	[Static Runtime] Support aten::to.prim_dtype overload (#64928) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64928 Added support for this overload of `aten::to`: ``` aten::to.prim_dtype(Tensor(a) self, int? dtype, bool non_blocking=False, bool copy=False) -> Tensor(a|b) ``` Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_to` Reviewed By: hlu1 Differential Revision: D30901398 fbshipit-source-id: 38ce807c30185e92dd472b404b362f22ac7e4efb	2021-10-07 10:22:44 -07:00
Mike Iovine	6d7fab5929	[Static Runtime][easy] Clone scripts do not use aten::add (#66161 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66161 `aten::add` is not guaranteed to be bit exact with the JIT interpreter. This was causing non-deterministic test failures on master. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31406764 fbshipit-source-id: d968cb1bdb8f33934682ef3712a1341a3aacf18e	2021-10-06 12:37:39 -07:00
Mike Iovine	ed50fa2513	[Static Runtime] Test isOptimizableContainerType and getAlwaysAliveValues (#65849 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849 Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31282901 fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9	2021-10-04 20:46:07 -07:00
Don Jang	89ed9bdaee	[Static Runtime] Fix bug of creating output aliases in aten::embedding_bag (#65516) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65516 This change fixes a bug in which Static Runtime's `aten::embedding_bag` out variant implementation creates aliases in its managed output tensors. Managed output tensors should never alias each other, since writing to one can unintentionally overwrite another's contents. This exact problem was causing the bug at T97393697, causing SR to return wrong return values. This bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (`3fb33b38b9`)), but had not been found until now since our testing didn't include running the model in debug mode. Fortunately this bug is not hitting production since the aliased outputs are not used in production. This change fixes the root cause in `_embedding_bag_cpu_impl_out` by replacing alias creation with copying. Note that this change also includes a fundamental change in Static Runtime's unit testing: `testStaticRuntime` exercises the given graph 3 times: 1. profile run 2. run using the profile to allocate managed tensors 3. reuse the managed tensors -- newly added Adding step 3 reveals this bug with a new unittest `EmbeddingBagWithManagedOutput`. Test Plan: - Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248). - Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future. Reviewed By: hlu1 Differential Revision: D31104345 fbshipit-source-id: 7bddf9cd82b400d18d8ce1bf15e29b815ef9ba8f	2021-10-03 15:10:58 -07:00
Scott Wolchok	ffede499b2	[PyTorch][Static Runtime] Fast path for contiguous to_copy (#65499 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499 When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc. ghstack-source-id: 139549027 Test Plan: Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options). Before: I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252 I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871 I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916 I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093 I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971 mean: 7.144172, stddev: 0.1283 After: I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122 I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579 I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033 I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833 I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. 
Iters per second: 146.85 mean: 6.800632, stddev: 0.013227 Looks like about a 5.3% improvement. Reviewed By: hlu1 Differential Revision: D31125492 fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823	2021-10-01 12:13:33 -07:00
Mike Iovine	5f7ab7be6f	[Static Runtime] concat_add_mul_replacenan_clip retains axis arg (#65741 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65741 This op previously assumed `axis == 1`, causing graphs that would otherwise be valid to return incorrect results after fusing. Reviewed By: hlu1 Differential Revision: D31234944 fbshipit-source-id: 89885a3b119357698ebd9fd429b009813260a2f4	2021-09-29 08:04:20 -07:00
Kushashwa Ravi Shrimali	4752453d27	[Structured Kernels] Port for `baddbmm` and `bmm` (#64805 ) Summary: This PR attempts to port `baddbmm` and `bmm` to structured kernels. The reason it's in the same PR: because a lot of it is common for both the ops, including the checks and implementation. Issue tracker: https://github.com/pytorch/pytorch/issues/55070 cc: ysiraichi ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805 Reviewed By: gchanan Differential Revision: D31134454 Pulled By: ezyang fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261	2021-09-28 11:07:31 -07:00
Mike Iovine	ef9e560796	[Static Runtime] Add aten::remainder out variant (#64967 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64967 Out variant implementation for `aten::remainder`. Added both scalar and tensor overloads. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Remainder` Reviewed By: d1jang Differential Revision: D30915469 fbshipit-source-id: 9f27f18c86d66b11eac0aa4659c7062cb785b7e9	2021-09-24 07:51:39 -07:00
Raghavan Raman	31584d065e	[Static Runtime] Added NNC implementation for signed log1p kernel. (#65387 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387 Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op. Also, added a SR microbenchmark for this kernel which shows the performance improvement. Without fusion: ``` -------------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------------- BM_signed_log1p/16 1953 ns 1953 ns 358746 BM_signed_log1p/64 2049 ns 2049 ns 342145 BM_signed_log1p/512 3291 ns 3291 ns 214342 BM_signed_log1p/4096 15559 ns 15559 ns 44420 BM_signed_log1p/32768 101936 ns 101935 ns 6843 BM_signed_log1p/65536 194792 ns 194789 ns 3615 ``` With NNC fusion: ``` -------------------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------------------- BM_signed_log1p/16 369 ns 369 ns 1896179 BM_signed_log1p/64 497 ns 497 ns 1406995 BM_signed_log1p/512 1618 ns 1618 ns 430209 BM_signed_log1p/4096 11327 ns 11326 ns 61463 BM_signed_log1p/32768 84099 ns 84086 ns 8325 BM_signed_log1p/65536 166531 ns 166510 ns 4186 ``` This clearly shows >15% improvement in performance of this kernel with NNC fusion. 
On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops: without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved) with NNC fusion: `0.55%` Test Plan: `buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p` Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1) ``` get 57220 prediction values get 57220 prediction values max_error: 0 total: 0 ``` Reviewed By: hlu1 Differential Revision: D30609492 fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd	2021-09-22 15:53:33 -07:00
Hao Lu	ce101fed02	[PyPer] copy-free freeze_module (#65118 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65118 Cloning the module can increase memory use. By freezing the module directly without cloning it first, we can avoid this memory usage increase. Reviewed By: eellison, movefast1990 Differential Revision: D30955053 fbshipit-source-id: 2feb738eddcf66aa68c92bf695cc05b57bd990f0	2021-09-20 17:25:10 -07:00
Mike Iovine	99e4ab5d44	[Static Runtime] Implement and enable variadic tuple unpack (#64934 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934 Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from: ``` %0, %1 = prim::TupleUnpack(%a) %2, %3 = prim::TupleUnpack(%b) ``` into: ``` %0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b) ``` The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about. Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack` Reviewed By: d1jang Differential Revision: D30872109 fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9	2021-09-20 10:36:11 -07:00
Don Jang	ae00075ac7	[Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65123) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123 This change re-reverts D30883290 (`0e11454d19`). D30883290 (`0e11454d19`) broke the OSS build since it implicitly removed the default move constructor of `StaticRuntime`. ``` Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime' Sep 15 15:39:57 return torch::jit::StaticRuntime(*smod); Sep 15 15:39:57 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_; Sep 15 15:39:57 ^ Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete; Sep 15 15:39:57 ^ Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime' Sep 15 15:39:57 auto sr = getStaticRuntime(); Sep 15 15:39:57 ^ ~~~~~~~~~~~~~~~~~~ Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_; Sep 15 15:39:57 ^ Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete; Sep 15 15:39:57 ^ Sep 15 15:39:57 2 errors generated. ``` This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine). Original Summary: This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp. `MemoryPlanner` performs independent sub-tasks: static analysis of a graph, creating a memory plan, and allocating/deallocating managed Tensors. This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support. Test Plan: - Confirm that OSS build went well (See External Tests section). Reviewed By: mikeiovine Differential Revision: D30983292 fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116	2021-09-17 13:32:01 -07:00
Don Jang	8241193d76	[Static Runtime] Introduce static_runtime::dict_unpack (#64771 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64771 Test Plan: - Added `StaticRuntime.RemoveImmutableInputDictLookupsWithImmutableInputDict` - Added `StaticRuntime.RemoveImmutableInputDictLookupsWithMutableInputDict` - TBD: Perf impact measurement Reviewed By: mikeiovine Differential Revision: D30685083 fbshipit-source-id: 050a92ef3b3ed0fdc0ab7a13a4b5dbfede9342a9	2021-09-16 23:25:13 -07:00
Don Jang	3fb33b38b9	[Static Runtime] Check if outputs of a node do not overlap with each other (#63013) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013 This change enhances the current memory overlapping check to include outputs: the enhancement enforces a constraint that all outputs of a node should NOT overlap with each other, since they are supposed to be updated by the node at the same time and hold the node's outputs. This check will detect a problem like T97393697 immediately in debug mode. Test Plan: - Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs` - Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run. Reviewed By: hlu1 Differential Revision: D30211705 fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0	2021-09-15 08:38:05 -07:00
Mike Iovine	616fd9219d	[Static Runtime] Add sign/abs/log1p/mul fusion pass (#64209) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209 Add a new fusion pass that transforms the following pattern: ``` graph(%input): %0 : Tensor = aten::sign(%input) %1 : Tensor = aten::abs(%input) %2 : Tensor = aten::log1p(%1) %res : Tensor = aten::mul(%0, %2) return (%res) ``` Into a single op: ``` graph(%input): %res : Tensor = static_runtime::signed_log1p(%input) return (%res) ``` The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff does not enable this pass. Followup: navahgar will add an NNC kernel which is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p` Test passed with new graph pass disabled and enabled. Reviewed By: hlu1 Differential Revision: D30559929 fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a	2021-09-02 08:31:40 -07:00
Ray Peng	09e610e36d	[Static Runtime] Out version for softmax (#64243 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64243 Test Plan: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ... V0830 16:35:22.524479 613839 impl.cpp:1410] Switch to out variant for node: %5 : Tensor = aten::softmax(%a.1, %dim.1, %dtype.1) ... [ OK ] StaticRuntime.IndividualOps_Softmax (803 ms) ``` Reviewed By: hlu1 Differential Revision: D30656149 fbshipit-source-id: 115b7b4a75448fd6a5c526808080ca9a4251302c	2021-08-31 18:33:26 -07:00
Harut Movsisyan	3c15822f5f	[Static Runtime] Implement aten::nonzero out variant (#64126 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64126 Test Plan: Confirm out variant is called: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ``` Reviewed By: mikeiovine Differential Revision: D30617729 fbshipit-source-id: 752749638c8f467815efa57021cb3de5c728ab1b	2021-08-31 00:51:15 -07:00
Harut Movsisyan	1f16c22dc8	[Static Runtime] Implement aten::cumsum out variant (#64159 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159 Test Plan: Confirm out variant is called for both versions: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ``` Reviewed By: mikeiovine Differential Revision: D30622819 fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736	2021-08-30 16:18:22 -07:00
Harut Movsisyan	e24c3644d8	[Static Runtime] aten::cat out version when it is not being replaced by prim::VarConcat (#64157 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157 UseVariadicCat optimization is not applied to aten::cat if list input to the op can not be moved to the position before op (https://fburl.com/diffusion/l6kweimu). For these cases we will need out version for SR. Test Plan: Confirm out variant is called: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ``` Reviewed By: d1jang Differential Revision: D30598574 fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a	2021-08-30 09:42:38 -07:00
Harut Movsisyan	8af1407eab	[Static Runtime] Out version for torch.linalg.norm (#64070 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070 Test Plan: Confirm out variant is called for both versions: ``` > buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 ``` Reviewed By: d1jang Differential Revision: D30595816 fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563	2021-08-29 21:00:11 -07:00
Don Jang	9f1f22b9bc	[Static Runtime] Add out variant of quantized::embedding_bag_byte_prepack (#64081) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64081 This change adds an out variant of `quantized::embedding_bag_byte_prepack`. Test Plan: - Added `ShapeInferenceTest.QEmbeddingBagByteUnpack`. - Observed ``` V0824 13:38:49.723708 1322143 impl.cpp:1394] Switch to out variant for node: %2 : Tensor = quantized::embedding_bag_byte_prepack(%input) ``` Reviewed By: hlu1 Differential Revision: D30504216 fbshipit-source-id: 1d9d428e77a15bcc7da373d65e7ffabaf9c6caf2	2021-08-27 10:53:23 -07:00
Harut Movsisyan	f2c47cf4db	[Static Runtime] Out version for fmod (#64046 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64046 Test Plan: Confirm out variant is used: ``` > //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1 V0826 23:31:30.321382 193428 impl.cpp:1395] Switch to out variant for node: %4 : Tensor = aten::fmod(%a.1, %b.1) ``` Reviewed By: mikeiovine Differential Revision: D30581228 fbshipit-source-id: dfab9a16ff8afd40b29338037769f938f154bf74	2021-08-27 03:05:06 -07:00
Don Jang	c90b3cb1da	[Static Runtime] Manage temporary Tensors for aten::layer_norm (#64078) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078 This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the `tmp1` and `tmp2` Tensors are managed by the static runtime. Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it: ``` at::Tensor mean = create_empty_from({M}, X); at::Tensor rstd = create_empty_from({M}, X); ``` that the static runtime misses an opportunity to manage. This change puts them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime as of now chooses to manage only output tensors. Test Plan: - Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated. - Confirmed that the new op gets activated during testing: ``` V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3) ``` Reviewed By: hlu1 Differential Revision: D30486475 fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d	2021-08-27 02:44:43 -07:00
Don Jang	cbfec02007	[Static Runtime] Add native op for aten::expand_as (#64024 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64024 `aten::expand_as` creates a view of the input tensor. This change adds its native op implementation for the static runtime. Test Plan: - Added `StaticRuntime.IndividualOps_ExpandAs` Reviewed By: hlu1 Differential Revision: D30546851 fbshipit-source-id: e53483048af890bc41b6192a1ab0c5ba0ee2bdc0	2021-08-26 13:05:53 -07:00
Hao Lu	6fa646ad54	[StaticRuntime] Fix bug in HasInplaceOp (#63842 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63842 Reviewed By: mikeiovine Differential Revision: D30506914 fbshipit-source-id: b2e358cfb991dacdb295b61bbc37beb36b73b852	2021-08-24 17:07:45 -07:00
Mike Iovine	7774a4e95b	[Static Runtime] Implement prim::VarStack out variant (#63579 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579 Provide a static runtime out variant implementation for the new op introduced in D30426232 (`1385f9fb12`). Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack` Reviewed By: navahgar Differential Revision: D30410525 fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8	2021-08-24 09:44:29 -07:00
Don Jang	84890aae35	[Static Runtime] Add an out variant op for aten::abs (#63675 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63675 This change adds an out variant implementation for `aten::abs`. Test Plan: - Observed `V0820 14:14:08.880342 101788 impl.cpp:1394] Switch to out variant for node: %3 : Tensor = aten::abs(%a.1)` - Perf impact: TBD Reviewed By: hlu1 Differential Revision: D30461317 fbshipit-source-id: 0c0230bd40afe463ae1ccb222c2a1207ebcf4191	2021-08-23 16:25:10 -07:00
Hao Lu	b2a601ffe5	[Static Runtime] Implement out variant for fb::quantized_linear (#63635 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63635 Reviewed By: ajyu Differential Revision: D30446234 fbshipit-source-id: 1ef014186ff725930a97d0159626f9233ee74030	2021-08-20 21:42:22 -07:00
Don Jang	913c1f83f4	[Static Runtime] Add native op for aten::detach (#63625 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63625 This change adds a static runtime's native op implementation for `aten::detach` op. See the standard `aten::detach`'s implementation (https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp.html#_ZN2at6native6detachERKNS_6TensorE ) for comparison. Test Plan: - Added `StaticRuntime.IndividualOps_Detach`. - Observed ``` V0819 18:55:33.181188 3092034 impl.cpp:1398] Switch to native impl for node: %a.1 : Tensor = aten::detach(%input.1) ``` Reviewed By: hlu1 Differential Revision: D30443187 fbshipit-source-id: d6e0eadb1b817e0a126c4fc97526abc276ee8a17	2021-08-20 00:46:27 -07:00
Mike Iovine	47a9e8ff32	[Static Runtime] Support __getitem__ for lists (#63398 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63398 This change provides a native `__getitem__` implementation for lists to avoid overhead associated with falling back to the JIT interpreter. Test Plan: Unit tests: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D30368464 fbshipit-source-id: e0e0971508cd5d9bcf6025606993dc24ecbf6764	2021-08-19 06:38:51 -07:00
Mike Iovine	9d9e7a8d72	[Static Runtime] Implement aten::append (#63350 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63350 Add a native implementation for `aten::append`, the list append op. Test Plan: New unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Append` Reviewed By: hlu1 Differential Revision: D30326461 fbshipit-source-id: 0dbdf6cc82e78c7c36db39583256f6b87385e3d3	2021-08-17 13:40:18 -07:00
Mike Iovine	078b8004a6	[Static Runtime] Implement prim::TupleUnpack (#63243 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63243 Add `prim::TupleUnpack` native op to static runtime. Test Plan: Unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D30306955 fbshipit-source-id: 21923d6cbd5545c144ac051b3d48b37ec6e610cf	2021-08-16 14:56:30 -07:00
Mike Iovine	3dcd785cac	[Static Runtime] Add tests for all aten ops (#62347 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62347 This diff includes tests for all `aten` ops that did not already have test coverage. Test Plan: `buck test //caffe2/benchmarks/static_runtime/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D29968280 fbshipit-source-id: 768655ca535f9e37422711673168dce193de45d2	2021-08-09 12:09:59 -07:00
Rong Rong (AI Infra)	7f1b672b7a	Revert D29952381: [Static Runtime] Ensure that unittests only use out variants or native ops Test Plan: revert-hammer Differential Revision: D29952381 (`8737e17af2`) Original commit changeset: e60e70b80ccf fbshipit-source-id: 59dc2f920b7ceaf94ba8f5f36024e7cc710f6645	2021-08-04 14:25:11 -07:00
Don Jang	8737e17af2	[Static Runtime] Ensure that unittests only use out variants or native ops (#62335 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62335 This change ensures that unittests only use out variants or native ops. - Our unittests currently assume that a graph fed to the static runtime correctly replaces an interpreter op for its corresponding out variant / native op, but it's not checked by the unittest. This change ensures that. - We relied on manual inspection of log messages to see if an out variant is used for a specific workload even for unittesting. This change frees us from doing that. - `aten::add` is excluded from this check since it's only enabled for an internal workload. Also some unittests are excluded by using `expect_interpreter_op = true` since they are written to use interpreter ops by design. Test Plan: Ran `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest` successfully. Reviewed By: mikeiovine, hlu1 Differential Revision: D29952381 fbshipit-source-id: e60e70b80ccf45e91c6654b4ad53f92ffd5ab702	2021-08-04 11:37:15 -07:00
Mike Iovine	34f50c6e35	[Static Runtime] testStaticRuntime verifies that # of nodes is at least 2 (#62622 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62622 This allows us to catch cases where an out variant is being tested but the test author forgot to call `.clone()` in the test script. Having 2 or more ops does not guarantee that the memory planner is being exercised, but having fewer than 2 guarantees that it is not. Reviewed By: hlu1 Differential Revision: D30058050 fbshipit-source-id: 5bc053736f1cc6fd1ffcf8254bf38874ac18c34b	2021-08-03 15:55:57 -07:00
Raghavan Raman	b91a917616	[Static Runtime] Fixed another build failure in OSS due to test_utils.h (#62338 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62338 Test Plan: Imported from OSS Reviewed By: d1jang Differential Revision: D29965744 Pulled By: navahgar fbshipit-source-id: cf3e54ac13432ea8afc4b718fac6c9768743d01b	2021-07-28 11:41:33 -07:00
Don Jang	68efa186cc	[static runtime] Implement aten::full (#62227 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62227 Test Plan: Added `StaticRuntime.IndividualOps_Full` to cover the newly added code path. Reviewed By: hlu1 Differential Revision: D29923649 fbshipit-source-id: 722950137c35ae325590a670b97f03b395e8eac3	2021-07-28 09:50:27 -07:00
Mike Iovine	e1bee3eb30	[Static Runtime] Add missing unit tests for static runtime ops (#62238 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62238 Added tests for the following ops: * `aten::mul` * `aten::nan_to_num` * `aten::stack` * `aten::relu` * `aten::tanh` Reviewed By: hlu1 Differential Revision: D29914217 fbshipit-source-id: 6a6c39629310e7131127e24fdce7253ccdf80340	2021-07-27 14:12:21 -07:00
Raghavan Raman	60070982d2	[Static Runtime] Fixed build failure in OSS due to test_utils (#62216 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62216 Test Plan: Imported from OSS Reviewed By: hlu1 Differential Revision: D29917514 Pulled By: navahgar fbshipit-source-id: 379863e6cd0b157de3bfa1482f5519b26654b3d2	2021-07-26 16:10:10 -07:00
Mike Iovine	6007ad3529	[Static Runtime] Refactor fb op tests to use testStaticRuntime (#62064 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62064 `testStaticRuntime` was previously only available in `test_static_runtime.cc`. It has been moved to a common library `test_utils` to facilitate code re-use. This also lets us test dynamic shapes in `test_fb_operators` Reviewed By: hlu1 Differential Revision: D29858928 fbshipit-source-id: 68a94760166ddb745972b0f1fc24bed594937d1c	2021-07-26 08:25:10 -07:00

119 Commits