pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
cyy	45efa1aaa8	[3/N] Use internal linkage in C++ files (#151297 ) Follows #151070. Pull Request resolved: https://github.com/pytorch/pytorch/pull/151297 Approved by: https://github.com/Skylion007	2025-05-05 17:48:39 +00:00
Zhou Fang	fc5913b6bf	[StaticRuntime] Fix a bug that memory planner ignores subblocks (#146728 ) (#146855 ) Summary: When Static Runtime graph node has sub-blocks, the memory planner does not consider sub-blocks' inputs as a node's input in memory planner. As the result, such nodes' inputs' lifetime is incorrect and corresponding tensor memory is released earlier than required and causes errors. Differential Revision: D69195886 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855 Approved by: https://github.com/swolchok	2025-02-11 13:59:54 +00:00
Scott Wolchok	e558008a05	[PyTorch] Add test that canEnableStaticRuntime rejects prim::CallMethod (#120853 ) Rejecting prim::CallMethod is called out in a comment in impl.cpp, but doesn't seem to be tested. Now it is. Differential Revision: [D54338261](https://our.internmc.facebook.com/intern/diff/D54338261/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120853 Approved by: https://github.com/houseroad	2024-04-23 15:56:42 +00:00
Alan Ji	70b0f1b248	fix some typos (#106018 ) Fixes #ISSUE_NUMBER Fix typos in `test_static_module.cc`, `backend_cutting_test.cc` and `types_base.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106018 Approved by: https://github.com/awgu	2023-07-26 18:14:44 +00:00
Huy Do	8cb5c5543e	Revive static_runtime_benchmark build and test (#87660 ) This build uses the wrong BUILD_ENVIRONMENT `pytorch-linux-focal-py3`, thus it hasn't been run for a long time (forgotten). The name was probably the old name of the build environment we used in the past. The convention today doesn't have the `pytorch-` prefix. There is a TODO for this: > TODO: this condition is never (BUILD_ENVIRONMENT doesn't start with pytorch-), need to fix this. This is done as part of [T131829540](https://www.internalfb.com/intern/tasks/?t=131829540), where we want `static_runtime_benchmark` build and test jobs to run in OSS CI to avoid breaking internal * I also fix some compiler warning errors `-Werror=sign-compare`, `-Werror,-Wunused-const-variable`, and gcc7 compatibility issue along the way because this hasn't been run for a long time. * Reviving this test also reveals a small bug in `PrepackWeights` test in `test_static_runtime.cc` added recently in https://github.com/pytorch/pytorch/pull/85289. The test refers to an internal ops and should only be run internally. This has been fixed by https://github.com/pytorch/pytorch/pull/87799 (To be merged) Pull Request resolved: https://github.com/pytorch/pytorch/pull/87660 Approved by: https://github.com/malfet	2022-11-08 08:32:45 +00:00
Mike Iovine	ed7a8ab436	[Static Runtime] Make canEnableStaticRuntime examine sub-blocks (#87396 ) Summary: Someone was running into problems where 1) Static Runtime enablement would fail 2) We would try to fall back to the JIT interpreter after trying to create `StaticModule` 3) The fallback fails because Static Runtime mangled the graph. We don't want to prevent Static Runtime from mutating its input due to memory concerns. The intent of `canEnableStaticRuntime` is to catch issues in the module before Static Runtime messes with it. With this diff, `StaticModule` instantiation can be avoided by querying `canEnableStaticRuntime` and the issue is fixed. Test Plan: New unit test Differential Revision: D40564452 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87396 Approved by: https://github.com/tenpercent	2022-10-26 14:34:29 +00:00
Akshay Parashar	38169c2287	[Static Runtime] Fix precision error in test cases (#80935 ) Summary: - Test cases related to DeepAndWideSciptModel() were crashing at random due to a precision issue - Test cases affected by the precision issue: DeepWide, KWargsAPI_1, KWargsAPI_2, KWargsAPI_Optional, FusionPass - The test failure was not always observed because the input to the model is random (via torch::randn) - Increase the absolute tolerance for these test cases Differential Revision: D37639067 Pull Request resolved: https://github.com/pytorch/pytorch/pull/80935 Approved by: https://github.com/mikeiovine	2022-07-06 16:31:18 +00:00
mikeiovine	02713221e3	[SR] Fuse clamp/nan_to_num Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094 Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor. Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)! Approved by: https://github.com/navahgar	2022-05-10 23:33:59 +00:00
Mike Iovine	849984a2cd	[SR] Sigmoid out variant calls fast_sigmoid (#75661 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75661 `fast_sigmoid` is a variant of sigmoid in NNC that is implemented in terms of `fast_tanh` (which is a fast rational function approximation). ghstack-source-id: 155604086 Reviewed By: navahgar, hlu1 Differential Revision: D35481390 fbshipit-source-id: 1d64b5c375539f3b2461a1f3d9b86cd696eae7a1 (cherry picked from commit 8106c2512b8d7b373cb6545a43c3e8fc04805c4b)	2022-05-06 00:14:30 +00:00
Mike Iovine	1fed6b7559	[SR] Eliminate extra permutes around softmax calls (#76391 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/76391 I've seen this pattern in many important internal models: ``` x = torch.permute(a, [0, 2, 1]) y = torch.softmax(x, 2) z = torch.permute(y, [0, 2, 1]) ``` This is equivalent to ``` z = torch.softmax(a, 1) ``` The `permute` ops can degrade performance, especially if copy variants are on. Add another pattern to our `EliminateExtraPermuteOpsPass` to handle this. ghstack-source-id: 155466506 Test Plan: New unit tests Reviewed By: navahgar, huiguoo Differential Revision: D35938289 fbshipit-source-id: 398b5528077b0b3f1c6fc5544e483803e96d68e9 (cherry picked from commit d742abd094d1fef23ca6a34703d97a6da2d14bd1)	2022-05-04 23:08:49 +00:00
Mike Iovine	b02b3f25db	[SR] Quick hack to eliminate no-op slice (#75774 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774 `list[0:]` is a no-op. This should really be eliminated on the modeling side, implement as a graph pass for now until we can get this into prod models. Test Plan: New unit tests Reviewed By: navahgar Differential Revision: D35632947 fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d (cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)	2022-05-03 19:29:46 +00:00
mikeiovine	98b4a4100d	[SR] Add a copy variant for fused_split_and_squeeze Pull Request resolved: https://github.com/pytorch/pytorch/pull/75660 The outputs of `split_and_squeeze` are passed to `VarStack` in models we care about. `VarStack` has a [fast path](https://www.internalfb.com/code/fbsource/[893193f5277184fd17f4ea3f28fe415a4df37707]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=296-298) for when all of its inputs have the same strides. Hitting the slow path adds a ton of extra overhead - so much that it's worth it to copy in `split_and_squeeze` and force all of `VarStack`'s inputs to be contiguous so we can take advantage of the fast path. Differential Revision: [D35513777](https://our.internmc.facebook.com/intern/diff/D35513777/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35513777/)! Approved by: https://github.com/hlu1	2022-04-13 20:02:01 +00:00
Mike Iovine	2ca66ffb7d	[SR] Force split_and_squeeze usage via graph transformation (#74274 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74274 Reviewed By: navahgar Differential Revision: D34913889 fbshipit-source-id: 655d3f1e5f4c027cb94758b74826a4b4882e9458 (cherry picked from commit bc94d30b69888ca6633a27090a3b87a08919231a)	2022-03-29 19:13:40 +00:00
Mike Iovine	f5a9c36d0b	[SR] Eliminate extra permute ops before `aten::sum` (#74481 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481 This diff fixes an interesting performance issue related to `permute_copy`. We see this pattern frequently: ``` y = torch.permute(x, (0, 2, 1)) z = torch.sum(y, dim=-1) ``` With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589 But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597 But the permute is actually unnecessary, we can just statically turn the graph into this to ensure that the fast kernel is hit with copy variants on: ``` z = torch.sum(x, dim=1) ``` ghstack-source-id: 152003888 Reviewed By: navahgar Differential Revision: D34992319 fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f (cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)	2022-03-23 23:00:14 +00:00
Mike Iovine	818bf361b6	[SR] Fix a kwargs API default value bug (#73681 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681 Static runtime is rejecting legal calls made with the kwargs API when there are parameters with default values. ghstack-source-id: 150433627 Test Plan: Added unit test to cover this case Reviewed By: navahgar, d1jang Differential Revision: D34588804 fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755 (cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)	2022-03-03 22:31:37 +00:00
Don Jang	bbc59ff2bf	[Static Runtime] Introduce StaticNodeInfo to store ProcessedNode's data independent from runtime instances (#73536 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536 Currently the `ProcessedNode` class assumes 2 distinct roles that are not too obvious: 1) a "template" that contains the metadata of an actual node executable by the runtime, owned by `StaticModule` 2) fully instanced ones that are owned by `StaticRuntime`. We currently merge these two use cases into one class, which can be error-prone if illegal copying happens uncontrollably. Currently, we only copy objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created. To address this issue, this change introduces `StaticNodeInfo`, a separate class, to distinguish the aforementioned two use cases in the code more clearly. With this, `StaticNodeInfo` is for (1) and `ProcessedNode` is now for (2). Test Plan: Existing tests Reviewed By: mikeiovine Differential Revision: D33985600 fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c (cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)	2022-03-02 22:33:32 +00:00
Mike Iovine	d2c0c0b638	[SR] Apply all graph passes to sub-blocks (#72598 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72598 Apply all optimizations to sub-blocks by replacing loops over `graph->nodes()` with loops over nodes in `DepthFirstGraphNodeIterator` ghstack-source-id: 149155700 Test Plan: Existing unit tests Reviewed By: d1jang Differential Revision: D34111430 fbshipit-source-id: 015076030368bb67df24ed5892475534b8f8f272 (cherry picked from commit `a4314520de`)	2022-02-15 20:19:42 +00:00
Mike Iovine	cff5e22a72	[SR] Relax aten::__is__ constraint for SR enablement (#71807 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71807 There's no need to completely disallow `aten::__is__` and `aten::__isnot__`. The only problematic case is when the comparison is between two tensors, e.g. in ``` def forward(x): y = x.detach() # Should be false, but we get True # after our EliminateNoOps pass return x is y ``` Test Plan: New unit test covers this case Reviewed By: d1jang Differential Revision: D33783668 fbshipit-source-id: c9f57fa96937ecce38a21554f12b69c45cc58fe4 (cherry picked from commit `019588f4ca`)	2022-02-03 12:18:46 +00:00
Mike Iovine	2d5296b0e7	[SR] Implement prim::Loop (#69838 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69838 Implement `prim::Loop` with the new `StaticRuntimeBlockRunner` abstraction. ghstack-source-id: 148186483 Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime/...` Reviewed By: d1jang Differential Revision: D33049595 fbshipit-source-id: 550de5167b46fccd65ff77d092785289b5e5d532 (cherry picked from commit `8baf1753af`)	2022-02-02 19:30:50 +00:00
Mike Iovine	2aa699505d	[SR] Implement prim::If (#69837 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69837 Implement `prim::If` with the new `StaticRuntimeBlockRunner` abstraction. ghstack-source-id: 148186475 Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime/...` Accuracy test at top of stack Reviewed By: d1jang Differential Revision: D33045908 fbshipit-source-id: 281fb4a73528249fa60f65ac26f8ae6737771f55 (cherry picked from commit `de3b12dc08`)	2022-02-02 19:30:50 +00:00
Mike Iovine	d2599701fd	[SR] Force sub-blocks to return at least one output (#69836 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69836 It is technically possible for the sub-blocks to return zero outputs. This is problematic for `StaticRuntimeBlockRunner`, because it assumes that at least one output is being returned. Rather than slowing down SR with special logic for this corner case, we can simply force these sub-blocks to return `None`. ghstack-source-id: 148186453 Test Plan: Sub-blocks with no return values tested at top of stack Reviewed By: d1jang Differential Revision: D33050420 fbshipit-source-id: 17d9e19fda6431aa9fd0b155131349bac42bc149 (cherry picked from commit `c97fd07bf5`)	2022-02-02 19:30:50 +00:00
Mike Iovine	238dded10f	[SR] Graph pass to create owned refs of special IValues (#69835 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69835 `StaticRuntimeBlockRunner` moves its outputs to the return value at the end of `run_impl`. However, there's a corner case where this can cause problems. If we return a constant, then the only reference in the `constants_` array can be destroyed by this move. We could add special logic to handle this in `run_impl`. But since this is a relatively rare corner case, it's simpler to just add an op that does nothing but create an owned reference to its input. This owned reference can be safely moved out of `StaticRuntimeBlockRunner`. Note that this also applies to returned values in sub-blocks that are from outer scopes. ghstack-source-id: 148186452 Test Plan: `buck test caffe2/benchmarks/static_runtime/...` Added a new unit test with a graph that simply returns a constant. Tests with sub-blocks at top of stack. Reviewed By: d1jang Differential Revision: D33047519 fbshipit-source-id: 22b6058f0d1da8a6d1d61a6f2866bc518bff482b (cherry picked from commit `a8f89a12ee`)	2022-02-02 19:30:50 +00:00
Mike Iovine	4b789df68b	[SR] Add BlockRunner and handle sub-blocks (#69834 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69834 * Modify the `StaticModule` constructor to handle index initialization for sub-blocks. * Add a new class `StaticRuntimeBlockRunner`. This class is almost exactly like what we've been calling `StaticRuntime` up to this point, except that it does not own a `values_` array. All `StaticRuntimeBlockRunners` hold an unowned reference to a `values_` array owned by `StaticRuntime`. This is a useful abstraction for implementing control flow - it gives us a way for sub-blocks to look up values from surrounding scopes! ghstack-source-id: 148086245 Test Plan: `buck test caffe2/benchmarks/static_runtime/...` Reviewed By: d1jang Differential Revision: D33028039 fbshipit-source-id: 4f01417bad51a0cf09b1680a518308da647be1f6 (cherry picked from commit `3a9feffd92`)	2022-02-01 17:20:55 +00:00
Scott Wolchok	3a77fb244b	[PyTorch][Static Runtime] Delete cleanup_activations option (#71501 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71501 This option disabled the memory planner. Supporting it would require us to add multiple versions of ops that borrow their inputs (because they rely on the memory planner to support that), and I'm not aware of a particular need to continue supporting it. ghstack-source-id: 147385569 Test Plan: CI, rerun broken test from task Reviewed By: mikeiovine Differential Revision: D33669290 fbshipit-source-id: ecb01995891aecb5f4d0da2d9c51eed1f8fe489a (cherry picked from commit `5e4fefb109`)	2022-01-21 18:15:43 +00:00
Mike Iovine	873585da2b	[SR] Improve set_inputs (#69087 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69087 This diff includes a variety of improvements to `set_inputs` to unify behavior with `torch::jit::Module`: 1. Eliminate code duplication between rvalue/lvalue overloads 2. Add type checks 3. Make input length check a `TORCH_CHECK` instead of a debug check - we have to fail when the wrong number of inputs are passed. 4. `schema` now always includes `self`, even if we release `module_`. This is consistent with `torch::jit::Module`. ghstack-source-id: 145599837 Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D32711705 fbshipit-source-id: fe97c10b4f03801ba59868b452e7d02b26b3106b	2021-12-15 09:31:19 -08:00
Donald Dong	f7294cd865	[Static Runtime] Skip ReplaceWithCopy when inputs have writers (#69819 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69819 We should skip ReplaceWithCopy if the inputs to the operator can be updated during inference. For a set of tensors that share data, ReplaceWithCopy should not be applied to any of them if any of them is updated. Currently, the check in place misses some cases (for example, when there are updates and uses <= 1). This diff addresses the missing cases by querying AliasDB. Test Plan: - Added test cases, including one that was problematic before this diff - CI Reviewed By: mikeiovine Differential Revision: D33052562 fbshipit-source-id: 61f87e471805f41d071a28212f2f457e8c6785e7	2021-12-14 09:39:49 -08:00
Mike Iovine	f87f1d08e8	[SR] assignStorageToManagedTensors returns a vector (#69568 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69568 Non-empty vectors should never be passed to `assignStorageToManagedTensors` and `assignStorageToManagedOutputTensors`. Presumably, this out-variant convention was adopted to avoid move-assigning the corresponding attributes in `MemoryPlanner`. But the cost of a vector move-assign is not high, and this function type signature is safer. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: donaldong Differential Revision: D32729289 fbshipit-source-id: 88f19de8eb89d8a4f1dd8bbd4d9e7f686e41888b	2021-12-09 17:01:48 -08:00
Don Jang	9aa1b3e396	[Static Runtime] [Code Cleanup] Encapsulate function objects within ProcessedFunction (#69595 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595 This changes encapsulates `function` object in `ProcessedFunction` objects instead of exposing it unnecessarily just for executing it. Test Plan: Existing tests Reviewed By: mikeiovine Differential Revision: D32908341 fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364	2021-12-09 15:11:03 -08:00
Mike Iovine	008469c5e2	[SR] Simplify memory re-use algorithm (#68302 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302 Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis. Test Plan: ## Re-use metrics `inline_cvr` (294738512_58) Before * `local` ``` Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 4601984 bytes Total number of reused tensors: 1183 ``` * `local_ro` ``` Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 29696 bytes Total number of reused tensors: 959 ``` After * `local` ``` Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 4520000 bytes Total number of reused tensors: 1198 ``` * `local_ro` ``` Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 29120 bytes Total number of reused tensors: 963 ``` Reviewed By: hlu1 Differential Revision: D32370424 fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf	2021-12-07 13:25:42 -08:00
Hao Lu	ed3b73fd4d	[Static Runtime] Skip ProcessedNode::verify_no_memory_overlap() for out variants (#68639 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639 Fix all problems related to `ProcessedNode::verify_no_memory_overlap()` - Only enable this check for native and fallback ops that are not inplace or view ops - Enable ProcessedNode::verify_no_memory_overlap() in debug mode and enforce it - Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas fb::expand_dims's schema was not correct after this check is re-enabled. It's fixed in D32556204 (`39ab417107`) Reviewed By: mikeiovine Differential Revision: D32553708 fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13	2021-12-02 05:03:12 -08:00
Mike Iovine	ee4cfaa286	[SR] Add utility class to determine tensor ranges (#68284 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284 Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`. Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: swolchok Differential Revision: D32397207 fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6	2021-11-19 13:10:55 -08:00
Don Jang	aa9ee8d02a	[Static Runtime] Avoid copying function objects per StaticRuntime instance (#68368 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368 Currently, each instance of `StaticRuntime` has its own copy of a `std::function` object wrapped in a `ProcessedNode::Function` object, in order to invoke the actual operation implementation. However, all instances of `StaticRuntime` derived from the same `StaticModule` object invoke exactly the same op implementation, so this copying is avoidable. This change adds a `StaticModule::functions_` member variable to keep a list of unique instances of `ProcessedFunction` objects. A newly constructed `StaticRuntime` takes pointers to `ProcessedFunction` objects instead of the whole function objects. This can save a substantial amount of memory per `StaticRuntime` instance. This comes with a sacrifice in execution time. Now that a `ProcessedNode` instance keeps the function object's pointer, executing a node involves an extra pointer dereference. However, this cost proved to be negligible in local performance tests. Thanks to hlu1 for proposing this non-intrusive improvement idea :D Test Plan: This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) (patched D32181666 to print the memory turnover from instantiating a StaticRuntime instance) for CMF/local ( & 8% for CMF/local_ro). No noticeable latency regression was observed. ==AFTER * CMF/local memory turnover: 393608 latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087 * CMF/local_ro memory turnover: 387288 latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101 ==BEFORE * CMF/local memory turnover: 459888 latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18 * CMF/local_ro memory turnover: 420832 latency: PyTorch run finished. Milliseconds per iter: 7.43756. 
Iters per second: 134.453 ==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr: ==AFTER Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 1496896 bytes Total number of reused tensors: 1183 Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%) Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 39040 bytes Total number of reused tensors: 959 Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%) Total number of managed tensors: 1293 Total number of managed output tensors: 0 Total number of unmanaged values: 14 Total memory managed: 5293824 bytes Total number of reused tensors: 771 Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%) ==BEFORE Total number of managed tensors: 2660 Total number of managed output tensors: 0 Total number of unmanaged values: 3041 Total memory managed: 1496896 bytes Total number of reused tensors: 1183 Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%) Total number of managed tensors: 1412 Total number of managed output tensors: 0 Total number of unmanaged values: 2677 Total memory managed: 39040 bytes Total number of reused tensors: 959 Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%) Total number of managed tensors: 1293 Total number of managed output tensors: 0 Total number of unmanaged values: 14 Total memory managed: 5293824 bytes Total number of reused tensors: 771 Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%) Reviewed By: swolchok Differential Revision: D32337548 fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a	2021-11-16 20:28:48 -08:00
Scott Wolchok	639258499f	[PyTorch][Static Runtime] Add & use "small array" for ProcessedNodeInputs (#67935 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935 Rationale should be documented in code comments. In short, we can avoid heap-allocating arrays of input indexes for operators with 5 or fewer inputs, at the cost of a tag bit check on access. ghstack-source-id: 143429112 Test Plan: Patched d1jang's D32181666, which prints static runtime memory usage. Previous diff, local: ``` I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208 ``` This diff, local: ``` I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064 ``` 4.5% savings (16144 bytes) Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff. Reviewed By: hlu1 Differential Revision: D32216573 fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c	2021-11-16 10:21:12 -08:00
Scott Wolchok	6acde23bec	[PyTorch][Static Runtime] Switch input/output repr to 2-byte offsets (#67934 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934 This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x less than that), we only need to store the 2-byte offset into that array and 2-byte number of outputs in ProcessedNode. ghstack-source-id: 143429113 Test Plan: Patched d1jang's diff to measure memory turnover around SR startup. Previous diff, CMF local: ``` I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120 ``` This diff, CMF local: ``` I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208 72912 bytes (17%) savings ``` Perf looks neutral; see next diff (D32216573) test plan for details. Reviewed By: hlu1 Differential Revision: D32190751 fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc	2021-11-16 10:19:50 -08:00
Don Jang	9cb65df79f	[Static Runtime] Fallback to disabling manage_output_tensors instead of crashing when wrong API is used (#67939 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939 With `manage_output_tensor` enabled, a client of `StaticRuntime` is required to call it via `PyTorchPredictor::predict_managed_result`. If the client uses `PyTorchPredictor::operator()` instead, the client will experience a crash (intended behavior, so as not to leak memory of managed output tensors). This mistake can cause a catastrophic failure in production if it happens (by gatekeeper, config changes, etc). Considering the complexity of how `PyTorchPredictor` is used in different settings, the chance that this bug hits production is non-zero. This change introduces `StaticRuntime::disableManageOutputTensor` to disable the `manage_output_tensor` feature, instead of crashing, when a client mistakenly uses `PyTorchPredictor::operator()`. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can get non-managed output tensors to pass to the client safely. A slight perf degradation is expected from forcefully disabling `manage_output_tensors`, but the robustness gained outweighs the risk of crashes at a high rate. Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code. Reviewed By: swolchok Differential Revision: D32219731 fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf	2021-11-11 17:31:07 -08:00
Hao Lu	47bc47f2b9	[SR] Add runtime check to correct bad schema alias info (#67825 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825 The comment explains how it works. Test Plan: A small regression to local and local_ro if we only enable it for fallback ops. ``` ## local_ro # before I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247 I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319 I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759 I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619 # after # # only enable for fall back ops: 0.7% I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481 I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262 I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51 I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424 # enable for all (no clone): 4.7% I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005 I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14 I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. 
Iters per second: 781.506 I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765 # local # before I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233 I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641 I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735 I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206 # after # # only enable for fall back ops: 0.1% I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349 I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601 I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659 I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658 # enable for all (no clone): 0.9% I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783 I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611 I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. 
Iters per second: 50.3037 I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498 ``` Reviewed By: d1jang Differential Revision: D32124812 fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa	2021-11-10 19:35:11 -08:00
Hao Lu	1b2a366932	[SR] Enforce checks for resizing of the internal buffer in MemoryPlanner in unit tests (#67941 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67941 I just found out that due to the round up of the Tensor storage sizes to multiples of 64 bytes, resizing is not actually triggered for a lot of our unit tests (23 OSS, 16 internal). Now they should be all fixed. Also moved a bunch of tests to `test_static_module.cc` so that `test_static_runtime.cc` now only contains operator tests. From now on, by default if `args2` is passed to `test_static_runtime`, at the end of the second iteration, it would check that the managed buffer's size is bigger than the previous size and enforce that. You can bypass the check for ops with constant output sizes, such as `aten::sum` without `dim` passed in. Test Plan: Facebook ``` buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators ``` Reviewed By: swolchok Differential Revision: D32196204 fbshipit-source-id: 8425d9efe6b9a1c1e3807e576b1143efd7561c71	2021-11-09 16:07:40 -08:00
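The rounding effect described in the commit above (#67941) can be illustrated with a small sketch. This is not the `MemoryPlanner` code; `kAlignment`, `round_up`, and `triggers_resize` are hypothetical names, and the only detail taken from the commit message is that storage sizes are rounded up to multiples of 64 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Assumption taken from the commit message: sizes are rounded up to 64 bytes.
constexpr std::size_t kAlignment = 64;

// Hypothetical helper: round a byte count up to the next multiple of kAlignment.
constexpr std::size_t round_up(std::size_t nbytes) {
  return (nbytes + kAlignment - 1) / kAlignment * kAlignment;
}

// Hypothetical helper: a second run only needs a larger buffer if the
// rounded size actually grows, not merely the raw tensor size.
constexpr bool triggers_resize(std::size_t old_nbytes, std::size_t new_nbytes) {
  return round_up(new_nbytes) > round_up(old_nbytes);
}
```

Under these assumptions, a test whose second input grows a tensor from 40 to 60 bytes stays inside the same 64-byte bucket and never exercises the resize path, whereas growing it from 40 to 100 bytes does.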
Shashank Chaudhry	89c4e8c22b	[NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746 Test Plan: Visual inspection. Sandcastle. Reviewed By: zertosh Differential Revision: D31986646 fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8	2021-11-03 12:23:14 -07:00
Mike Iovine	f2582a59d0	[SR] Add rvalue overload for operator() (#66648 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648 Currently, SR shallow-copies its `IValue` inputs when running inferences. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes: 1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`. 2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`. Old: ``` operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs) ``` New: ``` template <class IValueList> operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs) ``` The implementations use perfect forwarding to invoke the correct overload of `set_inputs`. Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass. Reviewed By: hlu1 Differential Revision: D31659973 fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed	2021-10-22 10:51:47 -07:00
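The overload pattern described in the commit above (#66648) can be sketched as follows. This is a toy illustration, not Static Runtime code: `ToyRuntime` and `last_path` are hypothetical, and `std::string` stands in for `IValue`. Only the shape (a `set_inputs` overload pair plus a forwarding `operator()`) follows the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a runtime that stores its inputs in slots.
class ToyRuntime {
 public:
  // Lvalue overload: copies the caller's arguments into the slots.
  void set_inputs(const std::vector<std::string>& args) {
    slots_ = args;
    last_path_ = "copy";
  }

  // Rvalue overload: takes over the caller's buffer instead of copying.
  void set_inputs(std::vector<std::string>&& args) {
    slots_ = std::move(args);
    last_path_ = "move";
  }

  // Forwarding reference: std::forward preserves the value category of
  // `args`, so overload resolution picks the matching set_inputs.
  template <class ArgList>
  std::size_t operator()(ArgList&& args) {
    set_inputs(std::forward<ArgList>(args));
    return slots_.size();
  }

  // Reports which overload ran last (for illustration only).
  const std::string& last_path() const { return last_path_; }

 private:
  std::vector<std::string> slots_;
  std::string last_path_;
};
```

Calling `rt(args)` with a named vector takes the copying overload and leaves `args` intact, while `rt(std::move(args))` takes the moving overload, which is the case the commit uses to avoid refcount bumps.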
Hao Lu	6634570aef	[SR] Fix bug in ValueGroup (#66470 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66470 Reviewed By: d1jang Differential Revision: D31566348 fbshipit-source-id: e0f634af77d893bbc8d66f214b2b8bdd6ab58cc3	2021-10-13 19:26:38 -07:00
Don Jang	416f593080	[Static Runtime] Group graph nodes into input aliases & output aliases (#65517 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517 This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows: - input_aliases: values that are either inputs or contain aliases of inputs or constants. - output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases. - Values that don't show up in input_aliases or output_aliases are created and consumed internally within the graph. `output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance. Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before. Reviewed By: hlu1 Differential Revision: D30940955 fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc	2021-10-07 14:35:12 -07:00
Mike Iovine	ed50fa2513	[Static Runtime] Test isOptimizableContainerType and getAlwaysAliveValues (#65849 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849 Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break. Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest` Reviewed By: hlu1 Differential Revision: D31282901 fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9	2021-10-04 20:46:07 -07:00

42 Commits