Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595
This change encapsulates the `function` object in `ProcessedFunction` objects instead of exposing it unnecessarily just for executing it.
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D32908341
fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795
This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor.
Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, and 2) even if it weren't, `std::terminate` is called if an exception escapes while the stack is already unwinding. To get around this, we wrap the deallocation logic in a try/catch. If deallocation throws, we simply reset all of the memory planner state and carry on.
There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here.
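For illustration, a minimal self-contained sketch of the scope-exit pattern described above (names like `ScopeExitGuard`, `cleanup`, and `fallback` are illustrative, not the actual `Deallocator`):
```
#include <functional>
#include <utility>

// Runs `cleanup` when the guard goes out of scope (even if an exception is in
// flight). If `cleanup` itself throws, the exception is caught inside the
// (implicitly noexcept) destructor and `fallback` -- which must not throw --
// resets state instead, so nothing ever escapes the destructor.
class ScopeExitGuard {
 public:
  ScopeExitGuard(std::function<void()> cleanup, std::function<void()> fallback)
      : cleanup_(std::move(cleanup)), fallback_(std::move(fallback)) {}
  ~ScopeExitGuard() {
    try {
      cleanup_();   // e.g. MemoryPlanner::deallocate()
    } catch (...) {
      fallback_();  // e.g. reset the memory planner and carry on
    }
  }
 private:
  std::function<void()> cleanup_;
  std::function<void()> fallback_;
};
```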
Test Plan:
**New unit tests**
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32609915
fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302
Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis.
Test Plan:
## **Re-use metrics**
`inline_cvr` (294738512_58)
**Before**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4601984 bytes
Total number of reused tensors: 1183
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29696 bytes
Total number of reused tensors: 959
```
**After**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4520000 bytes
Total number of reused tensors: 1198
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29120 bytes
Total number of reused tensors: 963
```
Reviewed By: hlu1
Differential Revision: D32370424
fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69219
This change fixes a bug where the `aten::embedding_bag` implementation does not adjust the size of a managed output tensor according to the given input after memory planning starts.
Test Plan: Enhanced `StaticRuntime.EmbeddingBag` to trigger the existing bug that's fixed by this change.
Reviewed By: mikeiovine
Differential Revision: D32544399
fbshipit-source-id: 0a9f1d453e96f0cfa8443c8d0b28bbc520e38b29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67219
I found that these specific test cases were causing different failures when developing D31776259. I also found that it was difficult to debug testStaticRuntime failures, so I added more verbose logs gated behind -v 2.
ghstack-source-id: 144507287
Test Plan: Used during development of D31776259
Reviewed By: hlu1
Differential Revision: D31847566
fbshipit-source-id: ea9147fb246c345d18bbc8d7f3bfba48d3a0fab3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639
Fix all problems related to `ProcessedNode::verify_no_memory_overlap()`:
- Only enable this check for native and fallback ops that are not in-place or view ops
- Enable `ProcessedNode::verify_no_memory_overlap()` in debug mode and enforce it
- Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas
fb::expand_dims's schema was not correct after this check was re-enabled; it's fixed in D32556204 (39ab417107)
Reviewed By: mikeiovine
Differential Revision: D32553708
fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284
Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`.
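A hedged sketch of how a memory planner might consume this API (fragment only; `graph`, `ranges`, and `free_list` are assumed to exist):
```
// Fragment: walk the graph in execution order; once a managed tensor's last
// use (including uses through aliases) has passed, its storage becomes
// available for re-use by later allocations.
for (torch::jit::Node* node : graph->nodes()) {
  for (torch::jit::Value* managed : ranges.availableTensors(node)) {
    free_list.push_back(managed);  // storage may be handed out from here on
  }
}
```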
Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: swolchok
Differential Revision: D32397207
fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; now we can support dimension collapsing with non-coherent input tensors with different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.
Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943
Reviewed By: ngimel
Differential Revision: D32288709
Pulled By: dzhulgakov
fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368
Currently, each instance of `StaticRuntime` has its own copy of a `std::function` object wrapped in a `ProcessedNode::Function` object in order to invoke the actual operator implementation.
However, all instances of `StaticRuntime` derived from the same `StaticModule` object invoke exactly the same op implementations, so this duplication is avoidable.
This change adds a `StaticModule::functions_` member variable to keep a list of unique `ProcessedFunction` instances. A newly constructed `StaticRuntime` takes pointers to these `ProcessedFunction`s instead of whole function objects. This can save a substantial amount of memory per `StaticRuntime` instance.
This comes with a small sacrifice in execution time: now that a `ProcessedNode` instance keeps the function object's pointer, executing a node involves an extra pointer dereference. However, this cost proved negligible in local performance tests.
Thanks to hlu1 for proposing this non-intrusive improvement idea :D
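A rough structural sketch of the before/after layout (illustrative names, not the exact upstream declarations):
```
#include <functional>

struct IValue {};  // stand-in for c10::IValue in this sketch

// Before: every ProcessedNode (one set per StaticRuntime) owned a full
// std::function, duplicated across runtimes built from the same StaticModule.
struct ProcessedNodeBefore {
  std::function<void(IValue*)> fn;
};

// After: the StaticModule owns the unique ProcessedFunction objects once...
struct ProcessedFunction {
  std::function<void(IValue*)> fn;
};

// ...and each ProcessedNode stores only a pointer: one extra dereference per
// node execution, but a much smaller per-StaticRuntime footprint.
struct ProcessedNodeAfter {
  const ProcessedFunction* fn;
};
```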
Test Plan:
This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) for CMF/local (and by 8% for CMF/local_ro); memory turnover was measured by patching D32181666 to print it when instantiating a StaticRuntime instance. No noticeable latency regression was observed.
==AFTER
* CMF/local
memory turnover: 393608
latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087
* CMF/local_ro
memory turnover: 387288
latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101
==BEFORE
* CMF/local
memory turnover: 459888
latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18
* CMF/local_ro
memory turnover: 420832
latency: PyTorch run finished. Milliseconds per iter: 7.43756. Iters per second: 134.453
==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr:
==AFTER
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
==BEFORE
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
Reviewed By: swolchok
Differential Revision: D32337548
fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367
- bmm_test.py was using syntax not allowed in Python 3.6
- Some suppressions were not placed on the correct line.
With this file,
```
lintrunner --paths-cmd='git grep -Il .'
```
passes successfully.
Test Plan: Imported from OSS
Reviewed By: janeyx99, mrshenli
Differential Revision: D32436644
Pulled By: suo
fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935
Rationale should be documented in code comments. In short, we
can avoid heap-allocating arrays of input indexes for operators with 5
or fewer inputs, at the cost of a tag bit check on access.
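An illustrative, self-contained sketch of the technique (not the upstream class):
```
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative small-size optimization: input indices for ops with <= 5
// inputs are stored inline; larger ops fall back to a heap array. A one-bit
// tag says which representation is active, and every access pays one branch.
class InputIndices {
 public:
  static constexpr size_t kMaxInline = 5;

  explicit InputIndices(const std::vector<uint16_t>& idx)
      : size_(static_cast<uint16_t>(idx.size())),
        inline_(idx.size() <= kMaxInline) {
    if (inline_) {
      std::memcpy(small_, idx.data(), idx.size() * sizeof(uint16_t));
    } else {
      heap_ = new uint16_t[idx.size()];
      std::memcpy(heap_, idx.data(), idx.size() * sizeof(uint16_t));
    }
  }
  InputIndices(const InputIndices&) = delete;
  InputIndices& operator=(const InputIndices&) = delete;
  ~InputIndices() {
    if (!inline_) delete[] heap_;
  }

  uint16_t operator[](size_t i) const {
    return inline_ ? small_[i] : heap_[i];  // the tag-bit check on access
  }
  uint16_t size() const { return size_; }

 private:
  union {
    uint16_t small_[kMaxInline];  // inline storage, no heap allocation
    uint16_t* heap_;              // used only when size > kMaxInline
  };
  uint16_t size_;
  bool inline_;  // the "tag bit"
};
```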
ghstack-source-id: 143429112
Test Plan:
Patched d1jang's D32181666, which prints static runtime memory usage.
Previous diff, local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
This diff, local:
```
I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064
```
4.5% savings (16144 bytes)
Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff.
Reviewed By: hlu1
Differential Revision: D32216573
fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934
This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x fewer than that), we only need to store a 2-byte offset into that array and a 2-byte output count in each ProcessedNode.
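For illustration, a minimal self-contained sketch of the layout this describes (names are illustrative, not the upstream declarations):
```
#include <cstdint>
#include <vector>

struct IValue {};  // stand-in for c10::IValue in this sketch

// All node outputs for a runtime live in one shared array; each node records
// only a 16-bit offset and a 16-bit count into it (hence the 2**16 - 1 limit),
// instead of owning its own vector of outputs.
struct NodeOutputs {
  uint16_t offset;       // index of this node's first output in the shared array
  uint16_t num_outputs;  // number of consecutive slots owned by this node
};

IValue& output(std::vector<IValue>& shared, const NodeOutputs& node, uint16_t i) {
  return shared[node.offset + i];
}
```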
ghstack-source-id: 143429113
Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.
Previous diff, CMF local:
```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```
This diff, CMF local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
72912 bytes (17%) savings
Perf looks neutral; see next diff (D32216573) test plan for details.
Reviewed By: hlu1
Differential Revision: D32190751
fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939
With `manage_output_tensor` enabled, a client of `StaticRuntime` is required to call it via `PyTorchPredictor::predict_managed_result`. If the client instead uses `PyTorchPredictor::operator()`, it will crash (intended behavior, so as not to leak memory of managed output tensors). Such a mistake could cause a catastrophic failure in production (via gatekeeper, config changes, etc).
Considering the complexity of how `PyTorchPredictor` is used in different settings, the chances that this bug hits production are non-zero.
This change introduces `StaticRuntime::disableManageOutputTensor` to disable the `manage_output_tensor` feature when a client mistakenly uses `PyTorchPredictor::operator()`, instead of crashing. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can safely return non-managed output tensors to the client.
A slight perf degradation is expected from forcefully disabling `manage_output_tensors`, but the robustness win outweighs the risk of a catastrophic, high-rate crash in production.
Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code.
Reviewed By: swolchok
Differential Revision: D32219731
fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825
The comment explains how it works.
Test Plan:
There is a small regression to local and local_ro if we only enable it for fallback ops:
```
## local_ro
# before
I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247
I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319
I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759
I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619
# after
# # only enable for fall back ops: 0.7%
I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481
I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262
I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51
I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424
# enable for all (no clone): 4.7%
I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005
I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14
I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. Iters per second: 781.506
I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765
# local
# before
I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233
I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641
I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735
I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206
# after
# # only enable for fall back ops: 0.1%
I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349
I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601
I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659
I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658
# enable for all (no clone): 0.9%
I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783
I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611
I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. Iters per second: 50.3037
I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498
```
Reviewed By: d1jang
Differential Revision: D32124812
fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31994040
fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67941
I just found out that, due to the rounding up of Tensor storage sizes to multiples of 64 bytes, resizing is not actually triggered for a lot of our unit tests (23 OSS, 16 internal). They should all be fixed now. Also moved a bunch of tests to `test_static_module.cc` so that `test_static_runtime.cc` now only contains operator tests.
From now on, by default, if `args2` is passed to `test_static_runtime`, then at the end of the second iteration it checks that the managed buffer's size is bigger than the previous size and enforces that. You can bypass the check for ops with constant output sizes, such as `aten::sum` without `dim` passed in.
Test Plan:
Facebook
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```
Reviewed By: swolchok
Differential Revision: D32196204
fbshipit-source-id: 8425d9efe6b9a1c1e3807e576b1143efd7561c71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67856
Add an out variant for `prim::NumToTensor`, which returns a tensor constructed from a scalar input.
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Ran
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=*NumToTensorScalar* --v=1
```
and the output contains `Switch to out variant for node: %2 : Tensor = prim::NumToTensor(%0)`.
Reviewed By: mikeiovine
Differential Revision: D32014194
fbshipit-source-id: e7df65ea1bf05d59c1fc99b721aee420e484f542
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437
Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this.
Test Plan: `buck test caffe2/test:jit -- test_freezing`
Reviewed By: hlu1
Differential Revision: D31980843
fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31992093
fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67550
`aten::__is__` and `aten::__isnot__` are extremely problematic for a large number of SR graph optimizations.
Some examples:
- Removing ops that are no-ops in the forward pass like `aten::detach`. This would normally be trivial, but `is` introduces corner cases like this:
```
def forward(x):
y = x.detach()
return x is y
```
We get `False` before optimizations. But after optimizations, the test becomes `x is x`, and we get `True`.
- `ReplaceWithCopy`: the pass that replaces ops like `aten::to` with an out variant that copies its input. The following graph returns `True` before optimizations, but `False` afterwards
```
def forward(x):
y = x.to(x.dtype)
return x is y
```
- And many more, `FuseListUnpack` can break too
Since 99.99% of users do not use these ops, rejecting them so we don't have to deal with these corner cases is not a big deal.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D32022584
fbshipit-source-id: d135938edb2299c9b8f9511afac2bf568578879e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31965159
fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31962589
fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67255
Add an out variant for `aten::where`.
Since this op can be implemented quite trivially in NNC with `ifThenElse`, I added an NNC kernel as well.
Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: navahgar
Differential Revision: D31923886
fbshipit-source-id: b4379ee3aaf31a000e626b4caeafd3e3f3d60837
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001
The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.
This new overload will probably take a forwarding reference, so to avoid problems with overloading on a forwarding reference and to simplify the interface, it's best to remove this unused one.
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
`buck test caffe2/test:static_runtime`
Reviewed By: hlu1
Differential Revision: D31821990
fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648
Currently, SR shallow-copies its `IValue` inputs when running inference. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:
1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```
The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
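For illustration, a self-contained toy sketch of the forwarding pattern (toy types; not the actual SR classes):
```
#include <utility>
#include <vector>

struct Runner {
  // Copying overload: used when the caller still needs the inputs.
  void set_inputs(const std::vector<int>& args) { inputs_ = args; }
  // Moving overload: avoids the copy/refcount cost when the caller is done.
  void set_inputs(std::vector<int>&& args) { inputs_ = std::move(args); }

  template <class List>
  void operator()(List&& args) {
    // std::forward preserves the caller's value category, so the right
    // set_inputs overload is selected: lvalue -> copy, rvalue -> move.
    set_inputs(std::forward<List>(args));
  }

  std::vector<int> inputs_;
};

int main() {
  Runner r;
  std::vector<int> a{1, 2, 3};
  r(a);             // copies
  r(std::move(a));  // moves
}
```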
Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.
Reviewed By: hlu1
Differential Revision: D31659973
fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66940
`aten::index`'s schema is as follows:
```
"aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
```
The current implementation assumes `indices`' elements are all tensors by doing `elem.toTensor()`, which is incorrect. This change creates an empty optional value if an element of `indices` is not a tensor.
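A hedged sketch of the fix (fragment; `index_list` stands in for the node's `indices` input as a sequence of IValues):
```
// Fragment: build an optional-tensor index list; None elements become empty
// optionals instead of being forced through toTensor().
c10::List<c10::optional<at::Tensor>> indices;
for (const auto& elem : index_list) {
  indices.push_back(
      elem.isTensor() ? c10::optional<at::Tensor>(elem.toTensor())
                      : c10::nullopt);
}
```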
Test Plan: Fixed `StaticRuntime, IndividualOps_Index` to correctly test `aten::index` with `indices` that contains `None`.
Reviewed By: hlu1
Differential Revision: D31712145
fbshipit-source-id: be1c29674bcd55b67b0dcc2a988bc37fd43745f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64181
This PR replaces all the calls to:
- `transpose(-2, -1)` or `transpose(-1, -2)` by `mT()` in C++ and `mT` in Python
- `conj().transpose(-2, -1)` or `transpose(-2, -1).conj()` or `conj().transpose(-1, -2)` or `transpose(-1, -2).conj()` by `mH()` in C++ and `mH` in Python.
It also simplifies two pieces of code and fixes one bug where a pair of parentheses was missing in the function `make_symmetric_matrices`.
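For reference, a small C++ sketch of the equivalences behind these replacements:
```
#include <ATen/ATen.h>

// The two equivalences behind the replacements above (C++ spellings): mT()
// transposes the last two dims, mH() additionally conjugates.
bool check_equivalences(const at::Tensor& A) {
  const bool mt_ok = at::equal(A.transpose(-2, -1), A.mT());
  const bool mh_ok = at::equal(A.conj().transpose(-2, -1), A.mH());
  return mt_ok && mh_ok;
}
```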
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D31692896
Pulled By: anjali411
fbshipit-source-id: e9112c42343663d442dc5bd53ff2b492094b434a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;x++)`
to the format
`for(const auto var: irange(xmax))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
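For reference, a minimal example of the rewrite:
```
#include <c10/util/irange.h>

// Before/after for the loop rewrite described above.
int sum(const int* data, int n) {
  int out = 0;
  // Old form: for (int i = 0; i < n; i++) out += data[i];
  for (const auto i : c10::irange(n)) {  // new form
    out += data[i];
  }
  return out;
}
```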
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525
This should solve https://github.com/pytorch/pytorch/issues/60015
There were two `q_zero_point()` accesses inside a for loop which was
expensive. Moving them to before the loop sped things up 10x for a
microbenchmark.
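An illustrative sketch of the hoisting (assumes a per-tensor-quantized quint8 tensor; not the actual interpolate kernel):
```
#include <ATen/ATen.h>

// q_zero_point()/q_scale() go through dispatch, so read them once before the
// loop instead of once per element.
double dequantize_sum(const at::Tensor& q) {
  const int64_t zero_point = q.q_zero_point();  // hoisted out of the loop
  const double scale = q.q_scale();             // hoisted out of the loop
  const auto* data = q.data_ptr<c10::quint8>();
  const int64_t n = q.numel();
  double acc = 0;
  for (int64_t i = 0; i < n; ++i) {
    acc += (static_cast<double>(data[i].val_) - zero_point) * scale;
  }
  return acc;
}
```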
Test Plan:
```
// comment out benchmarks unrelated to original issue, for simplicity
cd benchmarks/operator_benchmark
python -m pt.qinterpolate_test
// before: 2994 us
// after: 324 us
// full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D31592422
fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429
The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
ghstack-source-id: 140532722
Test Plan:
CI
I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.
Reviewed By: hlu1
Differential Revision: D31096042
fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66557
The test was previously using `at::empty_strided` to initialize one of its inputs. The contents of the tensor returned by this function are random, uninitialized memory. If we happened to get a NaN, this test would fail since `use_equalnan` was not set.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31611961
fbshipit-source-id: 79a9476d0d6ce7a9f1412eefcef19bc2618c54b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515
This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:
- At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensors objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple).
- The 2nd and later iterations preallocate a slab of memory and all output tensors identified during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them (see the usage sketch after this list). This mandates that SR cannot be reentered until `deallocateOutputTensors` is called by the client.
- In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camelCase for function names, so all newly introduced functions use camelCase despite inconsistencies with the existing snake_case. We can gradually fix the inconsistencies.
This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed.
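A hedged usage sketch of the client-side contract described above (fragment only; `runtime`, `args`, and `kwargs` are assumed to exist):
```
// Fragment: with manage_output_tensors enabled, the client may use the
// returned outputs only until it hands the slab back.
c10::IValue out = runtime(args, kwargs);  // 2nd+ iteration: outputs live in the slab
// ... read / copy whatever is needed out of `out` ...
runtime.deallocateOutputTensors();        // required before re-entering the runtime
```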
Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.
- Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing testcases (e.g., StaticRuntime.EmbeddingBag`).
Reviewed By: hlu1
Differential Revision: D31049221
fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517
This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:
- input_aliases: values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are internally created and consumed within the graph.
`output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance.
Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.
Reviewed By: hlu1
Differential Revision: D30940955
fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66161
`aten::add` is not guaranteed to be bit exact with the JIT interpreter. This was causing non-deterministic test failures on master.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31406764
fbshipit-source-id: d968cb1bdb8f33934682ef3712a1341a3aacf18e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849
Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31282901
fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Do not delete `caffe2::OperatorBase::Output` calls as they have side effects
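For reference, illustrative forms of the suppressions described above:
```
#include <c10/macros/Macros.h>
#include <c10/util/irange.h>

// Register-style global kept only for its construction side effect.
C10_UNUSED static int registered = 0;
// constexpr instead of `static` for a global constant.
constexpr int kMaxIters = 16;

void spin() {
  for (const auto i : c10::irange(kMaxIters)) {
    (void)i;  // loop variable intentionally unused
  }
}
```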
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041
Reviewed By: ngimel
Differential Revision: D31360142
Pulled By: malfet
fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65516
This change fixes a bug that Static Runtime's `aten::embedding_bag` out variant implementation creates aliases in its managed output tensors.
Managed output tensors should never alias each other, since writing to one can illegally overwrite another's contents unintentionally; this exact problem was causing the bug in T97393697, making SR return wrong values.
This bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (3fb33b38b9)), but wasn't found until now since our testing didn't include running the model in debug mode. Fortunately this bug is not hitting production since the aliased outputs are not used there.
This change fixes the root cause from `_embedding_bag_cpu_impl_out` by replacing alias creation with copying.
Note that this change also includes a fundamental change in Static Runtime's unit testing: `testStaticRuntime` exercises the given graph 3 times:
1. profile run
2. run using the profile to allocate managed tensors
3. reuse the managed tensors -- newly added
Adding 3 reveals this bug with a new unittest `EmbeddingBagWithManagedOutput`.
Test Plan:
- Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248).
- Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future.
Reviewed By: hlu1
Differential Revision: D31104345
fbshipit-source-id: 7bddf9cd82b400d18d8ce1bf15e29b815ef9ba8f
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954
Reviewed By: ngimel
Differential Revision: D31326599
Pulled By: malfet
fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499
When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc.
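A hedged sketch of the kind of fast path this describes (generic; not the actual diff):
```
#include <ATen/ATen.h>
#include <cstring>

// Contiguous same-shape, same-dtype tensors can be copied byte-for-byte
// without going through dispatch / TensorIterator.
void fast_copy(at::Tensor& dst, const at::Tensor& src) {
  if (src.is_contiguous() && dst.is_contiguous() &&
      src.sizes() == dst.sizes() && src.dtype() == dst.dtype()) {
    std::memcpy(dst.data_ptr(), src.data_ptr(), src.nbytes());
    return;
  }
  dst.copy_(src);  // general path
}
```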
ghstack-source-id: 139549027
Test Plan:
Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options).
Before:
I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252
I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871
I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916
I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093
I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971
mean: 7.144172, stddev: 0.1283
After:
I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122
I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579
I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033
I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833
I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85
mean: 6.800632, stddev: 0.013227
Looks like about a 5.3% improvement.
Reviewed By: hlu1
Differential Revision: D31125492
fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551
Previously we had a big switch on Op kind to decide how to lower a given
JIT operator to NNC. This PR changes this switch to a hash table lookup.
Why? This helps us with at least two things:
1) With this approach we can easily check if we know how to handle a
given node in advance - i.e. we can inspect the entire graph and tell
whether it's possible to compile it or not without actually trying to do
that and dying in the middle. This would allow us to, say, provide
user-friendly error messages in AOT workflow.
2) We can switch to using the schema instead of the op kind to determine the
correct lowering. Unlike the op schema, the op kind might be ambiguous
(see e.g. #64963), and using it instead of the schema can lead to bugs.
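A toy sketch of the registry-based dispatch (stand-in types; real NNC keys lowerings on the operator schema):
```
#include <functional>
#include <string>
#include <unordered_map>

// Stand-in signature: the real lowering functions produce TE expressions.
using LoweringFn = std::function<int(int)>;

std::unordered_map<std::string, LoweringFn>& lowerings() {
  static std::unordered_map<std::string, LoweringFn> registry;
  return registry;
}

// Cheap up-front query: lets us inspect a whole graph and report unsupported
// ops before attempting to compile it.
bool canLower(const std::string& schema) {
  return lowerings().count(schema) != 0;
}
```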
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31148926
Pulled By: ZolotukhinM
fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65741
This op previously assumed `axis == 1`, causing graphs that would otherwise be valid to return incorrect results after fusing.
Reviewed By: hlu1
Differential Revision: D31234944
fbshipit-source-id: 89885a3b119357698ebd9fd429b009813260a2f4
Summary:
This PR attempts to port `baddbmm` and `bmm` to structured kernels. The reason both are in the same PR is that a lot of the code is common to both ops, including the checks and the implementation.
Issue tracker: https://github.com/pytorch/pytorch/issues/55070
cc: ysiraichi ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805
Reviewed By: gchanan
Differential Revision: D31134454
Pulled By: ezyang
fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387
Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.
Also, added a SR microbenchmark for this kernel which shows the performance improvement.
Without fusion:
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16 1953 ns 1953 ns 358746
BM_signed_log1p/64 2049 ns 2049 ns 342145
BM_signed_log1p/512 3291 ns 3291 ns 214342
BM_signed_log1p/4096 15559 ns 15559 ns 44420
BM_signed_log1p/32768 101936 ns 101935 ns 6843
BM_signed_log1p/65536 194792 ns 194789 ns 3615
```
With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16 369 ns 369 ns 1896179
BM_signed_log1p/64 497 ns 497 ns 1406995
BM_signed_log1p/512 1618 ns 1618 ns 430209
BM_signed_log1p/4096 11327 ns 11326 ns 61463
BM_signed_log1p/32768 84099 ns 84086 ns 8325
BM_signed_log1p/65536 166531 ns 166510 ns 4186
```
This clearly shows >15% improvement in performance of this kernel with NNC fusion.
On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
with NNC fusion: `0.55%`
Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`
Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)
```
get 57220 prediction values
get 57220 prediction values
max_error: 0 total: 0
```
Reviewed By: hlu1
Differential Revision: D30609492
fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65118
Cloning the module can increase memory use. By freezing the module directly without cloning it first, we can avoid this memory usage increase.
Reviewed By: eellison, movefast1990
Differential Revision: D30955053
fbshipit-source-id: 2feb738eddcf66aa68c92bf695cc05b57bd990f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934
Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from:
```
%0, %1 = prim::TupleUnpack(%a)
%2, %3 = prim::TupleUnpack(%b)
```
into:
```
%0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b)
```
The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about.
Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack`
Reviewed By: d1jang
Differential Revision: D30872109
fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123
This change re-reverts D30883290 (0e11454d19). D30883290 (0e11454d19) broke the OSS build since it implicitly removed the default move constructor of `StaticRuntime`.
```
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57 return torch::jit::StaticRuntime(*smod);
Sep 15 15:39:57 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57 auto sr = getStaticRuntime();
Sep 15 15:39:57 ^ ~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57 ^
Sep 15 15:39:57 2 errors generated.
```
This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine).
Original Summary:
This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.
`MemoryPlanner` performs an independent sub-task of static analysis of a graph, and creating memory planning, and allocating/deallocating managed Tensors.
This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.
Test Plan: - Confirm that OSS build went well (See External Tests section).
Reviewed By: mikeiovine
Differential Revision: D30983292
fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116
Summary:
Syncing nvfuser code base from the devel branch. Listing a few of our developments since the last sync:
- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants, which are required by the codegen (e.g. reduction axes).
To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.
Internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`
updates affecting integration:
1. profile_ivalue enabled for nvfuser. Related changes are in `torch/csrc/jit/runtime/*`.
2. Exposed a few more symbols in `aten/src/ATen/core/*` used by codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745
Reviewed By: saketh-are
Differential Revision: D30752939
Pulled By: malfet
fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013
This change enhances the current memory overlap check to include outputs: the enhancement enforces a constraint that all outputs of a node should NOT overlap with each other, since a node updates all of its outputs at the same time.
This check will detect a problem like T97393697 immediately in debug mode.
Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`
- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.
Reviewed By: hlu1
Differential Revision: D30211705
fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887
BufHandle has exactly the same functionality and should be used instead.
Differential Revision: D30889483
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707
Use torch.randn instead of torch.from_numpy to generate the tensor
Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test
Reviewed By: jingsh
Differential Revision: D30817302
fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64205
The log_vml version of the micro-bench is over **2x** faster than the log1p version. Here are the perf numbers:
```
---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------
SignedLog1pBench/ATen/10/1467 45915 ns 45908 ns 14506 GB/s=2.5564G/s
SignedLog1pBench/NNC/10/1467 40469 ns 40466 ns 17367 GB/s=2.9002G/s
SignedLog1pBench/NNCLogVml/10/1467 19560 ns 19559 ns 35902 GB/s=6.00016G/s
```
Thanks to bertmaher for pointing this out.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D30644716
Pulled By: navahgar
fbshipit-source-id: ba2b32c79d4265cd48a2886b0c62d0e89ff69c19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647
Add support for benchmarking of 8-bit quantization of N-D batched embeddings. Currently this only works for 3-dim embeddings and still requires thought on ramping up from 3-dim to N-dim.
Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test```
Reviewed By: jingsh
Differential Revision: D30770085
fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209
Add a new fusion pass that transforms the following pattern:
```
graph(%input):
%0 : Tensor = aten::sign(%input)
%1 : Tensor = aten::abs(%input)
%2 : Tensor = aten::log1p(%1)
%res : Tensor = aten::mul(%0, %2)
return (%res)
```
Into a single op:
```
graph(%input):
%res : Tensor = static_runtime::signed_log1p(%input)
return (%res)
```
The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff **does not** enable this pass.
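For reference, a scalar sketch of what the fused op computes:
```
#include <cmath>

// Scalar sketch of the fused semantics: sign(x) * log1p(|x|), in one pass
// instead of four separate elementwise ops.
float signed_log1p(float x) {
  const float sign = static_cast<float>((x > 0.0f) - (x < 0.0f));
  return sign * std::log1p(std::fabs(x));
}
```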
Followup: navahgar will add an NNC kernel which is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes.
Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`
Test passed with new graph pass disabled and enabled.
Reviewed By: hlu1
Differential Revision: D30559929
fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159
Test Plan:
Confirm out variant is called for both versions:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: mikeiovine
Differential Revision: D30622819
fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157
The UseVariadicCat optimization is not applied to `aten::cat` if the list input to the op cannot be moved to a position before the op (https://fburl.com/diffusion/l6kweimu). For these cases we need an out version for SR.
Test Plan:
Confirm out variant is called:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: d1jang
Differential Revision: D30598574
fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070
Test Plan:
Confirm out variant is called for both versions:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: d1jang
Differential Revision: D30595816
fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078
This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the `tmp1` and `tmp2` Tensors are managed by the static runtime.
Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
at::Tensor mean = create_empty_from({M}, *X);
at::Tensor rstd = create_empty_from({M}, *X);
```
that the static runtime misses an opportunity to manage.
This change puts them into (unused) output Tensors of a new placeholder op, `static_runtime::layer_norm`, so that the static runtime can manage them, since the static runtime as of now chooses to manage only output tensors.
Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.
- Confirmed that the new op gets activated during testing:
```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)
```
Reviewed By: hlu1
Differential Revision: D30486475
fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64024
`aten::expand_as` creates a view of the input tensor. This change adds its native op implementation for the static runtime.
Test Plan: - Added `StaticRuntime.IndividualOps_ExpandAs`
Reviewed By: hlu1
Differential Revision: D30546851
fbshipit-source-id: e53483048af890bc41b6192a1ab0c5ba0ee2bdc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579
Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`
Reviewed By: navahgar
Differential Revision: D30410525
fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586
This is another commit in the transition away from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.
After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
Differential Revision: D30429114
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778
This is a preparation for a switch from raw pointers to shared pointers
as a memory model for TE expressions and statements.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30487425
Pulled By: ZolotukhinM
fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63398
This change provides a native `__getitem__` implementation for lists to avoid overhead associated with falling back to the JIT interpreter.
Test Plan: Unit tests: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D30368464
fbshipit-source-id: e0e0971508cd5d9bcf6025606993dc24ecbf6764
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63350
Add a native implementation for `aten::append`, the list append op.
Test Plan: New unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Append`
Reviewed By: hlu1
Differential Revision: D30326461
fbshipit-source-id: 0dbdf6cc82e78c7c36db39583256f6b87385e3d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62347
This diff includes tests for all `aten` ops that did not already have test coverage.
Test Plan: `buck test //caffe2/benchmarks/static_runtime/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D29968280
fbshipit-source-id: 768655ca535f9e37422711673168dce193de45d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62335
This change ensures that unittests only use out variants or native ops.
- Our unittests currently assume that a graph fed to the static runtime correctly replaces an interpreter op with its corresponding out variant / native op, but this is not checked by the unittest. This change ensures that it is.
- We relied on manual inspection of log messages to see if an out variant is used for a specific workload even for unittesting. This change frees us from doing that.
- `aten::add` is excluded from this check since it's only enabled for an internal workload. Also some unittests are excluded by using `expect_interpreter_op = true` since they are written to use interpreter ops by design.
Test Plan: Ran `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest` successfully.
Reviewed By: mikeiovine, hlu1
Differential Revision: D29952381
fbshipit-source-id: e60e70b80ccf45e91c6654b4ad53f92ffd5ab702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods `set_tensor(.)` and `get_tensor()` in the Python-exposed API of the C++ logic with `buffer()` and `set_buffer(.)` for a cleaner interface.
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62622
This allows us to catch cases where an out variant is being tested but the test author forgot to call `.clone()` in the test script. More than two ops does not guarantee that the memory planner is being exercised, but fewer than two guarantees that it is not being used.
Reviewed By: hlu1
Differential Revision: D30058050
fbshipit-source-id: 5bc053736f1cc6fd1ffcf8254bf38874ac18c34b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62064
`testStaticRuntime` was previously only available in `test_static_runtime.cc`. It has been moved to a common library, `test_utils`, to facilitate code re-use. This also lets us test dynamic shapes in `test_fb_operators`.
Reviewed By: hlu1
Differential Revision: D29858928
fbshipit-source-id: 68a94760166ddb745972b0f1fc24bed594937d1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62067
The wrapper for `aten::cat` is no longer needed after the variadic cat change in D29565344 (ae58a4c45d).
Also added a simple test for dynamic shapes, i.e., the input tensors in `args2` are larger than those in `args1`.
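A sketch of what such a dynamic-shape test can look like; the test name and the exact `testStaticRuntime` signature are assumptions based on how the helper is used elsewhere in these tests:
```
#include <gtest/gtest.h>
#include <ATen/ATen.h>
#include "test_utils.h"  // assumed location of testStaticRuntime

TEST(StaticRuntime, VarCatDynamicShapes) {
  // The script output is cloned so the result is managed by the memory planner.
  const std::string cat_script = R"JIT(
    def forward(self, a: Tensor, b: Tensor, dim: int):
        return torch.cat([a, b], dim).clone()
  )JIT";

  std::vector<c10::IValue> args1{at::randn({2, 3}), at::randn({2, 3}), 0};
  // args2 has larger shapes than args1, which forces reallocation on the
  // second run and exercises the dynamic-shape path.
  std::vector<c10::IValue> args2{at::randn({8, 12}), at::randn({8, 12}), 0};
  testStaticRuntime(cat_script, args1, args2);
}
```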
Reviewed By: navahgar, mikeiovine
Differential Revision: D29864600
fbshipit-source-id: 44a712c2e776815c09e0bf5631412149b81274b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62098
The build was broken by D29821533 (1d2ea76afb). The `clamp` overloads used in `deep_wide.h`
are no longer available in the `at::native` namespace.
Use `at::cpu::clamp` and `at::cpu::clip_out` (which should be an alias for clamp) instead.
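For illustration, the substitution looks roughly like this (a sketch, not the exact diff):
```
#include <ATen/ATen.h>
#include <ATen/CPUFunctions.h>  // at::cpu:: wrappers for structured kernels

// Sketch: clamp through the CPU dispatch wrapper instead of the removed
// at::native overload.
at::Tensor clamp01(const at::Tensor& input) {
  // Previously something like: at::native::clamp(input, 0.0, 1.0);
  return at::cpu::clamp(input, 0.0, 1.0);
}
```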
Reviewed By: hlu1
Differential Revision: D29880187
fbshipit-source-id: 210b6d2be8a8142e7af1a0ba07e55a95b1a77d25
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` clang-tidy check, and so is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61783
Implement two new prim operators for static runtime: `isinstance` and `TypeCheck`. `isinstance` is very straightforward, but there were a few wrinkles with implementing `TypeCheck`:
1. There is no way to directly generate `TypeCheck` nodes from TorchScript, they are generated by the JIT at runtime. This makes testing a little difficult. I had to make some modifications to `testStaticRuntime` to allow for the use of IR and TorchScript tests.
2. The behavior of `prim::TypeCheck` as implemented here does not match up 1:1 with the version implemented in the interpreter! This is because grad mode is disabled in static runtime. Here's an example.
IR is the same as the one included in this test, but with `requires_grad == 1`
```
graph(%a.1 : Tensor,
%b.1 : Tensor):
%t0 : Float(2, 2, strides=[2, 1], device=cpu, requires_grad=1), %t1 : Float(3, 3, strides=[3, 1]), %type_matched : bool = prim::TypeCheck[types=[Float(2, 2, strides=[2, 1], device=cpu, requires_grad=1), Float(3, 3, strides=[3, 1])]](%a.1, %b.1)
return (%t0, %t1, %type_matched)
```
And in the test setup:
```
auto a = at::zeros({2, 2}, at::kFloat);
a.to(at::kCPU);
a.set_requires_grad(true);
auto b = at::ones({3, 3}, at::kFloat);
std::vector<IValue> args_correct = {a, b};
// prim::TypeCheck should be true with args_correct,
// but we get false when using static runtime!
```
Reviewed By: hlu1
Differential Revision: D29743862
fbshipit-source-id: db1788f0f5de42bab42602e8cc24eee04cbcc280
Summary:
There is some convoluted logic here to fix the `benchmarks` module imports for pytest.
- On one hand, if we want to use `tools.stats.scribe` from `benchmarks`, we will need to add `benchmarks/__init__.py`
- On the other hand, if we add `benchmarks/__init__.py`, it breaks how `pytest` resolves the system-installed `torch` instead of the local source module `../torch`
- That's why we are seeing errors like
```
ImportError while loading conftest '/var/lib/jenkins/workspace/benchmarks/fastrnns/conftest.py'.
benchmarks/fastrnns/__init__.py:1: in <module>
from .cells import * # noqa: F403
benchmarks/fastrnns/cells.py:1: in <module>
import torch
torch/__init__.py:29: in <module>
from .torch_version import __version__ as __version__
torch/torch_version.py:9: in <module>
from .version import __version__ as internal_version
E ModuleNotFoundError: No module named 'torch.version'
```
Instead, this PR changes the usage of `upload_scribe.py` back to its original form using an HTTP request; for now only CircleCI will continue down this path via `python benchmarks/upload_scribe.py`, which is gated by `if [[ -z "${GITHUB_ACTIONS}" ]];`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61808
Reviewed By: seemethere
Differential Revision: D29750188
Pulled By: zhouzhuojie
fbshipit-source-id: 3b842b21978f2159001e9c6c1cdc96c5a0515f2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61595
Add out variant wrapper for `aten::linear` in the static runtime
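A rough sketch of the out-variant pattern the static runtime uses; the registration macro and especially the `at::linear_out` call below are assumptions for illustration, not verbatim from this diff. The point is that the output tensor is created once and then reused, so the memory planner can manage it:
```
#include <torch/csrc/jit/runtime/static/ops.h>

namespace torch {
namespace jit {

// Sketch: out-variant wrapper. On the first run the output is created; on
// subsequent runs the existing managed tensor is resized and written in place.
REGISTER_OPERATOR_FUNCTOR(
    aten::linear,
    aten_linear,
    [](Node* /*n*/) -> SROperator {
      return [](ProcessedNode* p_node) {
        const auto& input = p_node->Input(0).toTensor();
        const auto& weight = p_node->Input(1).toTensor();
        const auto bias = p_node->Input(2).toOptional<at::Tensor>();
        if (p_node->Output(0).isNone()) {
          p_node->Output(0) = at::linear(input, weight, bias);
          return;
        }
        auto& out = p_node->Output(0).toTensor();
        fastResizeToZero(out);
        // Hypothetical out-variant call; the real wrapper may compose
        // lower-level out ops (e.g. addmm_out / matmul_out) instead.
        at::linear_out(out, input, weight, bias);
      };
    });

} // namespace jit
} // namespace torch
```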
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D29684236
fbshipit-source-id: 94df6d7267b3f269b2cadf065f207648777147df
Summary:
Related to https://github.com/pytorch/pytorch/issues/61632
This PR adds
- refactoring of scribe-related code into `scribe.py`
- a change to the `render_test_results` job to always use the `linux.2xlarge` runner
- a fallback to boto3 if `SCRIBE_GRAPHQL_ACCESS_TOKEN` is empty
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61675
Reviewed By: seemethere
Differential Revision: D29703523
Pulled By: zhouzhuojie
fbshipit-source-id: 829ad3630d3500a498b41aa458ce6539aaeae938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61507
Benchmark Python-only DDP vs production C++ based DistributedDataParallel.
- Implemented a pure Python DDP, `PythonDDP`, with support for SYNC and ASYNC reduction
- Added `compare_ddp` to measure the difference in the forward and backward steps
Kudos to Shen and Yi for the great idea.
Test Plan:
Tested on devgpus with 2 CUDA devices.
$ python compare_ddp.py
Python-only DDP has slightly better (-1%) forward performance and slightly slower (2%-20%) backward performance.
This suggests that we need to keep the C++ core, since the maximum latency increase can be 20%. See README.md for details.
Imported from OSS
Differential Revision: D29685364
Reviewed By: mrshenli
Pulled By: bowangbj
fbshipit-source-id: 429e4473fac0ec4c70d6db12d946d2636dd6477a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61566
This change uses `at::allclose` to compare results from the sigmoid implementations (CPU/NNC) instead of exact equality (`Tensor::equal`), due to small numerical differences between them.
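Roughly, the comparison in the test helper changes from exact equality to a tolerance-based check (a sketch under assumed variable names):
```
#include <ATen/ATen.h>

// Sketch: bitwise equality is too strict when comparing a reference CPU
// result against an NNC-generated kernel, so compare within tolerances.
bool results_match(const at::Tensor& expect, const at::Tensor& actual) {
  // Previously: return expect.equal(actual);
  return at::allclose(expect, actual, /*rtol=*/1e-5, /*atol=*/1e-8);
}
```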
Test Plan:
I confirmed that the flakiness of `StaticRuntime.Sigmoid` is gone with this change:
```
[djang@devvm1999.ftw0 ~/fbsource/fbcode] buck-out/gen/caffe2/benchmarks/static_runtime/static_runtime_cpptest -v 3 --gtest_filter=StaticRuntime.Sigmoid --gtest_repeat=100 &> output.txt
[djang@devvm1999.ftw0 ~/fbsource/fbcode] grep PASSED output.txt | wc
100 500 2100
```
Reviewed By: bertmaher
Differential Revision: D29671203
fbshipit-source-id: 99a7b16d18ea047c9aad444f36d8368f9d0b088d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61301
This change adds a `DCHECK` to ensure that outputs do not overlap with immutable inputs.
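A sketch of the kind of check this adds; the helper below and where it hooks into `ProcessedNode` are assumptions, and the real check may use ATen's memory-overlap utilities for finer granularity:
```
#include <ATen/ATen.h>

// Sketch: an output must not share storage with an input that the op is not
// allowed to mutate. is_alias_of is the simplest conservative version of
// this check.
bool output_overlaps_immutable_input(
    const at::Tensor& output,
    const at::Tensor& immutable_input) {
  return output.is_alias_of(immutable_input);
}

// In the runtime this would sit behind a debug-only assertion, e.g.:
//   DCHECK(!output_overlaps_immutable_input(out, in));
```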
Test Plan:
Added unittests as follows:
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithImmutableArguments`
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithMutableArguments`
Reviewed By: hlu1
Differential Revision: D29564158
fbshipit-source-id: bf14b4978ab544af79010cf724ed28202b4521cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61000
Add unit tests to bmm and addmm operators in static runtime.
Test Plan:
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D29459679
fbshipit-source-id: 5c7fa5c9b0675c1c84f3ae3110204d663255009c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57334
Here's a possibly controversial PR. These counters got in the way of
generalizing the fuser tests to handle arbitrary devices, and I guess I'm just
generally skeptical that they provide much value. While it's true that they let us
observe whether fusion groups were created, we already have assertions based on
the shape of the graph, and I'm not sure that I trust those any less than these
counters.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D29471484
Pulled By: bertmaher
fbshipit-source-id: f6d76f6e72dbfb581acff1d834b0c74500941b57
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60669
Test Plan: Added unit test to check for nested outputs.
Reviewed By: ajyu
Differential Revision: D29322025
fbshipit-source-id: a3c8d3c5f0bb7cf7fda4bc5f579adb8fa7bc3724
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60631
Per #48360, speed up `Transformer.generate_square_subsequent_mask`. The new impl is informally ~5x faster, though the absolute difference is probably small.
The PR includes Python and C++ versions, as well as updates to a couple of places where the previous impl had been copied around.
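For reference, a sketch of the faster construction in ATen terms (the exact helper the PR lands may differ): build the mask in one shot with `full` + `triu` instead of boolean masking followed by `masked_fill`.
```
#include <ATen/ATen.h>
#include <limits>

// Sketch: causal mask with -inf above the diagonal and 0 elsewhere, built
// directly rather than via a boolean mask and masked_fill.
at::Tensor square_subsequent_mask(int64_t sz) {
  return at::triu(
      at::full({sz, sz}, -std::numeric_limits<float>::infinity()),
      /*diagonal=*/1);
}
```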
Test Plan: Imported from OSS
Reviewed By: jbschlosser, albanD
Differential Revision: D29356673
Pulled By: bhosmer
fbshipit-source-id: 4c062ba0ead61a445aeef451c78777bf0b3a631e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60503
Fixed a few issues in the static_runtime::to_copy impl:
- fixed a bug with memory_format
- copy strides when appropriate. This is necessary to make sure that the fbgemm path in the copy kernel gets hit.
- fix the schema in the `ReplaceWithCopy` pass
- add registration of `static_runtime::to_copy.other`
Add more unit tests:
- test dynamic shapes
- test strided input tensor to `aten::to`
- test alias case (same input/output)
- test `to.other`
Reviewed By: ajyu
Differential Revision: D26838933
fbshipit-source-id: ec0d1a2deebe998fcfe8858e772e1ef429cb4522
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60579
- Modify testStaticRuntime to take two sets of inputs, so that if the second set of inputs has bigger shapes, it triggers memory allocations in `resize_` calls.
- Modify test scripts so that the output of the test op is managed by the memory planner, as explained in comments.
Reviewed By: ajyu
Differential Revision: D29221452
fbshipit-source-id: 09f0f7eb384dc8ca67594f1fa76e1e31392ee6ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60550
Original commit changeset: ed655497a981
Whatever gcc version OSS Bazel uses wasn't happy move-constructing the
SimpleIREvaluator, so use a unique_ptr instead.
Test Plan:
CI. Hope that the gcc version used by the OSS Bazel build is
happier with this (it should be), since actually testing it locally is
an intractable pain.
Reviewed By: navahgar
Differential Revision: D29333116
fbshipit-source-id: c3e4b5d8c91eb96a43ae5315a01ca0c0f4d4a99d
Summary:
* Open the JSON config file safely using a context manager (a `with` block).
* This makes sure that the file is closed even if an exception is raised.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58077
Reviewed By: anjali411
Differential Revision: D28711177
Pulled By: H-Huang
fbshipit-source-id: 597ba578311b1f1d6706e487872db4e784c78c3c