Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69595
This change encapsulates the `function` object in `ProcessedFunction` objects instead of exposing it unnecessarily just for executing it.
Test Plan: Existing tests
Reviewed By: mikeiovine
Differential Revision: D32908341
fbshipit-source-id: 5ff4951cbe276c5c6292227124d9eec1dd16e364
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68795
This change improves static runtime exception safety. Added a scope exit guard that invokes `MemoryPlanner::deallocate` in its destructor.
Caveat: we have to be really careful with the exception behavior of `MemoryPlanner::deallocate` and `MemoryPlanner`'s constructor, because they're now both potentially called in the destructor of the scope exit guard. Letting exceptions escape destructors is playing with fire since 1) the destructor of `Deallocator` is (implicitly) `noexcept`, and 2) even if it weren't, `std::terminate` is called if an exception escapes while the stack is already unwinding. To get around this, we wrap the deallocation logic in a try/catch. If deallocation throws, we simply reset all of the memory planner state and carry on.
There's a catch: the code path that we take when handling the deallocation exception can't throw. However, this code path is much simpler than memory planner construction/deallocation, so it's much easier to manually audit the correctness here.
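For illustration, a minimal self-contained sketch of the scope-exit pattern described above (names like `ScopeExitGuard`, `cleanup`, and `fallback` are illustrative, not the actual `Deallocator`):
```
#include <functional>
#include <utility>

// Runs `cleanup` when the guard goes out of scope (even if an exception is in
// flight). If `cleanup` itself throws, the exception is caught inside the
// (implicitly noexcept) destructor and `fallback` -- which must not throw --
// resets state instead, so nothing ever escapes the destructor.
class ScopeExitGuard {
 public:
  ScopeExitGuard(std::function<void()> cleanup, std::function<void()> fallback)
      : cleanup_(std::move(cleanup)), fallback_(std::move(fallback)) {}
  ~ScopeExitGuard() {
    try {
      cleanup_();   // e.g. MemoryPlanner::deallocate()
    } catch (...) {
      fallback_();  // e.g. reset the memory planner and carry on
    }
  }
 private:
  std::function<void()> cleanup_;
  std::function<void()> fallback_;
};
```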
Test Plan:
**New unit tests**
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D32609915
fbshipit-source-id: 71fbe6994fd573ca6b7dd859b2e6fbd7eeabcd9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68302
Implement the new memory re-use algorithm. It’s roughly based on the c2 one, but after going through many iterations it may not be a 1:1 port anymore. Also deleted the old liveness analysis.
Test Plan:
## **Re-use metrics**
`inline_cvr` (294738512_58)
**Before**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4601984 bytes
Total number of reused tensors: 1183
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29696 bytes
Total number of reused tensors: 959
```
**After**
* `local`
```
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 4520000 bytes
Total number of reused tensors: 1198
```
* `local_ro`
```
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 29120 bytes
Total number of reused tensors: 963
```
Reviewed By: hlu1
Differential Revision: D32370424
fbshipit-source-id: 06a8e0a295ed7a2b4d14071349c1f1e975f746bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69219
This change fixes a bug where the `aten::embedding_bag` implementation does not adjust the size of a managed output tensor according to the given input after memory planning starts.
Test Plan: Enhanced `StaticRuntime.EmbeddingBag` to trigger the existing bug that's fixed by this change.
Reviewed By: mikeiovine
Differential Revision: D32544399
fbshipit-source-id: 0a9f1d453e96f0cfa8443c8d0b28bbc520e38b29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67219
I found that these specific test cases were causing different failures when developing D31776259. I also found that it was difficult to debug testStaticRuntime failures, so I added more verbose logs gated behind -v 2.
ghstack-source-id: 144507287
Test Plan: Used during development of D31776259
Reviewed By: hlu1
Differential Revision: D31847566
fbshipit-source-id: ea9147fb246c345d18bbc8d7f3bfba48d3a0fab3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639
Fix all problems related to `ProcessedNode::verify_no_memory_overlap()`:
- Only enable this check for native and fallback ops that are not in-place or view ops
- Enable `ProcessedNode::verify_no_memory_overlap()` in debug mode and enforce it
- Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas
fb::expand_dims's schema was not correct after this check was re-enabled; it's fixed in D32556204 (39ab417107)
Reviewed By: mikeiovine
Differential Revision: D32553708
fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68284
Add a new class `ManagedTensorRanges` that determines when managed tensors can be made available for re-use. This class provides a method `availableTensors(Node* node)` that returns a vector of `Value*` (corresponding to managed tensors) that are not used (either directly or through any alias) after `node`.
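A hedged sketch of how a memory planner might consume this API (fragment only; `graph`, `ranges`, and `free_list` are assumed to exist):
```
// Fragment: walk the graph in execution order; once a managed tensor's last
// use (including uses through aliases) has passed, its storage becomes
// available for re-use by later allocations.
for (torch::jit::Node* node : graph->nodes()) {
  for (torch::jit::Value* managed : ranges.availableTensors(node)) {
    free_list.push_back(managed);  // storage may be handed out from here on
  }
}
```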
Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: swolchok
Differential Revision: D32397207
fbshipit-source-id: fb0d9a23f13abf6f2207e3d7266384966f477fc6
Summary:
nvfuser code update:
1. Tuning heuristics on schedulers for reduction/normalization kernels;
2. bfloat16 on IO tensor support;
3. Refactored memory format support; now we can support dimension collapsing with non-coherent input tensors with different memory formats, e.g. a channels-last tensor input to batch normalization. Note that we are currently limiting memory format to only Contiguous and Channels last;
4. Refactored nvfuser graph partitioning in `graph_fuser.cpp`, separated node merge and profile node API. Updated `profiling_record.cpp`.
Things that are reverted from our local branch:
1. changes on some entries in autodiff
2. aten::gelu with approximation
3. native_dropout(_backward)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67943
Reviewed By: ngimel
Differential Revision: D32288709
Pulled By: dzhulgakov
fbshipit-source-id: fc9491182ea7e0158bc112c66f096823c588eaf1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68368
Currently, each instance of `StaticRuntime` has its own copy of a `std::function` object wrapped in a `ProcessedNode::Function` object in order to invoke the actual operator implementation.
However, all instances of `StaticRuntime` derived from the same `StaticModule` object invoke exactly the same op implementations, so this duplication is avoidable.
This change adds a `StaticModule::functions_` member variable to keep a list of unique `ProcessedFunction` instances. A newly constructed `StaticRuntime` takes pointers to these `ProcessedFunction`s instead of whole function objects. This can save a substantial amount of memory per `StaticRuntime` instance.
This comes with a small sacrifice in execution time: now that a `ProcessedNode` instance keeps the function object's pointer, executing a node involves an extra pointer dereference. However, this cost proved negligible in local performance tests.
Thanks to hlu1 for proposing this non-intrusive improvement idea :D
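A rough structural sketch of the before/after layout (illustrative names, not the exact upstream declarations):
```
#include <functional>

struct IValue {};  // stand-in for c10::IValue in this sketch

// Before: every ProcessedNode (one set per StaticRuntime) owned a full
// std::function, duplicated across runtimes built from the same StaticModule.
struct ProcessedNodeBefore {
  std::function<void(IValue*)> fn;
};

// After: the StaticModule owns the unique ProcessedFunction objects once...
struct ProcessedFunction {
  std::function<void(IValue*)> fn;
};

// ...and each ProcessedNode stores only a pointer: one extra dereference per
// node execution, but a much smaller per-StaticRuntime footprint.
struct ProcessedNodeAfter {
  const ProcessedFunction* fn;
};
```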
Test Plan:
This change reduces the size of a StaticRuntime instance by 14.41% (459KB -> 393KB) for CMF/local (and by 8% for CMF/local_ro); memory turnover was measured by patching D32181666 to print it when instantiating a StaticRuntime instance. No noticeable latency regression was observed.
==AFTER
* CMF/local
memory turnover: 393608
latency: PyTorch run finished. Milliseconds per iter: 15.6965. Iters per second: 63.7087
* CMF/local_ro
memory turnover: 387288
latency: PyTorch run finished. Milliseconds per iter: 7.51308. Iters per second: 133.101
==BEFORE
* CMF/local
memory turnover: 459888
latency: PyTorch run finished. Milliseconds per iter: 15.8278. Iters per second: 63.18
* CMF/local_ro
memory turnover: 420832
latency: PyTorch run finished. Milliseconds per iter: 7.43756. Iters per second: 134.453
==Confirmation that ptvsc2_predictor_bench reports the same memory management stats for inline_cvr:
==AFTER
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
==BEFORE
Total number of managed tensors: 2660
Total number of managed output tensors: 0
Total number of unmanaged values: 3041
Total memory managed: 1496896 bytes
Total number of reused tensors: 1183
Total number of 'out' variant nodes/total number of nodes: 2452/2469 (99.3115%)
Total number of managed tensors: 1412
Total number of managed output tensors: 0
Total number of unmanaged values: 2677
Total memory managed: 39040 bytes
Total number of reused tensors: 959
Total number of 'out' variant nodes/total number of nodes: 1928/1937 (99.5354%)
Total number of managed tensors: 1293
Total number of managed output tensors: 0
Total number of unmanaged values: 14
Total memory managed: 5293824 bytes
Total number of reused tensors: 771
Total number of 'out' variant nodes/total number of nodes: 1298/1298 (100%)
Reviewed By: swolchok
Differential Revision: D32337548
fbshipit-source-id: e714e735399c93fde337b0f70e203a2de632057a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367
- bmm_test.py was using syntax not allowed in Python 3.6
- Some suppressions were not placed on the correct line.
With this file,
```
lintrunner --paths-cmd='git grep -Il .'
```
passes successfully.
Test Plan: Imported from OSS
Reviewed By: janeyx99, mrshenli
Differential Revision: D32436644
Pulled By: suo
fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67935
Rationale should be documented in code comments. In short, we
can avoid heap-allocating arrays of input indexes for operators with 5
or fewer inputs, at the cost of a tag bit check on access.
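An illustrative, self-contained sketch of the technique (not the upstream class):
```
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative small-size optimization: input indices for ops with <= 5
// inputs are stored inline; larger ops fall back to a heap array. A one-bit
// tag says which representation is active, and every access pays one branch.
class InputIndices {
 public:
  static constexpr size_t kMaxInline = 5;

  explicit InputIndices(const std::vector<uint16_t>& idx)
      : size_(static_cast<uint16_t>(idx.size())),
        inline_(idx.size() <= kMaxInline) {
    if (inline_) {
      std::memcpy(small_, idx.data(), idx.size() * sizeof(uint16_t));
    } else {
      heap_ = new uint16_t[idx.size()];
      std::memcpy(heap_, idx.data(), idx.size() * sizeof(uint16_t));
    }
  }
  InputIndices(const InputIndices&) = delete;
  InputIndices& operator=(const InputIndices&) = delete;
  ~InputIndices() {
    if (!inline_) delete[] heap_;
  }

  uint16_t operator[](size_t i) const {
    return inline_ ? small_[i] : heap_[i];  // the tag-bit check on access
  }
  uint16_t size() const { return size_; }

 private:
  union {
    uint16_t small_[kMaxInline];  // inline storage, no heap allocation
    uint16_t* heap_;              // used only when size > kMaxInline
  };
  uint16_t size_;
  bool inline_;  // the "tag bit"
};
```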
ghstack-source-id: 143429112
Test Plan:
Patched d1jang's D32181666, which prints static runtime memory usage.
Previous diff, local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
This diff, local:
```
I1105 12:48:35.820663 1066520 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 338064
```
4.5% savings (16144 bytes)
Ran 10 repetitions of CMF local_ro with core pinning: P467095603. This diff is perf neutral compared to the previous diff.
Reviewed By: hlu1
Differential Revision: D32216573
fbshipit-source-id: d18483db255f75f1d90e610ecded7727c6ffe65c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67934
This reduces the memory requirements of ProcessedNode: by allocating outputs sequentially into a shared array and supporting at most 2**16 - 1 values (current models seem to have 10-20x fewer than that), we only need to store a 2-byte offset into that array and a 2-byte output count in each ProcessedNode.
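For illustration, a minimal self-contained sketch of the layout this describes (names are illustrative, not the upstream declarations):
```
#include <cstdint>
#include <vector>

struct IValue {};  // stand-in for c10::IValue in this sketch

// All node outputs for a runtime live in one shared array; each node records
// only a 16-bit offset and a 16-bit count into it (hence the 2**16 - 1 limit),
// instead of owning its own vector of outputs.
struct NodeOutputs {
  uint16_t offset;       // index of this node's first output in the shared array
  uint16_t num_outputs;  // number of consecutive slots owned by this node
};

IValue& output(std::vector<IValue>& shared, const NodeOutputs& node, uint16_t i) {
  return shared[node.offset + i];
}
```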
ghstack-source-id: 143429113
Test Plan:
Patched d1jang's diff to measure memory turnover around SR startup.
Previous diff, CMF local:
```
I1104 12:19:39.900211 597593 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 427120
```
This diff, CMF local:
```
I1105 12:17:36.459688 866763 PyTorchStaticRuntimePredictor.cpp:82] memory turnover after creating an instance of StaticRuntime: 354208
```
72912 bytes (17%) savings
Perf looks neutral; see next diff (D32216573) test plan for details.
Reviewed By: hlu1
Differential Revision: D32190751
fbshipit-source-id: 30c1e2caa9460f0d83b2d9bb24c68ccfcef757cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67939
With `manage_output_tensor` enabled, a client of `StaticRuntime` is required to call it via `PyTorchPredictor::predict_managed_result`. If the client instead uses `PyTorchPredictor::operator()`, it will crash (intended behavior, so as not to leak memory of managed output tensors). Such a mistake could cause a catastrophic failure in production (via gatekeeper, config changes, etc).
Considering the complexity of how `PyTorchPredictor` is used in different settings, the chances that this bug hits production are non-zero.
This change introduces `StaticRuntime::disableManageOutputTensor` to disable the `manage_output_tensor` feature when a client mistakenly uses `PyTorchPredictor::operator()`, instead of crashing. When `StaticRuntime` is invoked via `PyTorchPredictor::operator()`, it first calls `StaticRuntime::disableManageOutputTensor` to disable the feature, so that it can safely return non-managed output tensors to the client.
A slight perf degradation is expected from forcefully disabling `manage_output_tensors`, but the robustness win outweighs the risk of a catastrophic, high-rate crash in production.
Test Plan: Added a unittest `StaticRuntime, DisableManageOutputTensors` to cover the newly added code.
Reviewed By: swolchok
Differential Revision: D32219731
fbshipit-source-id: caf5c910b34726c570e17435ede7d888443e90cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67825
The comment explains how it works.
Test Plan:
There is a small regression to local and local_ro if we only enable it for fallback ops:
```
## local_ro
# before
I1103 21:25:05.250440 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22213. Iters per second: 818.247
I1103 21:25:08.629221 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22351. Iters per second: 817.319
I1103 21:25:12.005179 2636751 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22285. Iters per second: 817.759
I1103 21:25:12.005236 2636751 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.22283, standard deviation: 0.000693619
# after
# # only enable for fall back ops: 0.7%
I1103 21:26:40.190436 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.22928. Iters per second: 813.481
I1103 21:26:43.590443 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23265. Iters per second: 811.262
I1103 21:26:46.992928 2644597 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.23379. Iters per second: 810.51
I1103 21:26:46.992980 2644597 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.23191, standard deviation: 0.0023424
# enable for all (no clone): 4.7%
I1103 21:27:55.291216 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.28204. Iters per second: 780.005
I1103 21:27:58.822347 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27854. Iters per second: 782.14
I1103 21:28:02.354184 2649780 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 1.27958. Iters per second: 781.506
I1103 21:28:02.354240 2649780 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 1.28006, standard deviation: 0.00179765
# local
# before
I1103 21:52:00.784718 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.676. Iters per second: 50.8233
I1103 21:52:28.985873 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.699. Iters per second: 50.7641
I1103 21:52:57.200223 2765168 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.6953. Iters per second: 50.7735
I1103 21:52:57.200273 2765168 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.6901, standard deviation: 0.0123206
# after
# # only enable for fall back ops: 0.1%
I1103 21:45:25.514535 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7103. Iters per second: 50.7349
I1103 21:45:53.773594 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7005. Iters per second: 50.7601
I1103 21:46:21.955680 2734440 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.7398. Iters per second: 50.659
I1103 21:46:21.955729 2734440 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.7169, standard deviation: 0.0204658
# enable for all (no clone): 0.9%
I1103 21:43:22.162272 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8893. Iters per second: 50.2783
I1103 21:43:50.651847 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8566. Iters per second: 50.3611
I1103 21:44:19.068519 2723868 PyTorchPredictorBenchLib.cpp:274] PyTorch run finished. Milliseconds per iter: 19.8793. Iters per second: 50.3037
I1103 21:44:19.068570 2723868 PyTorchPredictorBenchLib.cpp:285] Mean milliseconds per iter: 19.875, standard deviation: 0.0167498
```
Reviewed By: d1jang
Differential Revision: D32124812
fbshipit-source-id: 0f60c26f8fb338d347e4ca7a70b23e5a386fc9aa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67476
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31994040
fbshipit-source-id: 9de57d8d7925ee46544478eae8229952ca5f248a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67941
I just found out that, due to the rounding up of Tensor storage sizes to multiples of 64 bytes, resizing is not actually triggered for a lot of our unit tests (23 OSS, 16 internal). They should all be fixed now. Also moved a bunch of tests to `test_static_module.cc` so that `test_static_runtime.cc` now only contains operator tests.
From now on, by default, if `args2` is passed to `test_static_runtime`, then at the end of the second iteration it checks that the managed buffer's size is bigger than the previous size and enforces that. You can bypass the check for ops with constant output sizes, such as `aten::sum` without `dim` passed in.
Test Plan:
Facebook
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/benchmarks/static_runtime/fb:test_fb_operators
```
Reviewed By: swolchok
Differential Revision: D32196204
fbshipit-source-id: 8425d9efe6b9a1c1e3807e576b1143efd7561c71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67856
Add an out variant for `prim::NumToTensor`, which returns a tensor constructed from a scalar input.
Test Plan:
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Ran
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=*NumToTensorScalar* --v=1
```
and the output contains `Switch to out variant for node: %2 : Tensor = prim::NumToTensor(%0)`.
Reviewed By: mikeiovine
Differential Revision: D32014194
fbshipit-source-id: e7df65ea1bf05d59c1fc99b721aee420e484f542
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67437
Certain ops do nothing on the forward pass and can be discarded after training: `aten::detach` and `fb::scale_gradient` are examples of this.
Test Plan: `buck test caffe2/test:jit -- test_freezing`
Reviewed By: hlu1
Differential Revision: D31980843
fbshipit-source-id: 0045b6babcfae786a2ce801b2f5997a078205bc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67441
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31992093
fbshipit-source-id: 88191c13d229ffeac4e5b17b78e25f51d3f7f23e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67550
`aten::__is__` and `aten::__isnot__` are extremely problematic for a large number of SR graph optimizations.
Some examples:
- Removing ops that are no-ops in the forward pass like `aten::detach`. This would normally be trivial, but `is` introduces corner cases like this:
```
def forward(x):
y = x.detach()
return x is y
```
We get `False` before optimizations. But after optimizations, the test becomes `x is x`, and we get `True`.
- `ReplaceWithCopy`: the pass that replaces ops like `aten::to` with an out variant that copies its input. The following graph returns `True` before optimizations, but `False` afterwards
```
def forward(x):
y = x.to(x.dtype)
return x is y
```
- And many more, `FuseListUnpack` can break too
Since 99.99% of users do not use these ops, rejecting them so we don't have to deal with these corner cases is not a big deal.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D32022584
fbshipit-source-id: d135938edb2299c9b8f9511afac2bf568578879e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67346
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like TupleUnpack). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: d1jang
Differential Revision: D31965159
fbshipit-source-id: 86a69c395f401c4a4c55daa4c5fe80764383c8e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67341
Native ops are faster than falling back to the JIT interpreter, sometimes significantly (we've previously seen this with ops like `TupleUnpack`). We should improve op coverage where possible.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31962589
fbshipit-source-id: 3107fb169c1b02fb2bafbb355c005669b5fa8435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67255
Add an out variant for `aten::where`.
Since this op can be implemented quite trivially in NNC with `ifThenElse`, I added an NNC kernel as well.
Test Plan: Unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: navahgar
Differential Revision: D31923886
fbshipit-source-id: b4379ee3aaf31a000e626b4caeafd3e3f3d60837
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67001
The overload of `operator()` taking `std::vector<at::Tensor>` was only used for testing. In a diff following this one, I will add a new overload that takes `std::vector<c10::IValue> args` and no `kwargs` so we can avoid default-constructing `kwargs` everywhere.
This new overload will probably take a forwarding reference, so to avoid problems with overloading on a forwarding reference and to simplify the interface, it's best to remove this unused one.
Test Plan:
`buck test caffe2/benchmarks/static_runtime/...`
`buck test caffe2/test:static_runtime`
Reviewed By: hlu1
Differential Revision: D31821990
fbshipit-source-id: 6d2e4a75ca4abe6e262651532eb96c3b274c6f4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66648
Currently, SR shallow-copies its `IValue` inputs when running inference. We can avoid refcount bumps by `std::move`-ing the inputs into their slots. To achieve this, I've made the following changes:
1. Add an overload for `set_inputs` that takes a `std::vector<IValue>&&`.
2. Change the signatures of `StaticModule::operator()` and `StaticRuntime::operator()`.
Old:
```
operator()(const std::vector<IValue>& args, const std::unordered_map<std::string, IValue>& kwargs)
```
New:
```
template <class IValueList>
operator()(IValueList&& args, const std::unordered_map<std::string, IValue>& kwargs)
```
The implementations use perfect forwarding to invoke the correct overload of `set_inputs`.
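For illustration, a self-contained toy sketch of the forwarding pattern (toy types; not the actual SR classes):
```
#include <utility>
#include <vector>

struct Runner {
  // Copying overload: used when the caller still needs the inputs.
  void set_inputs(const std::vector<int>& args) { inputs_ = args; }
  // Moving overload: avoids the copy/refcount cost when the caller is done.
  void set_inputs(std::vector<int>&& args) { inputs_ = std::move(args); }

  template <class List>
  void operator()(List&& args) {
    // std::forward preserves the caller's value category, so the right
    // set_inputs overload is selected: lvalue -> copy, rvalue -> move.
    set_inputs(std::forward<List>(args));
  }

  std::vector<int> inputs_;
};

int main() {
  Runner r;
  std::vector<int> a{1, 2, 3};
  r(a);             // copies
  r(std::move(a));  // moves
}
```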
Test Plan: Added a short new unit test to exercise the new code path. All other unit tests still pass.
Reviewed By: hlu1
Differential Revision: D31659973
fbshipit-source-id: b8c194405b54a5af1b418f8edaa1dd29a061deed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66940
`aten::index`'s schema is as follows:
```
"aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
```
The current implementation assumes `indices`' elements are all tensors by doing `elem.toTensor()`, which is incorrect. This change creates an empty optional value if an element of `indices` is not a tensor.
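A hedged sketch of the fix (fragment; `index_list` stands in for the node's `indices` input as a sequence of IValues):
```
// Fragment: build an optional-tensor index list; None elements become empty
// optionals instead of being forced through toTensor().
c10::List<c10::optional<at::Tensor>> indices;
for (const auto& elem : index_list) {
  indices.push_back(
      elem.isTensor() ? c10::optional<at::Tensor>(elem.toTensor())
                      : c10::nullopt);
}
```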
Test Plan: Fixed `StaticRuntime, IndividualOps_Index` to correctly test `aten::index` with `indices` that contains `None`.
Reviewed By: hlu1
Differential Revision: D31712145
fbshipit-source-id: be1c29674bcd55b67b0dcc2a988bc37fd43745f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64181
This PR replaces all the calls to:
- `transpose(-2, -1)` or `transpose(-1, -2)` by `mT()` in C++ and `mT` in Python
- `conj().transpose(-2, -1)` or `transpose(-2, -1).conj()` or `conj().transpose(-1, -2)` or `transpose(-1, -2).conj()` by `mH()` in C++ and `mH` in Python.
It also simplifies two pieces of code and fixes one bug where a pair of parentheses was missing in the function `make_symmetric_matrices`.
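For reference, a small C++ sketch of the equivalences behind these replacements:
```
#include <ATen/ATen.h>

// The two equivalences behind the replacements above (C++ spellings): mT()
// transposes the last two dims, mH() additionally conjugates.
bool check_equivalences(const at::Tensor& A) {
  const bool mt_ok = at::equal(A.transpose(-2, -1), A.mT());
  const bool mh_ok = at::equal(A.conj().transpose(-2, -1), A.mH());
  return mt_ok && mh_ok;
}
```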
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D31692896
Pulled By: anjali411
fbshipit-source-id: e9112c42343663d442dc5bd53ff2b492094b434a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;x++)`
to the format
`for(const auto var: irange(xmax))`
This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
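For reference, a minimal example of the rewrite:
```
#include <c10/util/irange.h>

// Before/after for the loop rewrite described above.
int sum(const int* data, int n) {
  int out = 0;
  // Old form: for (int i = 0; i < n; i++) out += data[i];
  for (const auto i : c10::irange(n)) {  // new form
    out += data[i];
  }
  return out;
}
```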
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525
This should solve https://github.com/pytorch/pytorch/issues/60015
There were two `q_zero_point()` accesses inside a for loop which was
expensive. Moving them to before the loop sped things up 10x for a
microbenchmark.
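An illustrative sketch of the hoisting (assumes a per-tensor-quantized quint8 tensor; not the actual interpolate kernel):
```
#include <ATen/ATen.h>

// q_zero_point()/q_scale() go through dispatch, so read them once before the
// loop instead of once per element.
double dequantize_sum(const at::Tensor& q) {
  const int64_t zero_point = q.q_zero_point();  // hoisted out of the loop
  const double scale = q.q_scale();             // hoisted out of the loop
  const auto* data = q.data_ptr<c10::quint8>();
  const int64_t n = q.numel();
  double acc = 0;
  for (int64_t i = 0; i < n; ++i) {
    acc += (static_cast<double>(data[i].val_) - zero_point) * scale;
  }
  return acc;
}
```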
Test Plan:
```
// comment out benchmarks unrelated to original issue, for simplicity
cd benchmarks/operator_benchmark
python -m pt.qinterpolate_test
// before: 2994 us
// after: 324 us
// full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D31592422
fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65429
The sizes of these arrays can't change, so there's no need to waste an extra pointer on them.
ghstack-source-id: 140532722
Test Plan:
CI
I profiled this diff and the previous diff together. Comparing time spent in the operator functor handler for to_copy, I see the load instruction fetching the inputs pointer from p_node on https://www.internalfb.com/code/fbsource/[4c98a83b2451fa6750f38796c91ebb0eb0afd800]/fbcode/caffe2/torch/csrc/jit/runtime/static/ops.cpp?lines=947 (`p_node->Input(0).toTensor()`) improved a tiny bit, and the overall time spent in that wrapper decreased from 0.8% to 0.7%.
Reviewed By: hlu1
Differential Revision: D31096042
fbshipit-source-id: 35c30462d6a9f9bd555d6b23361f27962e24b395
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66557
The test was previously using `at::empty_strided` to initialize one of its inputs. The contents of the tensor returned by this function are random, uninitialized memory. If we happened to get a NaN, this test would fail since `use_equalnan` was not set.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31611961
fbshipit-source-id: 79a9476d0d6ce7a9f1412eefcef19bc2618c54b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65515
This change enables `StaticRuntime` to manage output tensors (returned from a graph) as follows:
- At the creation of `StaticModule`, it gathers a set of candidates for output tensors (& their aliases) for managing. This is done by `ValueGroup` introduced by the previous diff.
- At the end of the 1st iteration, `MemoryPlanner` creates a set of output `at::Tensor*` to manage. This set consists of tensors objects from the aforementioned candidates, excluding the direct output value of the graph to simplify ivalue ownership passing (`std::move(ivalue)` to return from SR). Note that this exclusion has no perf implication for inline_cvr & ctr_mobilefeed since they only return a container object (e.g., tuple).
- The 2nd and later iterations preallocate a slab of memory and all output tensors identified during the 1st iteration. Note that these preallocated tensors are *NOT* deallocated when returned from SR. The client receives the output tensors, finishes using them, and is responsible for calling `StaticRuntime::deallocateOutputTensors()` to deallocate them (see the usage sketch after this list). This mandates that SR cannot be reentered until `deallocateOutputTensors` is called by the client.
- In case of a buggy client missing a call to `StaticRuntime::deallocateOutputTensors()`, SR throws an exception when reentered instead of leaking memory.
- Nit: I plan to use camelCase for function names, so all newly introduced functions use camelCase despite inconsistencies with the existing snake_case. We can gradually fix the inconsistencies.
This change will be followed by another one to enable `manage_output_tensors` from `PyTorchScriptPredictor`, starting with `ptvsc2_prediction_bench` as a testbed.
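A hedged usage sketch of the client-side contract described above (fragment only; `runtime`, `args`, and `kwargs` are assumed to exist):
```
// Fragment: with manage_output_tensors enabled, the client may use the
// returned outputs only until it hands the slab back.
c10::IValue out = runtime(args, kwargs);  // 2nd+ iteration: outputs live in the slab
// ... read / copy whatever is needed out of `out` ...
runtime.deallocateOutputTensors();        // required before re-entering the runtime
```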
Test Plan:
- Added `StaticRuntime.ManageOutputTensors*` to cover the newly added code paths.
- Enhanced `testStaticRuntime` to exercise each unittest test case with `manage_output_tensors` on. Confirmed that SR actually managed output tensors successfully for a few existing testcases (e.g., StaticRuntime.EmbeddingBag`).
Reviewed By: hlu1
Differential Revision: D31049221
fbshipit-source-id: 4ad1599179cc7f00d29e0ce41b33f776226d4383
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65517
This change retrofits `GetAlwaysAliveValues` into `ValueGroup` to group the values used by a graph into three groups as follows:
- input_aliases: values that are either inputs or contain aliases of inputs or constants.
- output_aliases: values that are either outputs or contain aliases of outputs and are not in input_aliases.
- Values that don't show up in input_aliases or output_aliases are internally created and consumed within the graph.
`output_aliases` is the only new group introduced by this change, and a following diff will use this to preallocate output Tensors to accelerate Static Runtime's performance.
Test Plan: Added `ValueGroup.Init` to cover the updated code path. Note that there was no test for `GetAlwaysAliveValues` before.
Reviewed By: hlu1
Differential Revision: D30940955
fbshipit-source-id: 2cb065ecda0f447a61e64a7cf70cc7c6947f7dfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66161
`aten::add` is not guaranteed to be bit exact with the JIT interpreter. This was causing non-deterministic test failures on master.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31406764
fbshipit-source-id: d968cb1bdb8f33934682ef3712a1341a3aacf18e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65849
Add tests for some of `StaticModule`'s exposed methods. Both of these are used by the memory planner, so it would be helpful to have some unit tests that ensure our basic invariants don't break.
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D31282901
fbshipit-source-id: e390329f4794e034170507e3a0de0abcfe0ab7b9
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Do not delete `caffe2::OperatorBase::Output` calls as they have side effects
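For reference, illustrative forms of the suppressions described above:
```
#include <c10/macros/Macros.h>
#include <c10/util/irange.h>

// Register-style global kept only for its construction side effect.
C10_UNUSED static int registered = 0;
// constexpr instead of `static` for a global constant.
constexpr int kMaxIters = 16;

void spin() {
  for (const auto i : c10::irange(kMaxIters)) {
    (void)i;  // loop variable intentionally unused
  }
}
```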
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66041
Reviewed By: ngimel
Differential Revision: D31360142
Pulled By: malfet
fbshipit-source-id: 6fdfb9f91efdc49ca984a2f2a17ee377d28210c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65516
This change fixes a bug that Static Runtime's `aten::embedding_bag` out variant implementation creates aliases in its managed output tensors.
Managed output tensors should never alias each other, since writing to one can illegally overwrite another's contents unintentionally; this exact problem was causing the bug in T97393697, making SR return wrong values.
This bug is detected in inline_cvr/remote_ro by a DCHECK, `verify_no_memory_overlap` (introduced by D30211705 (3fb33b38b9)), but wasn't found until now since our testing didn't include running the model in debug mode. Fortunately this bug is not hitting production since the aliased outputs are not used there.
This change fixes the root cause from `_embedding_bag_cpu_impl_out` by replacing alias creation with copying.
Note that this change also includes a fundamental change in Static Runtime's unit testing: `testStaticRuntime` exercises the given graph 3 times:
1. profile run
2. run using the profile to allocate managed tensors
3. reuse the managed tensors -- newly added
Adding 3 reveals this bug with a new unittest `EmbeddingBagWithManagedOutput`.
Test Plan:
- Confirmed that the crash experienced by `StaticRuntime.EmbeddingBagWithManagedOutput` disappears with this change (crash paste: P459807248).
- Added `StaticRuntime.EmbeddingBagWithManagedOutput` to detect the same problem in the future.
Reviewed By: hlu1
Differential Revision: D31104345
fbshipit-source-id: 7bddf9cd82b400d18d8ce1bf15e29b815ef9ba8f
Summary:
Delete `-Wno-unused-variable` from top level `CMakeLists.txt`
Still suppress those warnings for tests and `torch_python`
Delete a number of unused variables from caffe2 code
Use `(void)var;` to suppress unused variable in range loops
Use `C10_UNUSED` for global constructors and use `constexpr` instead of `static` for global constants
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65954
Reviewed By: ngimel
Differential Revision: D31326599
Pulled By: malfet
fbshipit-source-id: 924155f1257a2ba1896c50512f615e45ca1f61f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65499
When the tensors in question are contiguous, there is no need to go through dispatch, use TensorIterator, etc.
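A hedged sketch of the kind of fast path this describes (generic; not the actual diff):
```
#include <ATen/ATen.h>
#include <cstring>

// Contiguous same-shape, same-dtype tensors can be copied byte-for-byte
// without going through dispatch / TensorIterator.
void fast_copy(at::Tensor& dst, const at::Tensor& src) {
  if (src.is_contiguous() && dst.is_contiguous() &&
      src.sizes() == dst.sizes() && src.dtype() == dst.dtype()) {
    std::memcpy(dst.data_ptr(), src.data_ptr(), src.nbytes());
    return;
  }
  dst.copy_(src);  // general path
}
```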
ghstack-source-id: 139549027
Test Plan:
Ran ptvsc2_predictor_bench for ctr_mobile_feed local net following https://fb.quip.com/q8hBAFGMeaOU (but without the profile and compare_results options).
Before:
I0922 14:00:32.261942 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.18124. Iters per second: 139.252
I0922 14:01:44.865965 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.25314. Iters per second: 137.871
I0922 14:02:56.929602 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.1986. Iters per second: 138.916
I0922 14:04:05.923025 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.89211. Iters per second: 145.093
I0922 14:05:17.953056 3132627 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 7.19577. Iters per second: 138.971
mean: 7.144172, stddev: 0.1283
After:
I0922 13:51:55.233937 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.79709. Iters per second: 147.122
I0922 13:53:03.062682 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.77605. Iters per second: 147.579
I0922 13:54:10.230386 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.70993. Iters per second: 149.033
I0922 13:55:18.403434 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.81044. Iters per second: 146.833
I0922 13:56:26.568646 3086245 PyTorchPredictorBenchLib.cpp:312] PyTorch run finished. Milliseconds per iter: 6.80965. Iters per second: 146.85
mean: 6.800632, stddev: 0.013227
Looks like about a 5.3% improvement.
Reviewed By: hlu1
Differential Revision: D31125492
fbshipit-source-id: 92ab5af242d0a84dcf865323a57b48e8374eb823
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65551
Previously we had a big switch on Op kind to decide how to lower a given
JIT operator to NNC. This PR changes this switch to a hash table lookup.
Why? This helps us with at least two things:
1) With this approach we can easily check if we know how to handle a
given node in advance - i.e. we can inspect the entire graph and tell
whether it's possible to compile it or not without actually trying to do
that and dying in the middle. This would allow us to, say, provide
user-friendly error messages in AOT workflow.
2) We can switch to using the schema instead of the op kind to determine the
correct lowering. Unlike the op schema, the op kind might be ambiguous
(see e.g. #64963), and using it instead of the schema can lead to bugs.
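A toy sketch of the registry-based dispatch (stand-in types; real NNC keys lowerings on the operator schema):
```
#include <functional>
#include <string>
#include <unordered_map>

// Stand-in signature: the real lowering functions produce TE expressions.
using LoweringFn = std::function<int(int)>;

std::unordered_map<std::string, LoweringFn>& lowerings() {
  static std::unordered_map<std::string, LoweringFn> registry;
  return registry;
}

// Cheap up-front query: lets us inspect a whole graph and report unsupported
// ops before attempting to compile it.
bool canLower(const std::string& schema) {
  return lowerings().count(schema) != 0;
}
```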
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D31148926
Pulled By: ZolotukhinM
fbshipit-source-id: ac12684e2126c899426ef5e4cc1e3f70fa01f704
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65741
This op previously assumed `axis == 1`, causing graphs that would otherwise be valid to return incorrect results after fusing.
Reviewed By: hlu1
Differential Revision: D31234944
fbshipit-source-id: 89885a3b119357698ebd9fd429b009813260a2f4
Summary:
This PR attempts to port `baddbmm` and `bmm` to structured kernels. The reason both are in the same PR is that a lot of the code is common to both ops, including the checks and the implementation.
Issue tracker: https://github.com/pytorch/pytorch/issues/55070
cc: ysiraichi ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64805
Reviewed By: gchanan
Differential Revision: D31134454
Pulled By: ezyang
fbshipit-source-id: 3294619834a8cc6a0407aea660c556d3a42b6261
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65387
Added a customized NNC implementation for signed log1p kernel and enabled the fusion pass that adds the fused signed log1p op.
Also, added a SR microbenchmark for this kernel which shows the performance improvement.
Without fusion:
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16 1953 ns 1953 ns 358746
BM_signed_log1p/64 2049 ns 2049 ns 342145
BM_signed_log1p/512 3291 ns 3291 ns 214342
BM_signed_log1p/4096 15559 ns 15559 ns 44420
BM_signed_log1p/32768 101936 ns 101935 ns 6843
BM_signed_log1p/65536 194792 ns 194789 ns 3615
```
With NNC fusion:
```
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_signed_log1p/16 369 ns 369 ns 1896179
BM_signed_log1p/64 497 ns 497 ns 1406995
BM_signed_log1p/512 1618 ns 1618 ns 430209
BM_signed_log1p/4096 11327 ns 11326 ns 61463
BM_signed_log1p/32768 84099 ns 84086 ns 8325
BM_signed_log1p/65536 166531 ns 166510 ns 4186
```
This clearly shows >15% improvement in performance of this kernel with NNC fusion.
On inline_cvr local model, there is a small improvement in terms of profiled time spent on ops:
without fusion: `0.9%` (computed by adding the % spent on all the 4 ops involved)
with NNC fusion: `0.55%`
Test Plan:
`buck test mode/opt-clang //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`
Also, did the accuracy test with inline_cvr as described here, https://fb.quip.com/qmdDAJzEmPtf, on the full size model (285298536_1)
```
get 57220 prediction values
get 57220 prediction values
max_error: 0 total: 0
```
Reviewed By: hlu1
Differential Revision: D30609492
fbshipit-source-id: d2e68df580569a30ee61abb0ef18d2c4c56827bd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65118
Cloning the module can increase memory use. By freezing the module directly without cloning it first, we can avoid this memory usage increase.
Reviewed By: eellison, movefast1990
Differential Revision: D30955053
fbshipit-source-id: 2feb738eddcf66aa68c92bf695cc05b57bd990f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64934
Add a new op `static_runtime::VarTupleUnpack` and a graph pass transforming graph sequences from:
```
%0, %1 = prim::TupleUnpack(%a)
%2, %3 = prim::TupleUnpack(%b)
```
into:
```
%0, %1, %2, %3 = static_runtime::VarTupleUnpack(%a, %b)
```
The pass is only applied to contiguous blocks of `TupleUnpack` nodes. This is the most straightforward way to guarantee correctness, and it is sufficient for the models we care about.
Test Plan: New unit tests: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- VarTupleUnpack`
Reviewed By: d1jang
Differential Revision: D30872109
fbshipit-source-id: 1ed4a7e201c532da28f703a3a50241c392a6c7e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65123
This change re-reverts D30883290 (0e11454d19). D30883290 (0e11454d19) broke the OSS build since it implicitly removed the default move constructor of `StaticRuntime`.
```
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:95:10: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57 return torch::jit::StaticRuntime(*smod);
Sep 15 15:39:57 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /var/lib/jenkins/workspace/benchmarks/static_runtime/deep_wide_pt_bench.cc:99:9: error: call to implicitly-deleted copy constructor of 'torch::jit::StaticRuntime'
Sep 15 15:39:57 auto sr = getStaticRuntime();
Sep 15 15:39:57 ^ ~~~~~~~~~~~~~~~~~~
Sep 15 15:39:57 /var/lib/jenkins/workspace/torch/csrc/jit/runtime/static/impl.h:321:34: note: copy constructor of 'StaticRuntime' is implicitly deleted because field 'planner_' has a deleted copy constructor
Sep 15 15:39:57 std::unique_ptr<MemoryPlanner> planner_;
Sep 15 15:39:57 ^
Sep 15 15:39:57 /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/unique_ptr.h:356:7: note: 'unique_ptr' has been explicitly marked deleted here
Sep 15 15:39:57 unique_ptr(const unique_ptr&) = delete;
Sep 15 15:39:57 ^
Sep 15 15:39:57 2 errors generated.
```
This change fixes the issue by explicitly defining the default move constructor (courtesy of mikeiovine).
Original Summary:
This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.
`MemoryPlanner` performs an independent sub-task of static analysis of a graph, and creating memory planning, and allocating/deallocating managed Tensors.
This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.
Test Plan: - Confirm that OSS build went well (See External Tests section).
Reviewed By: mikeiovine
Differential Revision: D30983292
fbshipit-source-id: a59f407fa1123527824157268111144a1bf58116
Summary:
Syncing nvfuser code base from the devel branch. Listing a few of our developments since the last sync:
- Extends support to normalization and reduction kernels.
- Multiple kernel launches for a single `CudaFusionGroup`. The hierarchical caching system has been updated to cache graph segmentation.
- profile_ivalue is enabled to convert dynamic scalars into compile-time constants, which are required by the codegen (e.g. reduction axes).
To keep this PR simple and relatively review-free, we stripped most external changes and submitted them as separate PRs, so this gigantic PR is easier to handle.
Internal updates are files located in:
1. updates in nvfuser codegen `torch/csrc/jit/codegen/cuda`
2. added nvfuser specific benchmarks `benchmarks/cpp/nvfuser`
3. nvfuser jit cpp tests `test/cpp/jit/test_gpu.cpp` `test/cpp/jit/test_gpu_shift.cpp` `test/cpp/jit/test_gpu_validator.h`
updates affecting integration:
1. profile_ivalue enabled for nvfuser. Related changes are in `torch/csrc/jit/runtime/*`.
2. Exposed a few more symbols in `aten/src/ATen/core/*` used by codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63745
Reviewed By: saketh-are
Differential Revision: D30752939
Pulled By: malfet
fbshipit-source-id: ce122e80f01bcd3865f5bd3c4dfde660665fd84c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013
This change enhances the current memory overlap check to include outputs: the enhancement enforces a constraint that all outputs of a node should NOT overlap with each other, since a node updates all of its outputs at the same time.
This check will detect a problem like T97393697 immediately in debug mode.
Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`
- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.
Reviewed By: hlu1
Differential Revision: D30211705
fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887
BufHandle has exactly the same functionality and should be used instead.
Differential Revision: D30889483
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707
Use torch.randn instead of torch.from_numpy to generate the tensor
Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test
Reviewed By: jingsh
Differential Revision: D30817302
fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64205
The log_vml version of the micro-bench is over **2x** faster than the log1p version. Here are the perf numbers:
```
---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------
SignedLog1pBench/ATen/10/1467 45915 ns 45908 ns 14506 GB/s=2.5564G/s
SignedLog1pBench/NNC/10/1467 40469 ns 40466 ns 17367 GB/s=2.9002G/s
SignedLog1pBench/NNCLogVml/10/1467 19560 ns 19559 ns 35902 GB/s=6.00016G/s
```
Thanks to bertmaher for pointing this out.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D30644716
Pulled By: navahgar
fbshipit-source-id: ba2b32c79d4265cd48a2886b0c62d0e89ff69c19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647
Add support for benchmarking of 8-bit quantization of N-D batched embeddings. Currently this only works for 3-dim embeddings and still requires thought on ramping up from 3-dim to N-dim.
Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test```
Reviewed By: jingsh
Differential Revision: D30770085
fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64209
Add a new fusion pass that transforms the following pattern:
```
graph(%input):
%0 : Tensor = aten::sign(%input)
%1 : Tensor = aten::abs(%input)
%2 : Tensor = aten::log1p(%1)
%res : Tensor = aten::mul(%0, %2)
return (%res)
```
Into a single op:
```
graph(%input):
%res : Tensor = static_runtime::signed_log1p(%input)
return (%res)
```
The intent is to reduce the number of passes over the tensor. However, enabling this pass actually causes a performance regression, probably due to a lack of vectorization in the fused implementation. Because of this issue, this diff **does not** enable this pass.
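For reference, a scalar sketch of what the fused op computes:
```
#include <cmath>

// Scalar sketch of the fused semantics: sign(x) * log1p(|x|), in one pass
// instead of four separate elementwise ops.
float signed_log1p(float x) {
  const float sign = static_cast<float>((x > 0.0f) - (x < 0.0f));
  return sign * std::log1p(std::fabs(x));
}
```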
Followup: navahgar will add an NNC kernel which is faster than the unfused version and enable this pass. We still need this version as a fallback since the NNC kernel will not support all dtypes.
Test Plan:
`buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- SignedLog1p`
Test passed with new graph pass disabled and enabled.
Reviewed By: hlu1
Differential Revision: D30559929
fbshipit-source-id: e4e080cb2e6a705cfdde1fc98bee92b723f8132a
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64159
Test Plan:
Confirm out variant is called for both versions:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: mikeiovine
Differential Revision: D30622819
fbshipit-source-id: a2c8c7f969dae5f507718fb3d513e1fb4f026736
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64157
The UseVariadicCat optimization is not applied to `aten::cat` if the list input to the op cannot be moved to a position before the op (https://fburl.com/diffusion/l6kweimu). For these cases we need an out version for SR.
Test Plan:
Confirm out variant is called:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: d1jang
Differential Revision: D30598574
fbshipit-source-id: 74cfa8291dc8b5df4aef58adfb1ab2a16f10d90a
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64070
Test Plan:
Confirm out variant is called for both versions:
```
> buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --v=1
```
Reviewed By: d1jang
Differential Revision: D30595816
fbshipit-source-id: e88d88d4fc698774e83a98efce66b8fa4e281563
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078
This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the `tmp1` and `tmp2` Tensors are managed by the static runtime.
Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
at::Tensor mean = create_empty_from({M}, *X);
at::Tensor rstd = create_empty_from({M}, *X);
```
that the static runtime misses an opportunity to manage.
This change puts them into (unused) output Tensors of a new placeholder op, `static_runtime::layer_norm`, so that the static runtime can manage them, since the static runtime as of now chooses to manage only output tensors.
Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.
- Confirmed that the new op gets activated during testing:
```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)
```
Reviewed By: hlu1
Differential Revision: D30486475
fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64024
`aten::expand_as` creates a view of the input tensor. This change adds its native op implementation for the static runtime.
Test Plan: - Added `StaticRuntime.IndividualOps_ExpandAs`
Reviewed By: hlu1
Differential Revision: D30546851
fbshipit-source-id: e53483048af890bc41b6192a1ab0c5ba0ee2bdc0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579
Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`
Reviewed By: navahgar
Differential Revision: D30410525
fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587
Now that there are no classes using KernelArena for memory management, we can remove it.
Differential Revision: D30429115
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586
This is another commit in the transition away from KernelArena memory management.
Tensor is essentially just a pair of <BufPtr, StmtPtr> and we don't need
to dynamically allocate it at all - it's cheap to pass it by value, and
that's what we're switching to in this commit.
After this change nothing uses KernelScope/KernelArena and they can be
safely removed.
Differential Revision: D30429114
Test Plan: Imported from OSS
Reviewed By: navahgar
Pulled By: ZolotukhinM
fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63778
This is a preparation for a switch from raw pointers to shared pointers
as a memory model for TE expressions and statements.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D30487425
Pulled By: ZolotukhinM
fbshipit-source-id: 9cbe817b7d4e5fc2f150b29bb9b3bf578868f20c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63398
This change provides a native `__getitem__` implementation for lists to avoid overhead associated with falling back to the JIT interpreter.
Test Plan: Unit tests: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D30368464
fbshipit-source-id: e0e0971508cd5d9bcf6025606993dc24ecbf6764
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63350
Add a native implementation for `aten::append`, the list append op.
Test Plan: New unit test: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Append`
Reviewed By: hlu1
Differential Revision: D30326461
fbshipit-source-id: 0dbdf6cc82e78c7c36db39583256f6b87385e3d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62347
This diff includes tests for all `aten` ops that did not already have test coverage.
Test Plan: `buck test //caffe2/benchmarks/static_runtime/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D29968280
fbshipit-source-id: 768655ca535f9e37422711673168dce193de45d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62335
This change ensures that unittests only use out variants or native ops.
- Our unittests currently assume that a graph fed to the static runtime correctly replaces an interpreter op with its corresponding out variant / native op, but this is not checked by the unittest. This change ensures that it is.
- We relied on manual inspection of log messages to see if an out variant is used for a specific workload even for unittesting. This change frees us from doing that.
- `aten::add` is excluded from this check since it's only enabled for an internal workload. Also some unittests are excluded by using `expect_interpreter_op = true` since they are written to use interpreter ops by design.
Test Plan: Ran `buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest` successfully.
Reviewed By: mikeiovine, hlu1
Differential Revision: D29952381
fbshipit-source-id: e60e70b80ccf45e91c6654b4ad53f92ffd5ab702
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods `set_tensor(.)` and `get_tensor()` in the Python-exposed API of the C++ logic with `buffer()` and `set_buffer(.)` for a cleaner interface.
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62622
This allows us to catch cases where an out variant is being tested but the test author forgot to call `.clone()` in the test script. More than two ops does not guarantee that the memory planner is being exercised, but fewer than two guarantees that it is not being used.
Reviewed By: hlu1
Differential Revision: D30058050
fbshipit-source-id: 5bc053736f1cc6fd1ffcf8254bf38874ac18c34b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62064
`testStaticRuntime` was previously only available in `test_static_runtime.cc`. It has been moved to a common library, `test_utils`, to facilitate code re-use. This also lets us test dynamic shapes in `test_fb_operators`.
Reviewed By: hlu1
Differential Revision: D29858928
fbshipit-source-id: 68a94760166ddb745972b0f1fc24bed594937d1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62067
The wrapper for `aten::cat` is no longer needed after the variadic cat change in D29565344 (ae58a4c45d).
Also added a simple test for dynamic shapes, i.e., the input tensors in `args2` are larger than those in `args1`.
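A sketch of what such a dynamic-shape test can look like; the test name and the exact `testStaticRuntime` signature are assumptions based on how the helper is used elsewhere in these tests:
```
#include <gtest/gtest.h>
#include <ATen/ATen.h>
#include "test_utils.h"  // assumed location of testStaticRuntime

TEST(StaticRuntime, VarCatDynamicShapes) {
  // The script output is cloned so the result is managed by the memory planner.
  const std::string cat_script = R"JIT(
    def forward(self, a: Tensor, b: Tensor, dim: int):
        return torch.cat([a, b], dim).clone()
  )JIT";

  std::vector<c10::IValue> args1{at::randn({2, 3}), at::randn({2, 3}), 0};
  // args2 has larger shapes than args1, which forces reallocation on the
  // second run and exercises the dynamic-shape path.
  std::vector<c10::IValue> args2{at::randn({8, 12}), at::randn({8, 12}), 0};
  testStaticRuntime(cat_script, args1, args2);
}
```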
Reviewed By: navahgar, mikeiovine
Differential Revision: D29864600
fbshipit-source-id: 44a712c2e776815c09e0bf5631412149b81274b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62098
The build was broken by D29821533 (1d2ea76afb). The `clamp` overloads used in `deep_wide.h`
are no longer available in the `at::native` namespace.
Use `at::cpu::clamp` and `at::cpu::clip_out` (which should be an alias for clamp) instead.
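For illustration, the substitution looks roughly like this (a sketch, not the exact diff):
```
#include <ATen/ATen.h>
#include <ATen/CPUFunctions.h>  // at::cpu:: wrappers for structured kernels

// Sketch: clamp through the CPU dispatch wrapper instead of the removed
// at::native overload.
at::Tensor clamp01(const at::Tensor& input) {
  // Previously something like: at::native::clamp(input, 0.0, 1.0);
  return at::cpu::clamp(input, 0.0, 1.0);
}
```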
Reviewed By: hlu1
Differential Revision: D29880187
fbshipit-source-id: 210b6d2be8a8142e7af1a0ba07e55a95b1a77d25
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` clang-tidy check, and so is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` were generated using the following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61783
Implement two new prim operators for static runtime: `isinstance` and `TypeCheck`. `isinstance` is very straightforward, but there were a few wrinkles with implementing `TypeCheck`:
1. There is no way to directly generate `TypeCheck` nodes from TorchScript, they are generated by the JIT at runtime. This makes testing a little difficult. I had to make some modifications to `testStaticRuntime` to allow for the use of IR and TorchScript tests.
2. The behavior of `prim::TypeCheck` as implemented here does not match up 1:1 with the version implemented in the interpreter! This is because grad mode is disabled in static runtime. Here's an example.
IR is the same as the one included in this test, but with `requires_grad == 1`
```
graph(%a.1 : Tensor,
%b.1 : Tensor):
%t0 : Float(2, 2, strides=[2, 1], device=cpu, requires_grad=1), %t1 : Float(3, 3, strides=[3, 1]), %type_matched : bool = prim::TypeCheck[types=[Float(2, 2, strides=[2, 1], device=cpu, requires_grad=1), Float(3, 3, strides=[3, 1])]](%a.1, %b.1)
return (%t0, %t1, %type_matched)
```
And in the test setup:
```
auto a = at::zeros({2, 2}, at::kFloat);
a.to(at::kCPU);
a.set_requires_grad(true);
auto b = at::ones({3, 3}, at::kFloat);
std::vector<IValue> args_correct = {a, b};
// prim::TypeCheck should be true with args_correct,
// but we get false when using static runtime!
```
Reviewed By: hlu1
Differential Revision: D29743862
fbshipit-source-id: db1788f0f5de42bab42602e8cc24eee04cbcc280
Summary:
There is some convoluted logic here to fix the `benchmarks` module imports for pytest.
- On one hand, if we want to use `tools.stats.scribe` from `benchmarks`, we will need to add `benchmarks/__init__.py`
- On the other hand, if we add `benchmarks/__init__.py`, it breaks how `pytest` resolves the system-installed `torch` instead of the local source module `../torch`
- That's why we are seeing errors like
```
ImportError while loading conftest '/var/lib/jenkins/workspace/benchmarks/fastrnns/conftest.py'.
benchmarks/fastrnns/__init__.py:1: in <module>
from .cells import * # noqa: F403
benchmarks/fastrnns/cells.py:1: in <module>
import torch
torch/__init__.py:29: in <module>
from .torch_version import __version__ as __version__
torch/torch_version.py:9: in <module>
from .version import __version__ as internal_version
E ModuleNotFoundError: No module named 'torch.version'
```
Instead, this PR changes the usage of `upload_scribe.py` back to its original form using an HTTP request; for now only CircleCI will continue down this path via `python benchmarks/upload_scribe.py`, which is gated by `if [[ -z "${GITHUB_ACTIONS}" ]];`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61808
Reviewed By: seemethere
Differential Revision: D29750188
Pulled By: zhouzhuojie
fbshipit-source-id: 3b842b21978f2159001e9c6c1cdc96c5a0515f2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61595
Add out variant wrapper for `aten::linear` in the static runtime
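A rough sketch of the out-variant pattern the static runtime uses; the registration macro and especially the `at::linear_out` call below are assumptions for illustration, not verbatim from this diff. The point is that the output tensor is created once and then reused, so the memory planner can manage it:
```
#include <torch/csrc/jit/runtime/static/ops.h>

namespace torch {
namespace jit {

// Sketch: out-variant wrapper. On the first run the output is created; on
// subsequent runs the existing managed tensor is resized and written in place.
REGISTER_OPERATOR_FUNCTOR(
    aten::linear,
    aten_linear,
    [](Node* /*n*/) -> SROperator {
      return [](ProcessedNode* p_node) {
        const auto& input = p_node->Input(0).toTensor();
        const auto& weight = p_node->Input(1).toTensor();
        const auto bias = p_node->Input(2).toOptional<at::Tensor>();
        if (p_node->Output(0).isNone()) {
          p_node->Output(0) = at::linear(input, weight, bias);
          return;
        }
        auto& out = p_node->Output(0).toTensor();
        fastResizeToZero(out);
        // Hypothetical out-variant call; the real wrapper may compose
        // lower-level out ops (e.g. addmm_out / matmul_out) instead.
        at::linear_out(out, input, weight, bias);
      };
    });

} // namespace jit
} // namespace torch
```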
Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest`
Reviewed By: hlu1
Differential Revision: D29684236
fbshipit-source-id: 94df6d7267b3f269b2cadf065f207648777147df
Summary:
Related to https://github.com/pytorch/pytorch/issues/61632
This PR adds
- refactoring of scribe-related code into `scribe.py`
- a change to the `render_test_results` job to always use the `linux.2xlarge` runner
- a fallback to boto3 if `SCRIBE_GRAPHQL_ACCESS_TOKEN` is empty
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61675
Reviewed By: seemethere
Differential Revision: D29703523
Pulled By: zhouzhuojie
fbshipit-source-id: 829ad3630d3500a498b41aa458ce6539aaeae938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61507
Benchmark Python-only DDP vs production C++ based DistributedDataParallel.
- Implemented a pure Python DDP, `PythonDDP`, with support for SYNC and ASYNC reduction
- Added `compare_ddp` to measure the difference in the forward and backward steps
Kudos to Shen and Yi for the great idea.
Test Plan:
Tested on devgpus with 2 CUDA devices.
$ python compare_ddp.py
Python-only DDP has slightly better (-1%) forward performance and slightly slower (2%-20%) backward performance.
This suggests that we need to keep the C++ core, since the maximum latency increase can be 20%. See README.md for details.
Imported from OSS
Differential Revision: D29685364
Reviewed By: mrshenli
Pulled By: bowangbj
fbshipit-source-id: 429e4473fac0ec4c70d6db12d946d2636dd6477a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61566
This change uses `at::allclose` to compare results from the sigmoid implementations (CPU/NNC) instead of exact equality (`Tensor::equal`), due to small numerical differences between them.
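Roughly, the comparison in the test helper changes from exact equality to a tolerance-based check (a sketch under assumed variable names):
```
#include <ATen/ATen.h>

// Sketch: bitwise equality is too strict when comparing a reference CPU
// result against an NNC-generated kernel, so compare within tolerances.
bool results_match(const at::Tensor& expect, const at::Tensor& actual) {
  // Previously: return expect.equal(actual);
  return at::allclose(expect, actual, /*rtol=*/1e-5, /*atol=*/1e-8);
}
```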
Test Plan:
I confirmed that the flakiness of `StaticRuntime.Sigmoid` is gone with this change:
```
[djang@devvm1999.ftw0 ~/fbsource/fbcode] buck-out/gen/caffe2/benchmarks/static_runtime/static_runtime_cpptest -v 3 --gtest_filter=StaticRuntime.Sigmoid --gtest_repeat=100 &> output.txt
[djang@devvm1999.ftw0 ~/fbsource/fbcode] grep PASSED output.txt | wc
100 500 2100
```
Reviewed By: bertmaher
Differential Revision: D29671203
fbshipit-source-id: 99a7b16d18ea047c9aad444f36d8368f9d0b088d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61301
This change adds a `DCHECK` to ensure that outputs do not overlap with immutable inputs.
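A sketch of the kind of check this adds; the helper below and where it hooks into `ProcessedNode` are assumptions, and the real check may use ATen's memory-overlap utilities for finer granularity:
```
#include <ATen/ATen.h>

// Sketch: an output must not share storage with an input that the op is not
// allowed to mutate. is_alias_of is the simplest conservative version of
// this check.
bool output_overlaps_immutable_input(
    const at::Tensor& output,
    const at::Tensor& immutable_input) {
  return output.is_alias_of(immutable_input);
}

// In the runtime this would sit behind a debug-only assertion, e.g.:
//   DCHECK(!output_overlaps_immutable_input(out, in));
```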
Test Plan:
Added unittests as follows:
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithImmutableArguments`
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithMutableArguments`
Reviewed By: hlu1
Differential Revision: D29564158
fbshipit-source-id: bf14b4978ab544af79010cf724ed28202b4521cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61000
Add unit tests to bmm and addmm operators in static runtime.
Test Plan:
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D29459679
fbshipit-source-id: 5c7fa5c9b0675c1c84f3ae3110204d663255009c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57334
Here's a possibly controversial PR. These counters got in the way of
generalizing the fuser tests to handle arbitrary devices, and I guess I'm just
generally skeptical that they provide much value. While it's true that they let us
observe whether fusion groups were created, we already have assertions based on
the shape of the graph, and I'm not sure that I trust those any less than these
counters.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D29471484
Pulled By: bertmaher
fbshipit-source-id: f6d76f6e72dbfb581acff1d834b0c74500941b57
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60669
Test Plan: Added unit test to check for nested outputs.
Reviewed By: ajyu
Differential Revision: D29322025
fbshipit-source-id: a3c8d3c5f0bb7cf7fda4bc5f579adb8fa7bc3724
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60631
Per #48360, speed up `Transformer.generate_square_subsequent_mask`. The new impl is informally ~5x faster, though the absolute difference is probably small.
The PR includes Python and C++ versions, as well as updates to a couple of places where the previous impl had been copied around.
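For reference, a sketch of the faster construction in ATen terms (the exact helper the PR lands may differ): build the mask in one shot with `full` + `triu` instead of boolean masking followed by `masked_fill`.
```
#include <ATen/ATen.h>
#include <limits>

// Sketch: causal mask with -inf above the diagonal and 0 elsewhere, built
// directly rather than via a boolean mask and masked_fill.
at::Tensor square_subsequent_mask(int64_t sz) {
  return at::triu(
      at::full({sz, sz}, -std::numeric_limits<float>::infinity()),
      /*diagonal=*/1);
}
```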
Test Plan: Imported from OSS
Reviewed By: jbschlosser, albanD
Differential Revision: D29356673
Pulled By: bhosmer
fbshipit-source-id: 4c062ba0ead61a445aeef451c78777bf0b3a631e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60503
Fixed a few issues in the static_runtime::to_copy impl:
- fixed a bug with memory_format
- copy strides when appropriate. This is necessary to make sure that the fbgemm path in the copy kernel gets hit.
- fix the schema in the `ReplaceWithCopy` pass
- add registration of `static_runtime::to_copy.other`
Add more unit tests:
- test dynamic shapes
- test strided input tensor to `aten::to`
- test alias case (same input/output)
- test `to.other`
Reviewed By: ajyu
Differential Revision: D26838933
fbshipit-source-id: ec0d1a2deebe998fcfe8858e772e1ef429cb4522
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60579
- Modify testStaticRuntime to take two sets of inputs, so that if the second set of inputs has bigger shapes, it triggers memory allocations in `resize_` calls.
- Modify test scripts so that the output of the test op is managed by the memory planner, as explained in comments.
Reviewed By: ajyu
Differential Revision: D29221452
fbshipit-source-id: 09f0f7eb384dc8ca67594f1fa76e1e31392ee6ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60550
Original commit changeset: ed655497a981
Whatever gcc version OSS Bazel uses wasn't happy move-constructing the
SimpleIREvaluator, so use a unique_ptr instead.
Test Plan:
CI. Hope that the gcc version used by the OSS Bazel build is
happier with this (it should be), since actually testing it locally is
an intractable pain.
Reviewed By: navahgar
Differential Revision: D29333116
fbshipit-source-id: c3e4b5d8c91eb96a43ae5315a01ca0c0f4d4a99d
Summary:
* Open the JSON config file safely using a context manager (a `with` block).
* This makes sure that the file is closed even if an exception is raised.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58077
Reviewed By: anjali411
Differential Revision: D28711177
Pulled By: H-Huang
fbshipit-source-id: 597ba578311b1f1d6706e487872db4e784c78c3c