Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.
This PR takes a simpler but riskier approach: just take the original function and change its ints to SymInts.
This is BC-breaking in the following ways:
* The C++ API for registering implementations for aten operators changes from int64_t to SymInt whenever a function's schema is switched over. Code-generated registrations in PyTorch do not change, as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA; see the companion PR https://github.com/pytorch/xla/pull/3914. Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and there are adjustments for this.
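A minimal sketch of what this means for a manual registration. The kernel name is made up, and the SymInt-to-int conversion helper shown (`expect_int()`) is an assumption about the conversion API, not something this PR prescribes:
```
// Before (commented out): the backend kernel was registered taking int64_t.
// at::Tensor my_backend_narrow(const at::Tensor& self, int64_t dim,
//                              int64_t start, int64_t length);

// After: symint'ed arguments arrive as c10::SymInt, and a backend that only
// handles concrete sizes has to convert them back itself.
at::Tensor my_backend_narrow(
    const at::Tensor& self,
    int64_t dim,
    c10::SymInt start,
    c10::SymInt length) {
  // Conversion helper name is an assumption; the point is that the backend
  // must materialize concrete integers (erroring if the value is symbolic).
  const int64_t start_ = start.expect_int();
  const int64_t length_ = length.expect_int();
  return self.narrow(dim, start_, length_);  // delegate to an int-only path
}
```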
This is not BC-breaking in the following ways:
* The user-facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead (a usage sketch follows this list). This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints, so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.
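Returning to the first bullet above, a minimal usage sketch of `at::empty` vs. `at::empty_symint`. Constructing SymInts from integer literals is only for illustration (in practice they come from tracing), and the TensorOptions-style `empty_symint` signature shown is an assumption based on the generated C++ API:
```
// Existing call sites keep compiling: the default binding still takes ints.
at::Tensor a = at::empty({2, 3}, at::kFloat);

// Passing symbolic sizes requires the explicit *_symint entry point.
std::vector<c10::SymInt> sizes = {c10::SymInt(2), c10::SymInt(3)};
at::Tensor b = at::empty_symint(c10::SymIntArrayRef(sizes), at::kFloat);
```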
Structure of the PR:
* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, we will sometimes treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, controlled by a `symint` kwarg that I added; I then audited all call sites to decide which variant each one wanted. Here are some of the major places where we pick one or the other:
* The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
* When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
* In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
* In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I needed to find alternative APIs for people who were directly calling these functions. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I changed how the unboxing logic works slightly. Previously, we interpreted the C++ type for Layout/etc. directly as the IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept both SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload).
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy`, which didn't actually support SymInts; the SymInts were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
Per many C++ code-style guides (for [example](https://google.github.io/styleguide/cppguide.html#Enumerator_Names)), members of an `enum` should be CamelCased, and only defines should be ALL_CAPS.
Changes the `MemOverlap`, `MemOverlapStatus`, and `CmpEvalResult` enum values accordingly.
Also, `YES`, `NO`, `TRUE`, and `FALSE` are often system defines.
Fixes, among other things, the current iOS build regression, which manifests as follows (see [this](6e90572bb9)):
```
/Users/runner/work/pytorch/pytorch/aten/src/ATen/MemoryOverlap.h:19:29: error: expected identifier
enum class MemOverlap { NO, YES, TOO_HARD };
^
/Applications/Xcode_12.4.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator14.4.sdk/usr/include/objc/objc.h:89:13: note: expanded from macro 'YES'
#define YES __objc_yes
```
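A minimal sketch of the shape of the rename (illustrative, not an exhaustive list of the renamed enumerators):
```
// Before: ALL_CAPS enumerators collide with system macros such as YES/NO.
// enum class MemOverlap { NO, YES, TOO_HARD };

// After: CamelCase enumerators sidestep the macro collision.
enum class MemOverlap { No, Yes, TooHard };
```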
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79772
Approved by: https://github.com/drisspg, https://github.com/kulinseth
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78297
Clone following expand/expand_as crashes due to the memory-overlap check in the native copy_ method. Refer to T118519310 for more details.
Crashing test case:
a = tensor(3,1) // strides = (1,1)
b = tensor(3,2) // strides = (2,1)
temp = a.expand_as(b) // creates temp with shape (3,2) and strides (1,0)
temp.clone() // crashes on copy_ due to memory overlap
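The same sequence sketched with eager ATen calls (illustrative only; the crash itself surfaces through Static Runtime's out-variant clone, which copies into a preallocated output):
```
// Sketch of the tensors involved.
at::Tensor a = at::randn({3, 1});   // strides (1, 1)
at::Tensor b = at::randn({3, 2});   // strides (2, 1)
at::Tensor temp = a.expand_as(b);   // shape (3, 2), strides (1, 0)
at::Tensor c = temp.clone();        // the out-variant copy_ path trips the overlap check here
```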
Fix: Disable the out variant for the expanded tensor.
- Calls native clone instead of out variant for clone dealing with expanded tensors
- Added test case for both clone variants (out and native clones)
- Increased the tensor size for memory planner test case to trigger dynamic allocation
Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Differential Revision: D36672180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78322
Approved by: https://github.com/mikeiovine
Summary: Previously the op was auto-generated, but it only covered the pointwise overload of aten::max. This adds support for the reduction overloads, both overall and along a dim.
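For reference, a minimal sketch of the three aten::max flavors the out variant now needs to handle:
```
at::Tensor a = at::randn({4, 5});
at::Tensor b = at::randn({4, 5});
at::Tensor pointwise = at::max(a, b);            // elementwise max
at::Tensor overall = at::max(a);                 // reduce to a single value
auto [values, indices] = at::max(a, /*dim=*/1);  // reduce along a dimension
```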
Test Plan: Added a unit test
Differential Revision: D36656378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78271
Approved by: https://github.com/mikeiovine
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76473
Avoid some extra heap allocations by using DimVector
ghstack-source-id: 155569314
Test Plan: Existing unit tests
Reviewed By: navahgar, huiguoo
Differential Revision: D35972439
fbshipit-source-id: 971998d6bcaaf9bb598772f1e2ca6b13f29f92a4
(cherry picked from commit f2b70c38fffe6355cd8b2f0eb36f299c0d50e5d8)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993
Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
More details in https://fb.quip.com/MKumAjz1YD4a#temp:C:FPD3e5a0871ae5d481286b511ef7
The last 3 outputs of embedding_bag are unused in the graph: P495814049. (The op's outputs are sketched right after this list.)
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.
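For orientation, a minimal sketch of the four outputs referred to above, using an eager ATen call with default arguments (input values are illustrative only):
```
at::Tensor weight = at::randn({10, 4});
at::Tensor indices = at::arange(6, at::kLong);        // 6 lookups
at::Tensor offsets = at::arange(0, 6, 3, at::kLong);  // 2 bags of 3
// aten::embedding_bag returns four tensors; only the first is consumed
// by the graph discussed here.
auto [output, offset2bag, bag_size, max_indices] =
    at::embedding_bag(weight, indices, offsets);
```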
Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`
Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`
Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
0.11156 ms. 99.0457%. aten::embedding_bag (52 nodes, out variant)
After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
0.0789221 ms. 98.7096%. static_runtime::embedding_bag (52 nodes, out variant)
* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/
Reviewed By: mikeiovine
Differential Revision: D35726594
fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255
This change fixes a bug where `aten::full_like` reuses a previously allocated tensor that does not match the requested one when the arguments to `aten::full_like` change dynamically.
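A minimal sketch of the guard this kind of fix needs. The helper name, parameters, and exact conditions are assumptions for illustration, not the actual Static Runtime code:
```
// Hypothetical helper: only reuse the cached output if it still matches
// what the current arguments request; otherwise allocate a fresh tensor.
at::Tensor& full_like_out_reuse(
    at::Tensor& out,
    const at::Tensor& self,
    const at::Scalar& fill_value,
    at::ScalarType requested_dtype) {
  if (!out.defined() || out.scalar_type() != requested_dtype ||
      !out.sizes().equals(self.sizes())) {
    out = at::empty_like(self, self.options().dtype(requested_dtype));
  }
  out.fill_(fill_value);  // the fill value may also change between iterations
  return out;
}
```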
Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path.
Reviewed By: mikeiovine
Differential Revision: D34863639
fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608
(cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)
Summary:
The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter.
At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear whether that semantics is relied upon elsewhere in this code. So we are handling this in SR's out-variant implementation until the owner fixes that issue.
This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model.
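A minimal sketch of the workaround described above. The linear helper is a hypothetical stand-in (the real call goes through PackedLinearWeightFp16::apply_dynamic); the point is only the clamp-after-matmul:
```
// Hypothetical stand-in for the packed fp16 dynamic linear kernel.
at::Tensor run_fp16_dynamic_linear(const at::Tensor& input, const at::Tensor& packed_weight);

at::Tensor fp16_linear_maybe_relu(
    const at::Tensor& input,
    const at::Tensor& packed_weight,
    bool relu_fused) {
  at::Tensor y = run_fp16_dynamic_linear(input, packed_weight);
  if (relu_fused) {
    y.relu_();  // apply the activation the kernel itself ignores
  }
  return y;
}
```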
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16
```
Reviewed By: mikeiovine
Differential Revision: D35366309
fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9
(cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606
The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116
Test Plan: Existing unit tests
Reviewed By: hlu1
Differential Revision: D34562131
fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990
This change fixes a bug where `aten::full` reuses a previously allocated tensor that does not match the requested one when the arguments to `aten::full` change dynamically.
The same fix applies to multiple other out-variant wrappers added to Static Runtime; their fixes follow in subsequent changes.
Test Plan: - Added a unittest.
Reviewed By: mikeiovine
Differential Revision: D34768718
fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9
(cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945
This change adds an out variant wrapper for aten::ones_like.
Test Plan:
- Added a unittest.
- Checked that the op execution got switched to its added out variant (P485330978).
Reviewed By: hlu1
Differential Revision: D34727057
fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5
(cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73946
This change adds an out variant wrapper for aten::zeros.
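For context, a rough sketch of what such a wrapper looks like. The registration macro and ProcessedNode accessors follow Static Runtime's usual pattern, but the body here is illustrative (schema check and dtype/layout handling omitted), not the code added by this change:
```
REGISTER_OPERATOR_FUNCTOR(aten::zeros, aten_zeros, [](Node* n) -> SROperator {
  return [](ProcessedNode* p_node) {
    const auto size = p_node->Input(0).toDimVector();
    if (p_node->Output(0).isNone()) {
      // First run: allocate the output tensor.
      p_node->Output(0) = at::zeros(size);
    } else {
      // Later runs: reuse the memory-planned output.
      at::Tensor out = p_node->Output(0).toTensor();
      out.resize_(size);
      out.zero_();
    }
  };
});
```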
Test Plan:
- Added a unittest.
- Confirmed that the added out variant gets executed by the unittest (P485324923).
Reviewed By: mikeiovine
Differential Revision: D34725843
fbshipit-source-id: 3ac02ba1914c4a51969381e610d4243df65071ed
(cherry picked from commit 368836d51709b7f96c79114984a95606b29766b1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73704
Empty inputs are invalid for these ops. But while looking for optimizations, I noticed that these ops just segfault when that happens, which is not helpful for users. Added a check/error message.
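A minimal sketch of the kind of guard added (the variable name and exact message wording are assumptions):
```
// Fail with an actionable error instead of segfaulting on empty input lists.
TORCH_CHECK(
    !inputs.empty(),
    "Expected a non-empty list of input tensors");
```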
ghstack-source-id: 150812721
Test Plan: New unit tests
Reviewed By: hlu1
Differential Revision: D34596954
fbshipit-source-id: 6b22a3a255273920210dcd41f54a9d238bbbcc14
(cherry picked from commit 9e950bfffef36c320638662bdb72f19eb805a228)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73851
This change adds an out variant wrapper for aten::ones
Test Plan: Added a unittest
Reviewed By: mikeiovine
Differential Revision: D34557095
fbshipit-source-id: 0d2ac8d0ad6f73067e28c2cebd3b4a018a9b17ae
(cherry picked from commit cc1dda957b8c3acd71de3aa6054c11a9aab5cfa6)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450
This change uses `SROperator` for operators' function type
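For reference, `SROperator` is Static Runtime's alias for the per-node execution function, sketched here from memory:
```
// Static Runtime executes each ProcessedNode through a function of this shape.
using SROperator = std::function<void(ProcessedNode*)>;
```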
Test Plan: N/A
Reviewed By: mikeiovine
Differential Revision: D34483246
fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f
(cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69029
This change optimizes the execution of `prim::TupleConstruct` & `prim::ListConstruct` by performing case analysis at op loading time, not op execution time.
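A minimal sketch of the idea (the helper name is hypothetical; the point is that the branch on input arity happens once, when the SROperator is built, and the returned lambda is already specialized):
```
SROperator make_tuple_construct_op(Node* n) {   // hypothetical helper
  const size_t num_inputs = n->inputs().size(); // known at load time
  if (num_inputs == 2) {
    // Specialized lambda: no per-execution branching or vector build-up.
    return [](ProcessedNode* p) {
      p->Output(0) = c10::ivalue::Tuple::create(p->Input(0), p->Input(1));
    };
  }
  // Generic fallback, still chosen once at load time.
  return [num_inputs](ProcessedNode* p) {
    std::vector<c10::IValue> elems;
    elems.reserve(num_inputs);
    for (size_t i = 0; i < num_inputs; ++i) {
      elems.push_back(p->Input(i));
    }
    p->Output(0) = c10::ivalue::Tuple::create(std::move(elems));
  };
}
```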
Test Plan:
- Existing unittests
- Ran inline_cvr nets via ptvsc2_predictor_bench with compare_result=1
Reviewed By: swolchok
Differential Revision: D32518670
fbshipit-source-id: 575b29b06eadf77ba9f1be306119fa194d4f21bf
(cherry picked from commit 88cc2253b927267cad33063284e9cc66e0d31e2f)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71573
Many ops (`gather_ranges_to_dense`, `sigrid_transforms`, etc) are implemented like this:
```
void op_out_(std::vector<Tensor>& outputs) {
  // actual op implementation writes into `outputs`
}

std::vector<Tensor> op() {
  std::vector<Tensor> outputs;
  // populate outputs with empty tensors
  op_out_(outputs);
  return outputs;
}
```
This pattern is not ideal for ops that are fused with `ListUnpack` - it would be better if we wrote to the outputs directly.
This diff extends the ideas from `VarStackNodeWrapper` to allow for this. The changes are:
* `s/VarStackNodeWrapper/ProcessedNodeInputWrapper`. The old name was bad because the class is more general than the `VarStack` use case. Also moved the class to `processed_node_wrapper.h`
* Add a `ProcessedNodeOutputWrapper`; it's essentially the same idea as `ProcessedNodeInputWrapper`, but it allows non-const access to the underlying tensors.
* These classes are very similar, so CRTP is used to facilitate code re-use
ghstack-source-id: 148825800
Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`
Reviewed By: swolchok
Differential Revision: D33687965
fbshipit-source-id: 5fa0107211116867bb2b63968c126550d32fbea6
(cherry picked from commit 75c263d960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247
Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
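A minimal before/after sketch of the pattern (`size_ivalue` stands for a node's shape argument and is illustrative):
```
// Before: materializes a heap-allocated std::vector<int64_t> just to hold a shape.
std::vector<int64_t> sizes_vec = size_ivalue.toIntVector();
at::Tensor a = at::empty(sizes_vec);

// After: DimVector is a SmallVector sized for typical tensor ranks,
// so small shapes stay on the stack.
at::DimVector sizes = size_ivalue.toDimVector();
at::Tensor b = at::empty(sizes);
```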
ghstack-source-id: 146933314
Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good?
Reviewed By: mikeiovine
Differential Revision: D33556198
fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70508
We can create the code object at compile time instead of at runtime to speed it up. This also makes the compilation cache unnecessary. TODO: figure out if there's a way to cache the InterpreterState object
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33458648
Pulled By: eellison
fbshipit-source-id: 710389741e7c6210528f2f96ab496fcd533d942a