Commit Graph

234 Commits

Author SHA1 Message Date
PyTorch MergeBot
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
Edward Z. Yang
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but riskier approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user-facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., `at::empty(IntArrayRef, ...)`).  To call with SymInts, you must call `at::empty_symint` instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, we will sometimes treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to either c10::SymInt or int64_t. The choice is controlled by a `symint` kwarg, which I added; I then audited all call sites to decide which translation I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with `cloneWithRealTypes` before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I needed to find alternative APIs for callers that invoked these functions directly. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I changed how unboxing logic works slightly. Previously, we interpreted the C++ type for Layout/etc directly as the IntType JIT type, which worked well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00
PyTorch MergeBot
a7edf71360 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 8fae7027b3.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222
2022-08-25 00:49:40 +00:00
Edward Z. Yang
8fae7027b3 Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but riskier approach: just take the original function and change its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user-facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., `at::empty(IntArrayRef, ...)`).  To call with SymInts, you must call `at::empty_symint` instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, we will sometimes treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to either c10::SymInt or int64_t. The choice is controlled by a `symint` kwarg, which I added; I then audited all call sites to decide which translation I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with `cloneWithRealTypes` before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I needed to find alternative APIs for callers that invoked these functions directly. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I changed how unboxing logic works slightly. Previously, we interpreted the C++ type for Layout/etc directly as the IntType JIT type, which worked well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-23 22:04:07 +00:00
Max Podkorytov
68d2d7866d [static-runtime] change the backend for permute_copy (#83532)
Summary: Testing wrappable dims

Differential Revision: D38717563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83532
Approved by: https://github.com/mikeiovine
2022-08-17 18:10:36 +00:00
Kurt Mohler
2bfae07a79 Enable dim=None for torch.mean (#81286)
Part of #79525

This will require coordination with XLA before merging, just like #79881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81286
Approved by: https://github.com/albanD
2022-07-28 22:34:56 +00:00
Will Constable
4f34cd6d1e Replace all CHECK_ and DCHECK_ with TORCH_* macros (#82032)
Avoid exposing defines that conflict with google logging, since this blocks external usage of libtorch in certain cases.

All the 'interesting' changes should be in these two files, and the rest should just be mechanical changes via sed.
c10/util/logging_is_not_google_glog.h
c10/util/logging_is_google_glog.h

Fixes https://github.com/pytorch/pytorch/issues/81415

cc @miladm @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82032
Approved by: https://github.com/soumith, https://github.com/miladm
2022-07-26 01:20:44 +00:00
Kurt Mohler
23bdb570cf Reland: Enable dim=None for torch.sum (#79881)
Part of #29137

Reland of #75845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79881
Approved by: https://github.com/albanD, https://github.com/kulinseth
2022-07-09 00:54:42 +00:00
Max Podkorytov
bf75708ce4 [static-runtime] add nnc codegen for aten::div (#76903)
Differential Revision: D36151087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76903
Approved by: https://github.com/mikeiovine
2022-06-22 05:47:44 +00:00
Nikita Shulga
4a4890cfb2 [BE] Use CamelCase for enum class members (#79772)
Per many C++ code-style guides (for [example](https://google.github.io/styleguide/cppguide.html#Enumerator_Names)), members of `enum` should be CamelCased,
and only defines should be ALL_CAPS

Changes `MemOverlap`, `MemOverlapStatus` and `CmpEvalResult` enum values

Also, `YES`, `NO`, `TRUE` and `FALSE` are often system defines

Fixes, among other things, the current iOS build regression, which manifests as follows (see [this](6e90572bb9)):
```
/Users/runner/work/pytorch/pytorch/aten/src/ATen/MemoryOverlap.h:19:29: error: expected identifier
enum class MemOverlap { NO, YES, TOO_HARD };
                            ^
/Applications/Xcode_12.4.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator14.4.sdk/usr/include/objc/objc.h:89:13: note: expanded from macro 'YES'
#define YES __objc_yes
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79772
Approved by: https://github.com/drisspg, https://github.com/kulinseth
2022-06-17 05:53:57 +00:00
PyTorch MergeBot
ee6ebfc06b Revert "Enable dim=None for torch.sum (#75845)"
This reverts commit e79a51f7db.

Reverted https://github.com/pytorch/pytorch/pull/75845 on behalf of https://github.com/malfet due to Breaks MacOS builds, see e79a51f7db
2022-06-16 22:01:41 +00:00
Kurt Mohler
e79a51f7db Enable dim=None for torch.sum (#75845)
Part of #29137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75845
Approved by: https://github.com/ezyang
2022-06-16 20:17:07 +00:00
Michael Andreas Dagitses
606b234336 turn on -Werror=unused-function in our Bazel CPU build
Summary:
We also fix any existing issues. Note that we only do this for the CPU
build because nvcc is considered a C++ toolchain but it does not have
the same flag support. Adding flags to the GPU build will cause nvcc
errors.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79154

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-10 22:11:54 +00:00
PyTorch MergeBot
bcd7a20953 Revert "turn on -Werror=unused-function in our Bazel CPU build"
This reverts commit 67d313a032.

Reverted https://github.com/pytorch/pytorch/pull/79154 on behalf of https://github.com/malfet due to Breaks bazel build: 67d313a032
2022-06-10 20:43:03 +00:00
Michael Andreas Dagitses
67d313a032 turn on -Werror=unused-function in our Bazel CPU build
Summary:
We also fix any existing issues. Note that we only do this for the CPU
build because nvcc is considered a C++ toolchain but it does not have
the same flag support. Adding flags to the GPU build will cause nvcc
errors.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79154

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-10 18:30:08 +00:00
Brian Hirsh
7b3a0ff87a Port index.Tensor to structured kernels.
Tracking issue: #55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69607

Approved by: https://github.com/bdhirsh
2022-06-10 17:27:47 +00:00
PyTorch MergeBot
4b82ef7928 Revert "Port index.Tensor to structured kernels."
This reverts commit cfd84125bd.

Reverted https://github.com/pytorch/pytorch/pull/69607 on behalf of https://github.com/zengk95 due to This is breaking mac trunk tests cfd84125bd
2022-06-08 20:16:10 +00:00
Brian Hirsh
cfd84125bd Port index.Tensor to structured kernels.
Tracking issue: #55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69607

Approved by: https://github.com/bdhirsh
2022-06-08 18:17:52 +00:00
Akshay Parashar
28f87b9cf9 [Static Runtime] Fix aten::clone out variant (#78297) (#78322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78297

Cloning a tensor produced by expand/expand_as crashes due to the memoryOverlap check in the copy_ native method. Refer to T118519310 for more details.

Crashing test case:
a = tensor(3,1)          // strides = (1,1)
b = tensor(3,2)          // strides = (2,1)
temp = a.expand_as(b)    // creates temp with shape (3,2) and strides (1,0)
temp.clone()             // crashes on copy_ due to memoryOverlap

Fix: Disable the out variant for the expanded tensor.
- Calls native clone instead of out variant for clone dealing with expanded tensors
- Added test case for both clone variants (out and native clones)
- Increased the tensor size for memory planner test case to trigger dynamic allocation
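
The stride arithmetic behind the crashing case can be modeled without PyTorch. The following is a minimal pure-Python sketch, not the actual `copy_` check: the helper names `expand_strides` and `has_zero_stride_overlap` are hypothetical, and the overlap test is simplified to "some dimension with more than one element has stride 0".

```python
def expand_strides(shape, strides, target_shape):
    """Strides of a view that broadcasts `shape` to `target_shape` (same rank)."""
    out = []
    for size, stride, target in zip(shape, strides, target_shape):
        if size == target:
            out.append(stride)
        elif size == 1:
            out.append(0)  # broadcast dim: every index maps to the same element
        else:
            raise ValueError(f"cannot expand size {size} to {target}")
    return tuple(out)


def has_zero_stride_overlap(shape, strides):
    """Simplified overlap test: a dim with more than one element and stride 0."""
    return any(size > 1 and stride == 0 for size, stride in zip(shape, strides))


# Mirrors the crashing test case: a has shape (3,1), b has shape (3,2).
a_shape, a_strides = (3, 1), (1, 1)
b_shape = (3, 2)
temp_strides = expand_strides(a_shape, a_strides, b_shape)
temp_overlaps = has_zero_stride_overlap(b_shape, temp_strides)
```

Under these assumptions the expanded view is flagged as overlapping, while a contiguous (3,2) tensor with strides (2,1) is not, which is why writing into the expanded tensor through an out variant is unsafe.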

Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators

buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Differential Revision: D36672180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78322
Approved by: https://github.com/mikeiovine
2022-06-02 21:06:59 +00:00
Max Podkorytov
ebfc70f37a [static-runtime] out variant for aten::mean (#78161)
Summary: As subject

Test Plan: Added unit tests

Differential Revision: D36614633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78161
Approved by: https://github.com/mikeiovine
2022-06-02 20:56:42 +00:00
PyTorch MergeBot
fca1f495c2 Revert "Port index.Tensor to structured kernels."
This reverts commit 9fe6f1baf5.

Reverted https://github.com/pytorch/pytorch/pull/69607 on behalf of https://github.com/suo due to this broke master, see: 9fe6f1baf5
2022-06-01 00:12:15 +00:00
Brian Hirsh
9fe6f1baf5 Port index.Tensor to structured kernels.
Tracking issue: #55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69607

Approved by: https://github.com/bdhirsh
2022-05-31 22:15:20 +00:00
Max Podkorytov
2679755bdc [static-runtime] out variant for aten::max (#78271)
Summary: Previously the op was auto-generated but it only covered the pointwise overload of aten::max. This adds support for reduction, overall and along a dim

Test Plan: Added a unit test

Differential Revision: D36656378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78271
Approved by: https://github.com/mikeiovine
2022-05-26 23:29:27 +00:00
mikeiovine
56c23f5633 [SR] Out variant for embedding_bag_byte_unpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77661

Add an out variant and wrapper in static runtime.

I just added the declaration with the others in `qembeddingbag.h` for now (rather than properly adding the out variant to the torch library). This can be fixed in a followup.

Differential Revision: [D36449840](https://our.internmc.facebook.com/intern/diff/D36449840/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36449840/)!

Approved by: https://github.com/tenpercent
2022-05-25 23:24:11 +00:00
Natalia Gimelshein
225b037df8 port clamp.Tensor to structured (#77149)
Per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77149
Approved by: https://github.com/ezyang
2022-05-11 21:00:02 +00:00
Max Podkorytov
a41d4f27d7 [static-runtime] refactor out variant for aten::embedding_bag (#76207)
Differential Revision: D35767504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76207
Approved by: https://github.com/mikeiovine
2022-05-11 18:29:18 +00:00
mikeiovine
02713221e3 [SR] Fuse clamp/nan_to_num
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094

Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor.
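
As a rough scalar illustration of why the fusion saves comparisons (a sketch only, not the NNC kernel; the function name `clamp_nan_to_num` and the default NaN replacement of 0.0 are assumptions): once a value has been clamped to finite bounds it can no longer be infinite, so the only special case left for `nan_to_num` is NaN.

```python
import math


def clamp_nan_to_num(x, lo, hi, nan=0.0):
    """Scalar model of nan_to_num(clamp(x, lo, hi)), assuming finite lo and hi."""
    if math.isnan(x):
        return nan  # NaN passes through clamp, so it still needs replacing
    # After clamping to finite bounds, +inf and -inf cannot survive,
    # so no separate infinity checks are required.
    return min(max(x, lo), hi)


samples = [float("inf"), float("-inf"), float("nan"), 0.5]
fused = [clamp_nan_to_num(v, -1.0, 1.0) for v in samples]
```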

Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)!

Approved by: https://github.com/navahgar
2022-05-10 23:33:59 +00:00
Mike Iovine
9e32cdeda6 [SR] Use at::DimVector in reshape_copy_out (#76473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76473

Avoid some extra heap allocations by using DimVector
ghstack-source-id: 155569314

Test Plan: Existing unit tests

Reviewed By: navahgar, huiguoo

Differential Revision: D35972439

fbshipit-source-id: 971998d6bcaaf9bb598772f1e2ca6b13f29f92a4
(cherry picked from commit f2b70c38fffe6355cd8b2f0eb36f299c0d50e5d8)
2022-05-05 17:31:54 +00:00
Natalia Gimelshein
122798916f Port clamp_min and clamp_max to structured
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76874
Approved by: https://github.com/bdhirsh
2022-05-05 15:52:20 +00:00
Mike Iovine
cac2733af1 [SR] Codegen for aten::clamp (#76340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76340

NNC kernel for `clamp` scalar case
ghstack-source-id: 155466507

Reviewed By: navahgar, huiguoo

Differential Revision: D35904019

fbshipit-source-id: e4115757f7e2cbdf364b88be3f599dfc3028750f
(cherry picked from commit bdc4b918bc5a14490f46c79793f764b28c18388f)
2022-05-04 23:08:49 +00:00
Ansha Yu
ee636e2fd1 [sr] remove max_indices argument of embedding_bag when unnecessary (#75993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993

Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
{F723683014}

More details in https://fb.quip.com/MKumAjz1YD4 (1f47a80e88)a#temp:C:FPD3 (ecd5567980)e5a0871ae5d481286b511ef7

The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.

Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`

Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`

Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
        0.11156 ms.    99.0457%. aten::embedding_bag (52 nodes, out variant)

After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
      0.0789221 ms.    98.7096%. static_runtime::embedding_bag (52 nodes, out variant)

* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/

Reviewed By: mikeiovine

Differential Revision: D35726594

fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
2022-04-22 15:36:35 +00:00
Yukio Siraichi
22a10ce513 Port cat kernel to structured kernels.
Tracking issue: #55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68640

Approved by: https://github.com/ezyang
2022-04-14 17:49:43 +00:00
Don Jang
85e163c56b [Static Runtime] Fix a bug that aten::full_like reuses a tensor that does not match arguments (#74255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255

This change fixes a bug where `aten::full_like` reuses a previously allocated tensor that does not match the requested one when the arguments to `aten::full_like` change dynamically.

Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path.

Reviewed By: mikeiovine

Differential Revision: D34863639

fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608
(cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)
2022-04-05 22:30:41 +00:00
Raghavan Raman
60bda4d06b [Static Runtime] Fix handling relu in quantized linear relu dynamic op
Summary:
The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter.

At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear if that semantics is being followed in this code. So, we are handling this in SR's out-variant implementation, until the owner fixes that issue.

This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model.

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16
```

Reviewed By: mikeiovine

Differential Revision: D35366309

fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9
(cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)
2022-04-04 22:16:22 +00:00
Mike Iovine
688039859f [PyTorch][Static Runtime] out variant for where.self (#73438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73438

Add out variant for `where.self`; requires PyTorch core changes as no out variant existed previously
ghstack-source-id: 151505601

Test Plan:
* Existing `where` tests in static runtime pass
* CI for core `where` tests

Reviewed By: hlu1

Differential Revision: D34469785

fbshipit-source-id: 8a4ebbf38b2364534fbf43812bfcfdf69ea174b3
(cherry picked from commit d3faf61f408a385d67b5b821dfaf084a8e713f30)
2022-03-17 00:14:11 +00:00
Don Jang
6294a2eb7f [Static Runtime] Add out variant wrapper for aten::index_select (#74321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74321

This change adds out variant wrapper for aten::index_select.

Test Plan: Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34928012

fbshipit-source-id: d808363d740d79fa25abee4dd33920fbb6ec7283
(cherry picked from commit ba9b3c0cd4ba240c4a2174f3376580a1880b2b4a)
2022-03-16 23:43:21 +00:00
Mike Iovine
f14a0be302 [SR] Avoid allocating rstd/mean in layer_norm (#73606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606

The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116

Test Plan: Existing unit tests

Reviewed By: hlu1

Differential Revision: D34562131

fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
2022-03-15 22:07:11 +00:00
Don Jang
381c0c080f [Static Runtime] Fix a bug that aten::full reuses a tensor that does not match requested one (#73990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990

This change fixes a bug where `aten::full` reuses a previously allocated tensor that does not match the requested one when the arguments to `aten::full` change dynamically.

The same fix applies to several other out variant wrappers added to Static Runtime; those fixes will follow.
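
The idea of the fix can be sketched in pure Python. This is a hypothetical model, not the Static Runtime code: `FakeTensor` and `full_out` are illustrative names, and only the dtype is compared here, whereas the real wrapper may check more of the requested options.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    # Hypothetical stand-in for a tensor: only the metadata relevant here.
    shape: tuple
    dtype: str
    fill: float


def full_out(cached, shape, fill, dtype):
    """Reuse `cached` only when it matches the request; otherwise allocate."""
    if cached is None or cached.dtype != dtype:
        # The previously allocated output does not match: make a fresh one.
        return FakeTensor(tuple(shape), dtype, fill)
    cached.shape = tuple(shape)  # models resizing the existing output
    cached.fill = fill
    return cached


first = full_out(None, (2,), 1.0, "float32")
second = full_out(first, (3,), 2.0, "float32")  # same dtype: reused in place
third = full_out(second, (3,), 2.0, "int64")    # dtype changed: reallocated
```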

Test Plan: - Added a unittest.

Reviewed By: mikeiovine

Differential Revision: D34768718

fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9
(cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)
2022-03-15 16:13:52 +00:00
Don Jang
1b80f609b0 [Static Runtime] Add out variant wrapper for aten::ones_like (#73945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945

This change adds an out variant wrapper for aten::ones_like.

Test Plan:
- Added a unittest.
- Checked that the op execution got switched to its added out variant (P485330978).

Reviewed By: hlu1

Differential Revision: D34727057

fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5
(cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)
2022-03-14 20:29:58 +00:00
Don Jang
60f22a40ef [Static Runtime] Add out variant wrapper for aten::zeros (#73946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73946

This change adds an out variant wrapper for aten::zeros.

Test Plan:
- Added a unittest.

- Confirmed that the added out variant gets executed by the unittest (P485324923).

Reviewed By: mikeiovine

Differential Revision: D34725843

fbshipit-source-id: 3ac02ba1914c4a51969381e610d4243df65071ed
(cherry picked from commit 368836d51709b7f96c79114984a95606b29766b1)
2022-03-11 00:52:30 +00:00
Mike Iovine
97b20b9b50 [SR][easy] Stack/concat out variants do not segfault on empty inputs (#73704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73704

Empty inputs are invalid for these ops. But while looking for optimizations, I noticed that these ops just segfault when that happens, which is not helpful for users. Added a check/error message.
ghstack-source-id: 150812721
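
A minimal sketch of the kind of up-front check described above, in plain Python. `stack_rows` is a hypothetical helper, and the error messages are illustrative rather than the ones added in this commit.

```python
from typing import List, Sequence


def stack_rows(rows: Sequence[List[float]]) -> List[List[float]]:
    """Hypothetical stack: combine equal-length rows into a 2-D list."""
    # Validate first, so an empty input produces a clear error
    # instead of failing somewhere deeper in the implementation.
    if len(rows) == 0:
        raise ValueError("stack expects a non-empty list of inputs")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("stack expects all inputs to have the same size")
    return [list(row) for row in rows]


stacked = stack_rows([[1.0, 2.0], [3.0, 4.0]])

try:
    stack_rows([])
    empty_error = None
except ValueError as exc:
    empty_error = str(exc)
```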

Test Plan: New unit tests

Reviewed By: hlu1

Differential Revision: D34596954

fbshipit-source-id: 6b22a3a255273920210dcd41f54a9d238bbbcc14
(cherry picked from commit 9e950bfffef36c320638662bdb72f19eb805a228)
2022-03-09 00:55:57 +00:00
Don Jang
71961d37bb [Static Runtime] Add out variant wrapper for aten::ones (#73851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73851

This change adds an out variant wrapper for aten::ones.

Test Plan: Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34557095

fbshipit-source-id: 0d2ac8d0ad6f73067e28c2cebd3b4a018a9b17ae
(cherry picked from commit cc1dda957b8c3acd71de3aa6054c11a9aab5cfa6)
2022-03-07 20:33:22 +00:00
Don Jang
c62de0ac15 [Static Runtime] [Code Cleanup] Use SROperator for operators' function type (#73450)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73450

This change uses `SROperator` for operators' function type.

Test Plan: N/A

Reviewed By: mikeiovine

Differential Revision: D34483246

fbshipit-source-id: ed544bb91b676ed08983dc8dc78cedd0f77d499f
(cherry picked from commit eb9de3ad8de043990c02f30ffa48a29c8e5e81f2)
2022-03-01 02:30:48 +00:00
Mike Iovine
d398d4d32c [SR] Disable aten::where out variant (#73367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73367

The op is currently bugged w.r.t. a `condition` input that is not the same shape as the others:

```
def forward(self, cond_1d, x, y):
    shape = [-1] + [1] * (x.dim() - 1)
    cond = cond_1d.view(shape)
    return torch.where(cond, x, y).clone()

Condition:
01
00
[ CPUBoolType{2} ]

A:
06 -9
08 -8
[ CPULongType{2,2} ]

B:
-4 05
-5 -2
[ CPULongType{2,2} ]

Actual:
06 05
-5 -2
[ CPULongType{2,2} ]

Expected:
06 -9
-5 -2
[ CPULongType{2,2} ]
```
ghstack-source-id: 149963254
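
The expected result in the example above can be reproduced with a small pure-Python sketch of the broadcasting rule. `where_broadcast_rows` is a hypothetical helper that handles only this one case: a per-row condition (shape `[rows, 1]` after the `view`) broadcast across the columns of two 2-D inputs.

```python
from typing import List


def where_broadcast_rows(cond: List[bool], x: List[List[int]], y: List[List[int]]) -> List[List[int]]:
    """Select x[i][j] where cond[i] is true, else y[i][j].

    cond has one entry per row, so it behaves like a tensor of shape
    [rows, 1] broadcast across every column of x and y.
    """
    if not (len(cond) == len(x) == len(y)):
        raise ValueError("cond, x and y must have the same number of rows")
    return [
        [x_val if c else y_val for x_val, y_val in zip(x_row, y_row)]
        for c, x_row, y_row in zip(cond, x, y)
    ]


cond = [True, False]
a = [[6, -9], [8, -8]]
b = [[-4, 5], [-5, -2]]
result = where_broadcast_rows(cond, a, b)
```

Here the whole first row comes from `A` and the whole second row from `B`, which matches the "Expected" tensor; the "Actual" output takes only the first element of row 0 from `A`.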

Test Plan: Unit tests exercise broadcasting

Reviewed By: d1jang

Differential Revision: D34454770

fbshipit-source-id: 6ad4c4ca6893d2b87852a17d437437d99ca94ab4
(cherry picked from commit 7135bc40e9fd930c08f5291b7d6b4902ec30005b)
2022-02-26 01:08:45 +00:00
Don Jang
5772b1afbc [Static Runtime] Avoid checks during op execution for TupleConstruct & ListConstruct (#69029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69029

This change optimizes the execution of `prim::TupleConstruct` & `prim::ListConstruct` by performing case analysis at op loading time, not op execution time.
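
The idea of moving case analysis from execution time to load time can be sketched as follows. This is a plain-Python illustration under assumptions: `load_tuple_construct` is a hypothetical name, and branching on the number of inputs is only one plausible choice of cases.

```python
from typing import Any, Callable, List, Tuple


def load_tuple_construct(num_inputs: int) -> Callable[[List[Any]], Tuple[Any, ...]]:
    """Pick a specialised implementation once, when the op is loaded."""
    # The branching below runs a single time per node, not once per execution.
    if num_inputs == 1:
        return lambda inputs: (inputs[0],)
    if num_inputs == 2:
        return lambda inputs: (inputs[0], inputs[1])
    # Generic fallback for any other arity.
    return lambda inputs: tuple(inputs)


# "Load time": the case analysis happens here.
op = load_tuple_construct(2)

# "Execution time": each call runs the preselected function directly.
pair = op(["a", "b"])
```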

Test Plan:
- Existing unittests

- Ran inline_cvr nets via ptvsc2_predictor_bench with compare_result=1

Reviewed By: swolchok

Differential Revision: D32518670

fbshipit-source-id: 575b29b06eadf77ba9f1be306119fa194d4f21bf
(cherry picked from commit 88cc2253b927267cad33063284e9cc66e0d31e2f)
2022-02-24 16:38:55 +00:00
Raghavan Raman
a7f9e610af [Static Runtime] Adding out-variant support for quantized::linear_relu_dynamic_fp16 (#73238)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73238

ghstack-source-id: 149706142

Test Plan:
Tested on video model.
Profile before:
```
4.36922 ms.    17.1047%. quantized::linear_relu_dynamic_fp16 (14 nodes)
```
Profile after:
```
3.80852 ms.    17.1074%. quantized::linear_relu_dynamic_fp16 (14 nodes, out variant)
```

Reviewed By: mikeiovine

Differential Revision: D34287961

fbshipit-source-id: b88e2f3432215eac14fd36f945a4810d29ba1051
(cherry picked from commit 076a766ab471c362af2f2ee3b55fe75829f5e955)
2022-02-23 18:33:46 +00:00
Mike Iovine
6f84c5f0b9 [SR] Generalize VarStackNodeWrapper (#71573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71573

Many ops (`gather_ranges_to_dense`, `sigrid_transforms`, etc) are implemented like this:

```
void op_out_(std::vector<Tensor>& output) {
  // actual op implementation
}

std::vector<Tensor> op() {
  std::vector<Tensor> outputs;
  // populate outputs with empty tensors
  op_out_(outputs);
  return outputs;
}
```

This pattern is not ideal for ops that are fused with `ListUnpack` - it would be better if we wrote to the outputs directly.

This diff extends the ideas from `VarStackNodeWrapper` to allow for this. The changes are:

* `s/VarStackNodeWrapper/ProcessedNodeInputWrapper`. The old name was bad because the class is more general than the `VarStack` use case. Also moved the class to `processed_node_wrapper.h`
* Add a `ProcessedNodeOutputWrapper`; it's essentially the same idea as `ProcessedNodeInputWrapper`, but it allows non-const access to the underlying tensors.
* These classes are very similar, so CRTP is used to facilitate code re-use
ghstack-source-id: 148825800
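
A rough sketch of the wrapper idea above, in plain Python. All names here are hypothetical, and ordinary inheritance stands in for the CRTP used in the C++ code: a shared base holds the common logic, the input wrapper hands out read-only views, and the output wrapper hands out the underlying objects so the op can write to them directly.

```python
from typing import List


class NodeWrapperBase:
    """Shared logic for the hypothetical input and output wrappers."""

    def __init__(self, slots: List[List[int]]) -> None:
        self._slots = slots

    def __len__(self) -> int:
        return len(self._slots)


class InputWrapper(NodeWrapperBase):
    def __getitem__(self, index: int) -> tuple:
        # Read-only access: callers get an immutable copy.
        return tuple(self._slots[index])


class OutputWrapper(NodeWrapperBase):
    def __getitem__(self, index: int) -> List[int]:
        # Mutable access: callers write into the node's own storage.
        return self._slots[index]


def split_even_odd_out(values: List[int], outputs: OutputWrapper) -> None:
    """Out-style op: fills outputs[0] with evens and outputs[1] with odds."""
    outputs[0].clear()
    outputs[1].clear()
    for value in values:
        outputs[value % 2].append(value)


node_outputs: List[List[int]] = [[], []]
split_even_odd_out([1, 2, 3, 4], OutputWrapper(node_outputs))
first_input = InputWrapper([[7, 8]])[0]
```

Because the op writes through `OutputWrapper`, no intermediate list of results is built and then copied into the node's outputs.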

Test Plan: `buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Stack`

Reviewed By: swolchok

Differential Revision: D33687965

fbshipit-source-id: 5fa0107211116867bb2b63968c126550d32fbea6
(cherry picked from commit 75c263d960)
2022-02-10 19:43:47 +00:00
Scott Wolchok
958f9cf5ff [PyTorch][Static Runtime] Fix extra refcount bumps in layer_norm (#71237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71237

Noticed these on inspection.
ghstack-source-id: 147171799

Test Plan: CI

Reviewed By: mikeiovine

Differential Revision: D33519799

fbshipit-source-id: 167c63323b345a5822303cecdbbbbb959f66f6e4
(cherry picked from commit 57e8da2d35)
2022-01-20 00:16:17 +00:00
Scott Wolchok
bf82d2012e [PyTorch] Add IValue::toDimVector & mostly replace toIntVector with it (#71247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71247

Most uses of toIntVector() were for a Tensor shape. We have DimVector to avoid heap allocations in those cases, so let's use it.
ghstack-source-id: 146933314

Test Plan: CI -- if we think DimVector is good in general then I think we have to think this change is good?

Reviewed By: mikeiovine

Differential Revision: D33556198

fbshipit-source-id: cf2ad92c2d0b99ab1df4da0f6843e6ccb9a6320b
2022-01-14 14:32:40 -08:00
Elias Ellison
c8332256ee [JIT] Refactor SR invocation of fusion (#70508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70508

We can create the code object at compile time instead of at runtime to speed it up. This also makes the compilation cache unnecessary. TODO: figure out if there's a way to cache the InterpreterState object
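
As a loose analogy only (this is Python's built-in `compile`, not the TorchScript `Code` object), the pattern of building the code object once up front and reusing it on every run might look like this. `CompiledExpr` is a hypothetical class.

```python
class CompiledExpr:
    """Hypothetical holder that compiles an expression once and runs it many times."""

    def __init__(self, source: str) -> None:
        # Done once, at construction ("compile time").
        self._code = compile(source, "<expr>", "eval")

    def run(self, **variables: int) -> int:
        # Each run reuses the stored code object; nothing is recompiled
        # and no per-call cache lookup is needed.
        return eval(self._code, {"__builtins__": {}}, dict(variables))


expr = CompiledExpr("a * b + 1")
results = [expr.run(a=i, b=2) for i in range(3)]
```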

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D33458648

Pulled By: eellison

fbshipit-source-id: 710389741e7c6210528f2f96ab496fcd533d942a
2022-01-10 12:16:35 -08:00