Commit Graph

344 Commits

Author SHA1 Message Date
Kazuaki Ishizaki
50bd252863 Fix typo the the (#110869)
This PR fixes the duplicated-word typo `the the` in comments and exception messages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110869
Approved by: https://github.com/soulitzer
2023-10-09 19:32:45 +00:00
Aaron Gokaslan
144cda7f06 [BE]: Enable ruff's flake8-PYI rules (#110830)
Enable flake8-PYI rules codebase-wide. Most of the rules already match our codebase style; I have added the remaining ones that were not autofixed to `pyproject.toml`, to be enabled in a later PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110830
Approved by: https://github.com/albanD
2023-10-09 16:37:26 +00:00
Edward Z. Yang
6a974bec5d Change flash attention outputs to be SymInt instead of int (#110533)
Fixes https://github.com/pytorch/pytorch/issues/110322

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110533
Approved by: https://github.com/albanD
2023-10-05 01:00:07 +00:00
Fabrice Pont
053367b1ed fix: flake8-bugbear code B024 (#107265)
See #106571 item B024

This fix concerns the addition of `abstractmethod` to methods declared inside abstract classes.
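
As a minimal illustration of what B024 targets (a sketch with hypothetical class names, not code from this PR): flake8-bugbear reports B024 when a class derives from `abc.ABC` but declares no abstract methods, so nothing stops it from being instantiated directly. Marking the intended hook with `abstractmethod` restores that guarantee.

```python
from abc import ABC, abstractmethod


class KernelBase(ABC):  # hypothetical name, for illustration only
    @abstractmethod
    def run(self, x: int) -> int:
        """Subclasses must override this hook."""


class DoubleKernel(KernelBase):  # hypothetical concrete subclass
    def run(self, x: int) -> int:
        return 2 * x


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if the ABC machinery rejects it."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

With the decorator in place, `can_instantiate(KernelBase)` is false while the concrete subclass can still be constructed and used.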

Should I also include PEP 8-compliant reformatting of the files I had to modify?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107265
Approved by: https://github.com/kit1980
2023-10-04 23:52:52 +00:00
Mengwei Liu
0721a394b6 [executorch][kernel reg] Allow kernel manual registration (#110086)
Summary:
Exposes a codegen mode that generates a hook for users to register their kernels.

If we pass `--manual-registration` flag to `gen_executorch.py`, we will generate the following files:
1. RegisterKernels.h which declares a `register_all_kernels()` API inside `torch::executor` namespace.
2. RegisterKernelsEverything.cpp which implements `register_all_kernels()` by defining an array of generated kernels.

This way, users can depend on the library declared by the `executorch_generated_lib` macro (with `manual_registration=True`) and include `RegisterKernels.h`. They can then call `register_all_kernels()` manually instead of relying on the C++ static initialization mechanism, which is not available on some embedded systems.
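
The difference between explicit and load-time registration can be sketched in a language-neutral way. The Python below only models the idea and is not the generated C++: `KERNEL_REGISTRY`, `GENERATED_KERNELS`, and the kernel names are hypothetical, and only the name `register_all_kernels()` is taken from the description above.

```python
from typing import Callable, Dict, List, Tuple

Kernel = Callable[[float, float], float]

# Hypothetical registry standing in for the runtime's kernel table.
KERNEL_REGISTRY: Dict[str, Kernel] = {}

# Hypothetical "array of generated kernels"; the names are made up.
GENERATED_KERNELS: List[Tuple[str, Kernel]] = [
    ("toy_add", lambda a, b: a + b),
    ("toy_mul", lambda a, b: a * b),
]


def register_all_kernels() -> int:
    """Register every generated kernel explicitly; return how many were added."""
    for name, kernel in GENERATED_KERNELS:
        KERNEL_REGISTRY[name] = kernel
    return len(GENERATED_KERNELS)
```

Nothing is registered as a side effect of loading the module; the registry stays empty until the caller invokes `register_all_kernels()`, which is the property that matters on platforms without static initialization.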

Test Plan:
Rely on the unit test:

```
buck2 test fbcode//executorch/runtime/kernel/test:test_kernel_manual_registration
```

Reviewed By: cccclai

Differential Revision: D49439673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110086
Approved by: https://github.com/cccclai
2023-09-27 16:04:20 +00:00
Tarun Karuturi
a51b8df261 Add support for event_tracer in codegen layer (#109990)
Summary: Split out from D48975975, this handles the pytorch specific changes to add support for event_tracer in codegen layer.

Test Plan: CI

Reviewed By: dbort

Differential Revision: D49487710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109990
Approved by: https://github.com/Jack-Khuu
2023-09-27 09:09:03 +00:00
Brian Hirsh
63526a63f5 Make FunctionalTensor subclass to be more like functorch (interaction with ZeroTensor + Conjugate key) (#109023)
I added some tests for Conj, Neg and ZeroTensor for both Python and C++ functionalization. This also fixes a nasty segfault when running a functorch `jacfwd` test with `torch.compile`, once AOTAutograd is using `FunctionalTensor`.

Changes:

(1) I use Jeffrey's `make_wrapper_subclass(extra_dispatch_keys)` kwarg to plumb extra dispatch keys onto the wrapper, mirroring what C++ functionalization does (C++ functionalization will mirror all dispatch keys from the inner tensor to the wrapper, except for python and functorch keys).

(2) FunctionalTensorMode will decompose CompositeImplicitAutograd ops, since (for example) ZeroTensor kernels can send ops like `.to()` directly to the Python key. We'll need a way to toggle this later for pre-dispatch functionalization.

(3) Bound `_ForceDispatchKeyGuard` and BatchedTensorImpl's dispatch keyset to python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109023
Approved by: https://github.com/zou3519
ghstack dependencies: #108654, #109662, #109632
2023-09-22 07:09:04 +00:00
eellison
067f172930 Serialize Remaining Patterns (#108917)
Serializes the remaining traced patterns.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108917
Approved by: https://github.com/davidberard98
ghstack dependencies: #109663, #108894
2023-09-20 05:39:23 +00:00
eellison
16d608d70d Add Python serialization to Pattern Matcher patterns (#108894)
Adds a Python Pretty Printer to the pattern matcher that serializes patterns as python. Generating our fuse attention patterns was taking 4 seconds of compile time, which will only get worse as we add more variants (which I will do in the rest of this stack). To write out patterns, build pytorch, then run `gen_attention_patterns.py`.

Since there is a line limit for PRs, I'm only including `_sdpa_pattern1` in this first diff.
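
The idea of writing a pattern out as Python source, so that it can be loaded rather than re-traced at compile time, can be sketched with plain dataclasses. This is a toy model under stated assumptions: the `CallFunction` and `KeywordArg` classes below are hypothetical stand-ins defined in the snippet itself, and `serialize` / `deserialize` are not the PR's pretty printer.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class KeywordArg:  # hypothetical stand-in for a named pattern input
    name: str


@dataclass(frozen=True)
class CallFunction:  # hypothetical stand-in for a call node in a pattern
    target: str
    args: Tuple[Union["CallFunction", KeywordArg], ...]


def serialize(pattern) -> str:
    """Emit Python source; a dataclass repr is a valid constructor expression."""
    return repr(pattern)


def deserialize(source: str):
    """Rebuild the pattern by evaluating the source with only these classes in scope."""
    scope = {"__builtins__": {}, "CallFunction": CallFunction, "KeywordArg": KeywordArg}
    return eval(source, scope)


pattern = CallFunction(
    "toy_matmul",
    (KeywordArg("query"), CallFunction("toy_transpose", (KeywordArg("key"),))),
)
source = serialize(pattern)
```

Loading the saved source back with `deserialize(source)` yields an object equal to the original pattern, without redoing whatever work produced it.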

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108894
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663
2023-09-20 05:36:52 +00:00
PyTorch MergeBot
8705fc1bbd Revert "Add Python serialization to Pattern Matcher patterns (#108894)"
This reverts commit 7db175b6f6.

Reverted https://github.com/pytorch/pytorch/pull/108894 on behalf of https://github.com/eellison due to land race ([comment](https://github.com/pytorch/pytorch/pull/108894#issuecomment-1726649151))
2023-09-19 23:00:03 +00:00
PyTorch MergeBot
8b4b1817c8 Revert "Serialize Remaining Patterns (#108917)"
This reverts commit 7bf08b77f3.

Reverted https://github.com/pytorch/pytorch/pull/108917 on behalf of https://github.com/eellison due to land race ([comment](https://github.com/pytorch/pytorch/pull/108917#issuecomment-1726646267))
2023-09-19 22:54:52 +00:00
eellison
7bf08b77f3 Serialize Remaining Patterns (#108917)
Serializes the remaining traced patterns.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108917
Approved by: https://github.com/davidberard98
ghstack dependencies: #108894
2023-09-19 20:45:52 +00:00
eellison
7db175b6f6 Add Python serialization to Pattern Matcher patterns (#108894)
Adds a Python Pretty Printer to the pattern matcher that serializes patterns as python. Generating our fuse attention patterns was taking 4 seconds of compile time, which will only get worse as we add more variants (which I will do in the rest of this stack). To write out patterns, build pytorch, then run `gen_attention_patterns.py`.

Since there is a line limit for PRs, I'm only including `_sdpa_pattern1` in this first diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108894
Approved by: https://github.com/yanboliang
2023-09-19 20:36:52 +00:00
Brian Hirsh
f22b303f65 Add TorchDispatch version of functionalization (#106404)
This PR adds a new `FunctionalTensor` subclass, and `FunctionalTensorMode` torch dispatch mode. Together, this class/mode are a lightweight wrapper around our existing C++ functionalization logic.

This idea came from Ed - later in the stack, I want to be able to run functionalization **underneath** torch_dispatch, when performing tracing in AOTAutograd. I can't do this easily with vanilla C++ functionalization, because it has a dedicated dispatch key that always runs before TorchDispatch. However, by adding a torch_dispatch mode shim around functionalization, we can use functionalization as a torch_dispatch mode, which will make it easier to run underneath other modes later.

This PR provides the basic new classes, and some light testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106404
Approved by: https://github.com/ezyang
2023-09-15 20:19:25 +00:00
soulitzer
2bcff92540 Add NestedTensor python subclass (#108314)
Description coming soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108314
Approved by: https://github.com/jbschlosser
ghstack dependencies: #108808
2023-09-11 18:29:20 +00:00
Joel Schlosser
b928e08f3d Initial vmap + NT support with unbind fallback (#106786)
PoC demonstrating vmap + NT based on the [design doc](https://docs.google.com/document/d/1dVVk6TOqz93PLTIneU2T3xaxCs9qZ0MaJyCvOAp_bC0). This PR:
* Allows `BatchedTensorImpl`s to contain NTs
* Introduces a `BatchedNestedTensor` dispatch key for NT-specific batching rules
* Provides a batching rule fallback that unbinds the NTs -> performs computation on constituents -> rebinds results into NT

Restrictions:
* Only supports one level of vmap
* Only supports vmapping over dim=0 for NTs
    * For operations with mixed NT / dense inputs, support is also limited to dim=0 for the dense inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106786
Approved by: https://github.com/zou3519
2023-09-07 13:53:20 +00:00
Brian Hirsh
da54f3c519 reorder proxy / fake modes so they always run last (#104482)
**Update:** I refactored the original PR. The original description is kept below; the updates are described here:

(1) TLS changes in `TorchDispatchModeTLS.h/cpp`.

I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now **also** contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack).

`TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes.

`TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes".

`TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes).
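
The push/pop behavior described above can be modeled in a few lines. This is a toy Python sketch, not the C++ in `TorchDispatchModeTLS.h/cpp`: the class `ToyModeTLS`, its attribute names, and the order in which infra slots are popped are assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ModeKey(Enum):  # mirrors the two keys named in the description
    PROXY = 0
    FAKE = 1


class ToyModeTLS:
    def __init__(self) -> None:
        self.user_stack: List[object] = []
        self.infra_modes: Dict[ModeKey, Optional[object]] = {k: None for k in ModeKey}

    def push_onto_stack(self, mode: object, mode_key: Optional[ModeKey] = None) -> None:
        # A mode key routes the mode into its dedicated infra slot.
        if mode_key is None:
            self.user_stack.append(mode)
        else:
            self.infra_modes[mode_key] = mode

    def pop_stack(self) -> Tuple[object, Optional[ModeKey]]:
        # User modes are popped first; infra slots are only consulted afterwards.
        if self.user_stack:
            return self.user_stack.pop(), None
        for key in reversed(list(ModeKey)):  # assumed slot order
            mode = self.infra_modes[key]
            if mode is not None:
                self.infra_modes[key] = None
                return mode, key
        raise IndexError("pop from empty mode stack")

    def dispatch_mode_enabled(self, skip_infra_modes: bool = False) -> bool:
        if self.user_stack:
            return True
        return not skip_infra_modes and any(m is not None for m in self.infra_modes.values())
```

Returning the key from `pop_stack()` is what lets a caller push the same mode back into the right slot later.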

There were also some mild codegen changes to support the new enum

(2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py`

The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. `TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py`) now check if the current mode has a mode key, and if so they plumb it into any `push_onto_stack()` calls (which eventually instructs `TorchDispatchModeTLS` where to put the mode). The same goes for `ProxyTorchDispatchMode`.

I also had to change the enter/exit of both of these modes to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly.
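
The `enter_stack` bookkeeping can be sketched as follows. This is a toy model, not the code in `fake_tensor.py` or `proxy_tensor.py`: `InfraSlot` and `ToyInfraMode` are hypothetical names, and the single class attribute stands in for the dedicated slot that holds the ambient mode.

```python
from typing import List, Optional


class InfraSlot:
    """Hypothetical single slot holding the ambient fake (or proxy) mode."""

    current: Optional["ToyInfraMode"] = None


class ToyInfraMode:
    def __init__(self, name: str) -> None:
        self.name = name
        # One saved entry per active __enter__, so re-entering the same mode works.
        self.enter_stack: List[Optional["ToyInfraMode"]] = []

    def __enter__(self) -> "ToyInfraMode":
        # Take the ambient mode out of the slot and remember it for __exit__.
        self.enter_stack.append(InfraSlot.current)
        InfraSlot.current = self
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Restore whatever was ambient before this mode was entered.
        InfraSlot.current = self.enter_stack.pop()
        return False
```

Only one mode occupies the slot at a time, and nesting unwinds in the reverse order of entry.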

(3) dispatching logic in `python_arg_parser.cpp`

This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode()  # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses)
```

Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this will work today: if proxy/fake modes run in step (a), they'll return NotImplemented if they see a user subclass, allowing us to redispatch to the user subclass.

How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter via arguments that do/don't have the `_mode_key`.

(4) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).
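
The (a)/(b)/(c) dispatch order described above can be modeled with plain callables. This is a toy sketch, not the logic in `python_arg_parser.cpp`: `toy_dispatch` and its parameters are hypothetical, and each handler is just a function that returns a result or `NotImplemented` to defer.

```python
from typing import Callable, List, Sequence

Handler = Callable[[str], object]


def toy_dispatch(
    op: str,
    user_modes: Sequence[Handler],
    infra_modes: Sequence[Handler],
    user_subclasses: Sequence[Handler],
    infra_subclasses: Sequence[Handler],
) -> object:
    steps: List[Handler] = []
    # (a) the mode stack: infra modes sit below user modes, so an infra mode
    #     is only picked when no user mode is active.
    mode_stack = list(infra_modes) + list(user_modes)
    if mode_stack:
        steps.append(mode_stack[-1])
    steps.extend(user_subclasses)   # (b) user subclasses next
    steps.extend(infra_subclasses)  # (c) infra subclasses last
    for handler in steps:
        result = handler(op)
        if result is not NotImplemented:
            return result
    raise TypeError(f"no handler accepted {op!r}")
```

In this model, an infra mode that returns `NotImplemented` in step (a) lets the call fall through to a user subclass in step (b), which is the behavior the note about user subclasses relies on.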

----- original description below -----

The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run **after** any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit

Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that.

**(1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode**

This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes.

I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, *including* fake or proxy, if one is set.

This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler.

**(2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser()**

This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs.

**(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` API's**

Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new API's to directly check if either proxy or fake modes are active.

**(4) Allow traceable tensor subclasses to access storages from python**
This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, where those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to python if the tensor subclass that you are passing in is "traceable".

**(5) Fixed subclass fakeification**

@wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct.

**(6) make tensor subclasses resizeable**

Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`.

**(7) Added a basic DoubleTensor subclass for testing**

I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482
Approved by: https://github.com/ezyang
ghstack dependencies: #107415
2023-08-29 02:36:48 +00:00
cyy
d9fb7166d6 [BE] use DeviceIndex instead of int64_t for related device interfaces (#103068)
This PR unifies the device interfaces in aten/*cpp and torch/csrc/*cpp to use  **c10::DeviceIndex**.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103068
Approved by: https://github.com/malfet
2023-08-25 20:16:14 +00:00
Masaki Kozuki
5814380e7b Revert "Revert "Reland "Add forward mode AD to out-place foreach functions (#102409) (#106043)""" (#106320)
Fixed a typo in the number of tensors and elements specified in the test that had failed in slow gradcheck.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106320
Approved by: https://github.com/soulitzer
2023-08-18 23:01:42 +00:00
Jun Luo
2abcfc40b0 Enable torchgen for MTIA dispatch key (#107046)
Summary: As title.

Test Plan: See diff D48258693

Differential Revision: D48273743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107046
Approved by: https://github.com/albanD
2023-08-15 07:56:18 +00:00
Tugsbayasgalan Manlaibaatar
20c5add133 [export] Refactor constrain_as_value and constrain_as_size (#106591)
Some notable changes:
1. `constrain_as_size` now allows the min value to be less than 2, since it unconditionally assumes min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we would runtime-assert that the unbacked symint's value range was always within [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.
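
One possible reading of the two changes is sketched below. This is an illustration only, not the implementation of `constrain_as_size`: the function `toy_size_ranges` is hypothetical, and treating "greater than 2" as a strict inequality is an assumption.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]


def toy_size_ranges(max_value: int, min_value: Optional[int] = None) -> Tuple[Range, Range]:
    """Return (compiler_range, runtime_range) for a size-like unbacked value."""
    # Assumed strict check: the max must be greater than 2.
    if max_value <= 2:
        raise ValueError("max value must be greater than 2")
    # The compiler unconditionally assumes min >= 2, whatever the user passed.
    compiler_min = 2 if min_value is None else max(2, min_value)
    # The runtime assert uses [0, max] unless the user gave an explicit min.
    runtime_min = 0 if min_value is None else min_value
    return (compiler_min, max_value), (runtime_min, max_value)
```

For example, `toy_size_ranges(10)` gives a compiler range of `(2, 10)` and a runtime range of `(0, 10)`, so a runtime value of 0 or 1 is accepted even though the compiler reasons as if the size were at least 2.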

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
2023-08-15 05:41:43 +00:00
Mengwei Liu
ddd2f682b9 [executorch] Let custom ops registration code only import ATen headers (#107064)
Summary: We generate `CustomOpsNativeFunctions.h` for registering custom ops into the PyTorch JIT runtime. This header needs to hook up with the C++ kernel implementations of all the custom ops, so it should include ATen headers instead of Executorch headers. This PR makes that change.

Test Plan: Rely on existing CI jobs

Differential Revision: D48282828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107064
Approved by: https://github.com/kirklandsign
2023-08-13 00:34:34 +00:00
David Watson
c9cdcb299a Remove ExclusivelyOwned from register_dispatch_key (#106791)
This fixes a bug that could occur with python decompositions.

When an operation is intercepted in the C++ code in PyTorch, the outputs are created as `ExclusivelyOwned<at::Tensor>`s. Later on, when it dispatches back to Python for the decomposition, these tensors have their ownership shared with Python. In the normal case the exclusively owned tensor is released and its value is returned as a non-exclusively owned tensor from the operation. However, if the Python decomposition throws an error, the `ExclusivelyOwned` wrapper destroys the `at::Tensor`, leaving Python with a reference to a tensor that is no longer alive (which means PyTorch falls over in debug mode).

Note this will be a performance hit when handling errors.

Fixes #106790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106791
Approved by: https://github.com/ezyang
2023-08-11 21:04:33 +00:00
PyTorch MergeBot
745d29b0cc Revert "[export] Refactor constrain_as_value and constrain_as_size (#106591)"
This reverts commit 18989890bf.

Reverted https://github.com/pytorch/pytorch/pull/106591 on behalf of https://github.com/izaitsevfb due to Breaks inductor test on trunk ([comment](https://github.com/pytorch/pytorch/pull/106591#issuecomment-1675069091))
2023-08-11 16:37:47 +00:00
Tugsbayasgalan Manlaibaatar
18989890bf [export] Refactor constrain_as_value and constrain_as_size (#106591)
Some notable changes:
1. `constrain_as_size` now allows the min value to be less than 2, since it unconditionally assumes min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we would runtime-assert that the unbacked symint's value range was always within [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
2023-08-11 05:29:22 +00:00
Alexander Pivovarov
02abbb8109 Fix some typos, mostly "that that" (#106901)
Fix some typos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106901
Approved by: https://github.com/janeyx99
2023-08-10 19:46:53 +00:00
PyTorch MergeBot
2b427ae3a7 Revert "Reland "Add forward mode AD to out-place foreach functions (#102409) (#106043)"
This reverts commit e773f28ee3.

Reverted https://github.com/pytorch/pytorch/pull/106043 on behalf of https://github.com/DanilBaibak due to Break slow tests ([comment](https://github.com/pytorch/pytorch/pull/106043#issuecomment-1658642734))
2023-07-31 15:50:36 +00:00
Aaron Gokaslan
52d4b1ae31 [BE]: Enable ruff rules PIE807 and PIE810 (#106218)
* Enables PIE807 + PIE810. PIE807 flags reimplementing the `list` builtin with a lambda, and PIE810 requires fusing multiple `startswith` / `endswith` calls into one (I applied the autofixes for this before we had ruff enabled).
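
A short before/after sketch of the two rules, using hypothetical names that are not taken from the codebase:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConfigBefore:
    # PIE807 flags this: the lambda just reimplements the `list` builtin.
    items: List[int] = field(default_factory=lambda: [])


@dataclass
class ConfigAfter:
    # Preferred form: pass the builtin directly.
    items: List[int] = field(default_factory=list)


def is_generated_before(name: str) -> bool:
    # PIE810 flags this: two startswith calls on the same string joined by `or`.
    return name.startswith("gen_") or name.startswith("auto_")


def is_generated_after(name: str) -> bool:
    # Fused form: startswith accepts a tuple of prefixes.
    return name.startswith(("gen_", "auto_"))
```

Both rewrites preserve behavior; they only remove the redundant lambda and the repeated call.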
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106218
Approved by: https://github.com/albanD
2023-07-28 22:35:56 +00:00
Masaki Kozuki
e773f28ee3 Reland "Add forward mode AD to out-place foreach functions (#102409) (#106043)
forward-mode AD of out-of-place foreach functions, finally.

rel:
- #102409
- #105504
- #58833
- #100695

---

# Generated Foreach
```c++
::std::vector<at::Tensor> _foreach_sinh(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
  }
  std::shared_ptr<ForeachSinhBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachSinhBackward0>(new ForeachSinhBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = make_saved_variable_list(self);
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::_foreach_sinh(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_result[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
        auto self_p = toNonOptPrimal(self[i]);
        result_new_fw_grad_opts[i] = (self_t.conj() * self_p.cosh().conj()).conj();
    }
  }
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
    if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
    }
  }
  return result;
}

::std::vector<at::Tensor> _foreach_norm_Scalar(c10::DispatchKeySet ks, at::TensorList self, const at::Scalar & ord) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
  }
  std::shared_ptr<ForeachNormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachNormBackward0>(new ForeachNormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->ord = ord;
    grad_fn->self_ = make_saved_variable_list(self);
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::_foreach_norm(ks & c10::after_autograd_keyset, self_, ord);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_result[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
        auto self_p = toNonOptPrimal(self[i]);
        result_new_fw_grad_opts[i] = norm_jvp(self_p, self_t, ord, result[i]);
    }
  }
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
    if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
    }
  }
  if (grad_fn) {
    grad_fn->result = result;
  }
  return result;
}

```

# Reference
```c++
at::Tensor sinh(c10::DispatchKeySet ks, const at::Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
  std::shared_ptr<SinhBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<SinhBackward0>(new SinhBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::sinh(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value() &&
      !at::impl::dispatch_mode_enabled() &&
      !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
    TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: sinh");
  }
  if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
    TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: sinh");
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_result && (result.defined())) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_tensor = toNonOptTensor(self);
      auto self_t = (self_t_raw.defined() || !self_tensor.defined())
        ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
      auto self_p = toNonOptPrimal(self);
      result_new_fw_grad_opt = (self_t.conj() * self_p.cosh().conj()).conj();
  }
  if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
  return result;
}
at::Tensor norm_Scalar(c10::DispatchKeySet ks, const at::Tensor & self, const at::Scalar & p) {
  auto& self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
  std::shared_ptr<NormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<NormBackward0>(new NormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->p = p;
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::norm(ks & c10::after_autograd_keyset, self_, p);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value() &&
      !at::impl::dispatch_mode_enabled() &&
      !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
    TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: norm_Scalar");
  }
  if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
    TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: norm_Scalar");
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "norm");
  c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_result && (result.defined())) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_tensor = toNonOptTensor(self);
      auto self_t = (self_t_raw.defined() || !self_tensor.defined())
        ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
      auto self_p = toNonOptPrimal(self);
      result_new_fw_grad_opt = norm_jvp(self_p, self_t, p, result);
  }
  if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
  if (grad_fn) {
    grad_fn->result_ = SavedVariable(result, true);
  }
  return result;
}

```
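The forward-gradient line in the generated `sinh` kernel, `(self_t.conj() * self_p.cosh().conj()).conj()`, is algebraically `self_t * cosh(self_p)`: the input tangent scaled by the derivative of `sinh` at the primal. A minimal pure-Python sketch with complex scalars standing in for tensors follows; the helper names `sinh_jvp` and `finite_difference` are illustrative and are not PyTorch APIs.

```python
import cmath


def sinh_jvp(primal: complex, tangent: complex) -> complex:
    """Mirror the generated expression, with scalars in place of tensors."""
    return (tangent.conjugate() * cmath.cosh(primal).conjugate()).conjugate()


def finite_difference(primal: complex, tangent: complex, eps: float = 1e-6) -> complex:
    """Central difference of sinh along `tangent`, used only as a cross-check."""
    forward = cmath.sinh(primal + eps * tangent)
    backward = cmath.sinh(primal - eps * tangent)
    return (forward - backward) / (2 * eps)
```

Because conjugation distributes over products, the two inner conjugations and the outer one cancel, which is why the result matches the plain product `tangent * cosh(primal)`.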
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106043
Approved by: https://github.com/soulitzer
2023-07-27 03:13:24 +00:00
Alan Ji
70b0f1b248 fix some typos (#106018)
Fixes #ISSUE_NUMBER
Fix typos in `test_static_module.cc`, `backend_cutting_test.cc` and `types_base.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106018
Approved by: https://github.com/awgu
2023-07-26 18:14:44 +00:00
Aaron Gokaslan
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`
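
For readers unfamiliar with this kind of rewrite, here is a small sketch of the equivalence the conversion relies on. The function names are made up for illustration and do not come from the PyTorch codebase.

```python
def describe_percent(name: str, count: int) -> str:
    # Old style: printf-like formatting.
    return "%s has %d commits" % (name, count)


def describe_format(name: str, count: int) -> str:
    # Old style: str.format with positional placeholders.
    return "{} has {} commits".format(name, count)


def describe_fstring(name: str, count: int) -> str:
    # f-string form: the expressions are written inline.
    return f"{name} has {count} commits"
```

All three functions return the same string for the same arguments, so the rewrite is behavior-preserving in simple cases like this one.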

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Amadeusz Skrzypczak
b64bd4a5dd Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242)
Proposal of two float8 variants - e5m2 and e4m3 - based on https://arxiv.org/pdf/2209.05433.pdf

Hide all Float8 operator implementations behind `#if !defined(C10_MOBILE)` guard to keep Android build size almost unchanged

TODO:
 - Refactor duplicated code
 - Cleanup unbalanced pragma pop in dtype utils
 - Add native implementation on the CUDA side
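
To make the two layouts concrete, the sketch below decodes 8-bit patterns following the format description in the cited paper: e5m2 uses 5 exponent bits and 2 mantissa bits with bias 15 and IEEE-style infinities, while e4m3 uses 4 exponent bits and 3 mantissa bits with bias 7 and no infinities. It is pure Python rather than PyTorch code, the function names are hypothetical, and whether the PyTorch dtypes match this bit for bit is an assumption.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int, has_inf: bool) -> float:
    """Decode one 8-bit pattern laid out as sign | exponent | mantissa."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_mask = (1 << exp_bits) - 1
    man_mask = (1 << man_bits) - 1
    exp = (byte >> man_bits) & exp_mask
    man = byte & man_mask
    if exp == exp_mask:
        if has_inf:
            # IEEE-style: an all-ones exponent encodes infinity or NaN.
            return sign * math.inf if man == 0 else math.nan
        if man == man_mask:
            # No infinities: only the all-ones pattern is NaN.
            return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, has_inf=True)


def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, has_inf=False)
```

Under these assumptions the largest finite values are 57344 for e5m2 and 448 for e4m3, which shows the trade-off between range and precision.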

Co-authored-by: Nikita Shulga <nshulga@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104242
Approved by: https://github.com/albanD
2023-07-20 16:09:11 +00:00
PyTorch MergeBot
f2b15772ff Revert "Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242)"
This reverts commit a9804130e5.

Reverted https://github.com/pytorch/pytorch/pull/104242 on behalf of https://github.com/PaliC due to breaks lint (run lintrunner and remerge) ([comment](https://github.com/pytorch/pytorch/pull/104242#issuecomment-1644150284))
2023-07-20 15:37:53 +00:00
Amadeusz Skrzypczak
a9804130e5 Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242)
Proposal of two float8 variants - e5m2 and e4m3 - based on https://arxiv.org/pdf/2209.05433.pdf

Hide all Float8 operator implementations behind `#if !defined(C10_MOBILE)` guard to keep Android build size almost unchanged

TODO:
 - Refactor duplicated code
 - Cleanup unbalanced pragma pop in dtype utils
 - Add native implementation on the CUDA side

Co-authored-by: Nikita Shulga <nshulga@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104242
Approved by: https://github.com/albanD
2023-07-20 09:45:45 +00:00
Justin Chu
964d29f312 [BE] Enable ruff's UP rules and autoformat torchgen/ (#105423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105423
Approved by: https://github.com/Skylion007
2023-07-18 06:44:20 +00:00
Nikita Shulga
5837e95d30 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

These were reverted due to a conflict with the internal source repo.

Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
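
The PEP 484 violation mentioned above is the "implicit Optional" pattern. A minimal sketch of the before and after shapes follows; the function names are invented for illustration. Both versions behave identically at runtime, and the difference only matters to a type checker such as mypy.

```python
from typing import List, Optional


def join_names_implicit(names: List[str] = None) -> str:
    # Violation: the default is None, but the annotation does not allow None.
    return ", ".join(names or [])


def join_names_explicit(names: Optional[List[str]] = None) -> str:
    # Fix: the annotation states that None is an accepted value.
    return ", ".join(names or [])
```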

Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to squash older libstdc++ from conda environment in favor of the one from the OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-15 20:30:20 +00:00
PyTorch MergeBot
15fd1ea118 Revert "[Reland] Update mypy to 1.4.1 (#105227)"
This reverts commit c9c4f8efc3.

Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))
2023-07-14 22:28:35 +00:00
Dave Bort
d06e1df1aa [torchgen] Rename executorch's RuntimeContext to KernelRuntimeContext (#104892)
Rename the context type to match changes in executorch.

Differential Revision: [D46977359](https://our.internmc.facebook.com/intern/diff/D46977359/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104892
Approved by: https://github.com/larryliu0820
2023-07-14 21:15:50 +00:00
Nikita Shulga
c9c4f8efc3 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

These were reverted due to a conflict with the internal source repo.

Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-14 20:45:12 +00:00
PyTorch MergeBot
3c5a494d7a Revert "Update mypy to 1.4.1 (#91983)"
This reverts commit 634659e262.

Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to Its dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))
2023-07-14 15:59:16 +00:00
Nikita Shulga
634659e262 Update mypy to 1.4.1 (#91983)
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi
2023-07-13 16:30:36 +00:00
Jane Xu
038cb4075a Add capturable/maximize tests to Adam(W) optim configs (#104669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104669
Approved by: https://github.com/albanD
2023-07-10 17:38:46 +00:00
PyTorch MergeBot
8958f041be Revert "Add forward mode AD to out-place foreach functions (#102409)"
This reverts commit e2ec0ba404.

Reverted https://github.com/pytorch/pytorch/pull/102409 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it is failing some tests in trunk e799f565eb ([comment](https://github.com/pytorch/pytorch/pull/102409#issuecomment-1615254393))
2023-06-30 22:46:57 +00:00
Masaki Kozuki
e2ec0ba404 Add forward mode AD to out-place foreach functions (#102409)
The major difference from in-place support is that some out-of-place functions have their derivatives spelled out in derivatives.yaml, while the rest have derivatives generated by `torchgen`. This requires changes in `load_derivatives.py` and extra handling in various places.

rel:
- #58833
- #100695

---

# Generated Foreach
```c++
::std::vector<at::Tensor> _foreach_sinh(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
  }
  std::shared_ptr<ForeachSinhBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachSinhBackward0>(new ForeachSinhBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = make_saved_variable_list(self);
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::_foreach_sinh(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_result[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
        auto self_p = toNonOptPrimal(self[i]);
        result_new_fw_grad_opts[i] = (self_t.conj() * self_p.cosh().conj()).conj();
    }
  }
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
    if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
    }
  }
  return result;
}

::std::vector<at::Tensor> _foreach_norm_Scalar(c10::DispatchKeySet ks, at::TensorList self, const at::Scalar & ord) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
  }
  std::shared_ptr<ForeachNormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachNormBackward0>(new ForeachNormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->ord = ord;
    grad_fn->self_ = make_saved_variable_list(self);
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::_foreach_norm(ks & c10::after_autograd_keyset, self_, ord);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_result[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
        auto self_p = toNonOptPrimal(self[i]);
        result_new_fw_grad_opts[i] = norm_jvp(self_p, self_t, ord, result[i]);
    }
  }
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
    if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
    }
  }
  if (grad_fn) {
    grad_fn->result = result;
  }
  return result;
}

```
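
In the foreach variants above, the forward-gradient flag is tracked per tensor rather than once per call, so an output whose input carries no tangent is left without one. A rough Python sketch of that control flow follows, using plain floats in place of tensors and `None` for a missing tangent. `foreach_sinh_with_tangents` is a hypothetical name and not part of PyTorch.

```python
import math
from typing import List, Optional, Tuple


def foreach_sinh_with_tangents(
    primals: List[float], tangents: List[Optional[float]]
) -> Tuple[List[float], List[Optional[float]]]:
    # One flag per input, like the per-element _any_has_forward_grad_result vector.
    has_fw_grad = [t is not None for t in tangents]
    results = [math.sinh(p) for p in primals]
    out_tangents: List[Optional[float]] = [None] * len(primals)
    for i, flagged in enumerate(has_fw_grad):
        if flagged:
            # d/dx sinh(x) = cosh(x), applied to the incoming tangent.
            out_tangents[i] = tangents[i] * math.cosh(primals[i])
    return results, out_tangents
```

Inputs without a tangent still produce a primal result; only the tangent slot stays empty.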

# Reference
```c++
at::Tensor sinh(c10::DispatchKeySet ks, const at::Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
  std::shared_ptr<SinhBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<SinhBackward0>(new SinhBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::sinh(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value() &&
      !at::impl::dispatch_mode_enabled() &&
      !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
    TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: sinh");
  }
  if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
    TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: sinh");
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_result && (result.defined())) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_tensor = toNonOptTensor(self);
      auto self_t = (self_t_raw.defined() || !self_tensor.defined())
        ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
      auto self_p = toNonOptPrimal(self);
      result_new_fw_grad_opt = (self_t.conj() * self_p.cosh().conj()).conj();
  }
  if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
  return result;
}
at::Tensor norm_Scalar(c10::DispatchKeySet ks, const at::Tensor & self, const at::Scalar & p) {
  auto& self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
  std::shared_ptr<NormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<NormBackward0>(new NormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->p = p;
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::norm(ks & c10::after_autograd_keyset, self_, p);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value() &&
      !at::impl::dispatch_mode_enabled() &&
      !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
    TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: norm_Scalar");
  }
  if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
    TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: norm_Scalar");
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "norm");
  c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_result && (result.defined())) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_tensor = toNonOptTensor(self);
      auto self_t = (self_t_raw.defined() || !self_tensor.defined())
        ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
      auto self_p = toNonOptPrimal(self);
      result_new_fw_grad_opt = norm_jvp(self_p, self_t, p, result);
  }
  if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
  if (grad_fn) {
    grad_fn->result_ = SavedVariable(result, true);
  }
  return result;
}

```
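
The reference `norm_Scalar` kernel delegates its forward gradient to `norm_jvp`. As background, the textbook directional derivative of a p-norm for real inputs, finite p >= 1 and a nonzero input is sum(sign(x_i) * |x_i|^(p-1) * t_i) / ||x||_p^(p-1). The sketch below implements only that formula in plain Python. It is not a transcription of PyTorch's `norm_jvp`, which also has to deal with cases this sketch ignores, and the function names are illustrative.

```python
import math
from typing import Sequence


def p_norm(x: Sequence[float], p: float) -> float:
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def p_norm_jvp(x: Sequence[float], t: Sequence[float], p: float) -> float:
    """Directional derivative of the p-norm at x along tangent t."""
    norm = p_norm(x, p)
    if norm == 0.0:
        raise ValueError("the p-norm is not differentiable at the origin")
    weighted = sum(
        math.copysign(abs(v) ** (p - 1.0), v) * tv for v, tv in zip(x, t)
    )
    return weighted / norm ** (p - 1.0)
```

For p = 2 this reduces to the familiar dot(x, t) / ||x||.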
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102409
Approved by: https://github.com/soulitzer
2023-06-30 04:51:43 +00:00
Jack Khuu
18dacf7e79 [Specialized Kernel] Update yaml syntax to use kernel instead of dispatch (#104070)
Based on this [code search](https://fburl.com/code/gjcnw8ly) (*.yaml with `dispatch: CPU:`), update all files found to use

```
kernels:
    - arg_meta: None
      kernel_name:
```
instead of
```
dispatch:
    CPU:
```
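
As an illustration of the rewrite described above, the sketch below converts one already-parsed YAML entry from the `dispatch:` form to the `kernels:` form. The helper name `dispatch_to_kernels` and the sample kernel name in the usage note are hypothetical. Representing `arg_meta: None` as Python `None` is also an assumption, since the commit does not show how the loader interprets that value.

```python
from typing import Any, Dict


def dispatch_to_kernels(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a `dispatch: {CPU: name}` entry into the `kernels:` list form."""
    converted = {key: value for key, value in entry.items() if key != "dispatch"}
    dispatch = entry.get("dispatch")
    if dispatch is None:
        return converted
    if set(dispatch) != {"CPU"}:
        # Entries dispatched to more than CPU are out of scope for this sketch.
        raise ValueError("only CPU-only dispatch entries are handled here")
    converted["kernels"] = [{"arg_meta": None, "kernel_name": dispatch["CPU"]}]
    return converted
```

For example, `dispatch_to_kernels({"func": "my_op.out", "dispatch": {"CPU": "my_ns::my_op_out"}})` yields an entry with a one-element `kernels` list and no `dispatch` key.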
---
## Code changes:

- `fbcode/executorch/codegen/tools/gen_oplist.py`
  - Strip ET specific fields prior to calling parse_native_yaml_struct
---
## Files edited that are not `*functions.yaml` or `custom_ops.yaml`

- fbcode/executorch/kernels/optimized/optimized.yaml
- fbcode/executorch/kernels/quantized/quantized.yaml
- fbcode/executorch/kernels/test/custom_kernel_example/my_functions.yaml

---
## Found Files that were not edited

**Dispatched to more than just CPU**
- fbcode/caffe2/aten/src/ATen/native/native_functions.yaml
- xplat/caffe2/aten/src/ATen/native/native_functions.yaml
- xros/third-party/caffe2/caffe2/aten/src/ATen/native/native_functions.yaml

**Grouped ops.yaml path**
- fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/ops.yaml

---
**Design Doc:** https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS#heading=h.8raqyft9y50

Differential Revision: [D46952067](https://our.internmc.facebook.com/intern/diff/D46952067/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46952067/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104070
Approved by: https://github.com/larryliu0820
2023-06-27 09:53:20 +00:00
Driss Guessous
4a008d268a REDO of dropout support for mem eff #102038 (#103704)
This is a new PR with the changes from #102038 and #103201, plus namespacing changes to fix a bug.

# Summary
This PR builds off of:
- https://github.com/pytorch/pytorch/pull/101847
- https://github.com/pytorch/pytorch/pull/100583

It specifically adds dropout support to the memory efficient attention kernel. In the process of doing so roughly 3 changes were made:
- Update sdpa dispatching to allow for inputs requiring grad to be sent to efficient attention
- Update how memory efficient attention handles passing the rng state from forward to backward in order to enable cuda_graph support
- Fix a bug in the kernel that was causing incorrect gradients to be produced for num_keys > 64 with dropout and causal masking set. https://github.com/facebookresearch/xformers/pull/755
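
The second bullet concerns carrying RNG state from forward to backward. The general idea can be sketched as follows: the forward pass records only the state needed to regenerate the dropout mask, and the backward pass rebuilds the identical mask from it. This is a conceptual sketch in plain Python. The `(seed, offset)` representation and all function names are assumptions made for illustration, not the kernel's actual interface.

```python
import random
from typing import List, Tuple


def dropout_mask(seed: int, offset: int, n: int, p: float) -> List[bool]:
    # Deterministic in (seed, offset): True means the element is kept.
    rng = random.Random(f"{seed}:{offset}")
    return [rng.random() >= p for _ in range(n)]


def dropout_forward(
    x: List[float], p: float, seed: int, offset: int
) -> Tuple[List[float], Tuple[int, int]]:
    mask = dropout_mask(seed, offset, len(x), p)
    scale = 1.0 / (1.0 - p)
    out = [v * scale if keep else 0.0 for v, keep in zip(x, mask)]
    # Only the RNG state is saved for backward, not the mask itself.
    return out, (seed, offset)


def dropout_backward(
    grad_out: List[float], p: float, rng_state: Tuple[int, int]
) -> List[float]:
    seed, offset = rng_state
    mask = dropout_mask(seed, offset, len(grad_out), p)
    scale = 1.0 / (1.0 - p)
    return [g * scale if keep else 0.0 for g, keep in zip(grad_out, mask)]
```

Since the mask is a pure function of the saved state, forward and backward zero exactly the same positions.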

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103704
Approved by: https://github.com/cpuhrsch
2023-06-26 23:05:03 +00:00
Jack Khuu
d1c367470b [Specialized Kernel] Remove requirement for type_alias and dim_order_alias to be present (#104006)
These fields are not required when the kernels provided do not use aliases (e.g. only a default kernel).

Differential Revision: [D46916099](https://our.internmc.facebook.com/intern/diff/D46916099/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104006
Approved by: https://github.com/larryliu0820
2023-06-23 16:49:57 +00:00
Mengwei Liu
ce845dfe49 [Reland][ET] Select used et_kernel_metadata only (#104005)
Summary: Currently we rely on the root operator, but we also need to check et_kernel_metadata for the specialized kernels that are used.

Test Plan: contbuild & OSS CI

Reviewed By: Jack-Khuu

Differential Revision: D46882119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104005
Approved by: https://github.com/Jack-Khuu
2023-06-23 14:38:45 +00:00
PyTorch MergeBot
08a7d60a46 Revert "[Reland][ET] Select used et_kernel_metadata only (#103705)"
This reverts commit 59a01c49ee.

Reverted https://github.com/pytorch/pytorch/pull/103705 on behalf of https://github.com/osalpekar due to large number of internal failures in executorch contbuild. See [D46882119](https://www.internalfb.com/diff/D46882119) for more details ([comment](https://github.com/pytorch/pytorch/pull/103705#issuecomment-1601789900))
2023-06-21 22:51:38 +00:00