PoC demonstrating vmap + NT based on the [design doc](https://docs.google.com/document/d/1dVVk6TOqz93PLTIneU2T3xaxCs9qZ0MaJyCvOAp_bC0). This PR:
* Allows `BatchedTensorImpl`s to contain NTs
* Introduces a `BatchedNestedTensor` dispatch key for NT-specific batching rules
* Provides a batching rule fallback that unbinds the NTs -> performs the computation on the constituents -> rebinds the results into an NT (see the sketch after the restrictions below)
Restrictions:
* Only supports one level of vmap
* Only supports vmapping over dim=0 for NTs
* For operations with mixed NT / dense inputs, support is also limited to dim=0 for the dense inputs
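A rough usage sketch of what the PoC enables (illustrative only; exact entry points and behavior may differ from this PR):
```python
import torch

# Hypothetical usage sketch: vmap over dim=0 of a nested tensor (NT).
nt = torch.nested.nested_tensor([torch.randn(3), torch.randn(5)])

# What the unbind -> compute -> rebind fallback conceptually does:
def nt_fallback(op, nt):
    return torch.nested.nested_tensor([op(t) for t in nt.unbind()])

out = torch.vmap(torch.sin)(nt)   # per-constituent sin, batched over dim 0
ref = nt_fallback(torch.sin, nt)  # reference computed via explicit unbind/rebind
```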
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106786
Approved by: https://github.com/zou3519
**Update:** Made refactor of the original PR. See the original description below, but here I'll describe the updates:
(1) TLS changes in `TorchDispatchModeTLS.h/cpp`.
I added a `TorchDispatchModeKey` enum, that (for now) just contains PROXY and FAKE. The ModeTLS used to just contain a `std::vector<std::shared_ptr<c10::SafePyObject>>` corresponding to the mode stack. It now **also** contains a separate array of "infra modes", indexed by mode key (PROXY and FAKE, with a new addition, FUNCTIONAL, coming later in the stack).
`TorchDispatchModeTLS::push_onto_stack` and `TorchDispatchModeTLS::pop_stack` are now a bit more complicated. Pushing accepts an optional mode_key, which if set, tells us to add the given mode directly to our "infra_modes" array. Popping will first check the "user mode" stack, before trying to pop anything from the infra mode stack. It also optionally returns the mode key of the mode we popped if there was one - that way if we push that same mode back onto the TLS later, we know where it goes.
`TorchDispatchModeTLS::dispatch_mode_enabled()` now accepts an optional `skip_infra_modes` param, so you can separately query if there are "any modes at all", or if there are "any user modes".
`TorchDispatchModeTLS::get/set/unset_mode()` all take in a mode key, and get/set/unset the mode at that particular mode key (meaning they are only meant to be used for infra modes).
There were also some mild codegen changes to support the new enum.
(2) `fake_tensor.py/proxy_tensor.py/_python_dispatch.py`
The way I tell the infra that certain subclasses/modes are "infra" is through the enum: I gave `FakeTensor` and `FakeTensorMode` a `self._mode_key = torch._C.TorchDispatchModeKey.FAKE`. `TorchDispatchMode.__enter/exit__()` (in `_python_dispatch.py`) now checks if the current mode has a mode key, and if so plumbs it into any `push_onto_stack()` calls (which eventually tells `TorchDispatchModeTLS` where to put the mode). Same thing for `ProxyTorchDispatchMode`.
I also had to change both of these mode's enter/exit, to handle the fact that there can no longer be multiple proxy/fake modes on the mode stack at once. I updated them both to have a `self.enter_stack: List[Optional[TorchDispatchMode]]` - whenever we push a given mode in `__enter__`, we remove the current ambient fake/proxy mode from the mode stack, and save it in `enter_stack`, so that on exit we can reset the state properly.
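A plain-Python sketch of that bookkeeping (names here are illustrative; the real logic lives in `fake_tensor.py`/`proxy_tensor.py` and the C++ TLS):
```python
# Illustrative sketch only: at most one FAKE and one PROXY "infra mode" can be
# ambient at a time, so entering a new one displaces and remembers the old one.
infra_slots = {"FAKE": None, "PROXY": None}

class InfraModeSketch:
    def __init__(self, key):
        self.key = key            # stand-in for torch._C.TorchDispatchModeKey.*
        self.enter_stack = []

    def __enter__(self):
        self.enter_stack.append(infra_slots[self.key])  # save the ambient mode
        infra_slots[self.key] = self
        return self

    def __exit__(self, *exc):
        infra_slots[self.key] = self.enter_stack.pop()  # restore it on exit
```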
(3) Dispatching logic in `python_arg_parser.cpp`
This is where the core dispatching logic changes are. I added two helpers, `dispatch_on_subclass()` and `dispatch_on_mode()`. The overall dispatching order is now:
```
(a) dispatch_on_mode() # try user modes first (where the mode stack automatically considers infra modes last)
(b) dispatch_on_subclass() # try user subclasses next (skipping infra subclasses)
(c) dispatch_on_subclass() # try infra subclasses next (skipping user subclasses)
```
Note that we still want "user subclasses" to run before "infra modes". As Ed helped me realize, this works today: if the proxy/fake modes in step (a) see a user subclass, they'll return NotImplemented, allowing us to redispatch to the user subclass.
How do (b) and (c) distinguish between user and infra subclasses? Infra subclasses (FakeTensor, and later FunctionalTensor) are required to have a `_mode_key` hidden on the subclass - so we filter based on which arguments do or don't have a `_mode_key`.
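A minimal Python-level sketch of that filter (the real logic lives in `python_arg_parser.cpp`):
```python
# Sketch: split overloaded tensor-subclass arguments into "user" vs "infra"
# based on the hidden _mode_key attribute described above.
def split_subclasses(overloaded_args):
    user, infra = [], []
    for arg in overloaded_args:
        if getattr(type(arg), "_mode_key", None) is not None:
            infra.append(arg)   # e.g. FakeTensor (later FunctionalTensor)
        else:
            user.append(arg)
    return user, infra
```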
(4) I also changed `DoubleTensor` to `TwoTensor` to minimize confusion (@albanD pointed out that DoubleTensor would be easily confused with `torch.FloatTensor` and friends).
----- original description below -----
The main purpose of this PR is to fix the "ordering problem" between torch_dispatch modes, where we want to ensure that our Fake and Proxy dispatch modes always run **after** any dispatch modes created by the user, regardless of where they are in the stack. See this doc for more details: https://docs.google.com/document/d/1COQ291nOZvtFnzGTQMJqoYZ3sttEYFw_7HbfSyL8gcA/edit
Full set of changes below. I ended up including a few semi-related changes in this PR that I documented - but if folks would rather I separate them out, happy to try to do that.
**(1) Add dedicated TLS slots for FakeTensorMode and ProxyTensorMode**
This is the main component of this PR. There are two new slots, `TorchDispatchModeTLS.fake_mode_` and `TorchDispatchModeTLS.proxy_mode_`, which correspond to a single "global" fake and proxy mode. There is now an invariant that `torchDispatchModeState.stack_` can never contain either of these modes.
I also added a `TorchDispatchModeTLS::maybe_highest_mode()` helper that consults the `stack_` as well as both the proxy and fake slots, and returns the highest priority mode - this is because there are a few places in the codebase where we legitimately want to get the highest priority mode, *including* fake or proxy, if one is set.
This also made the implementations of the existing `disable_proxy_modes_tracing()` and `get_innermost_proxy_mode()` marginally simpler.
**(2) Updated the dispatching logic in handle_torch_function_no_python_arg_parser()**
This is the function that actually figures out which torch_dispatch implementation to call, given the current mode stack and tensor subclass inputs. This function got marginally more complicated as part of the refactor: First we inspect the mode stack and any non-fake subclass inputs. Then we check for the proxy mode slot. Then we check for the Fake mode slot, before finally checking for any fake subclass inputs.
**(3) new python `_get_fake_tensor_mode()` and `_get_proxy_tensor_mode()` API's**
Before, if you wanted to see if proxy or fake modes were active in python, you would have to consult the mode stack. Since these two modes are no longer part of the actual mode stack, I added two new API's to directly check if either proxy or fake modes are active.
**(4) Allow traceable tensor subclasses to access storages from python**
This is convenient later in the stack, where AOTAutograd needs to detect aliasing of inputs and outputs, and those inputs and outputs might be tensor subclasses. Previously, `x.untyped_storage()` would raise an error if `x` was a subclass. In this PR, I tried to relax this constraint as little as possible: `THPVariable_storage()` will only try to return a storage to python if the tensor subclass that you are passing in is "traceable".
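For illustration (a sketch of why subclass storage access matters for alias detection, not the AOTAutograd code itself):
```python
import torch

# Detect whether two tensors share a storage; with this PR, the same check
# also works when a or b is a traceable tensor subclass.
def may_alias(a: torch.Tensor, b: torch.Tensor) -> bool:
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

x = torch.randn(4)
print(may_alias(x, x.view(2, 2)))   # True: a view shares its base's storage
print(may_alias(x, x.clone()))      # False: clone gets fresh storage
```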
**(5) Fixed subclass fakeification**
@wanchaol recently added support to be able to fakeify tensor subclasses. That fakeification logic works in most cases, but there is one case it doesn't handle: autograd metadata. In particular, since autograd sees our tensor subclasses and not their desugared tensors, we need to make sure that our fakeified subclass has the same autograd metadata as the original subclass. I updated `meta_utils.py` to make sure that the autograd metadata is correct.
**(6) make tensor subclasses resizeable**
Previously we didn't allow tensor subclasses to be resizeable. I ran into an issue where fakeifying a tensor subclass occasionally requires swapping out its storage, which can involve resizing the tensor. Mechanically, this required updating `at::for_blob()` to expose a way to request that the tensor that you create has resizeable storage, and then using this new API in `_make_wrapper_tensor()`.
**(7) Added a basic DoubleTensor subclass for testing**
I use this subclass more later in this stack in my AOTAutograd tests - but it serves as a simple subclass example to test the dispatch ordering in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104482
Approved by: https://github.com/ezyang
ghstack dependencies: #107415
Some notable changes:
1. `constrain_as_size` allows the min value to be less than 2, as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we used to runtime assert on the unbacked symint's value range, which would always be [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range, as sketched below.
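A hedged sketch of the new assertion semantics (an illustrative helper, not the actual pass):
```python
# Illustrative only: the runtime assert on an unbacked symint's value range now
# defaults to [0, max] instead of [2, max], unless the user supplies a min.
def assert_value_range(val, user_min=None, max_val=None):
    lo = 0 if user_min is None else user_min      # previously effectively 2
    hi = float("inf") if max_val is None else max_val
    assert lo <= val <= hi, f"expected {lo} <= {val} <= {hi}"

assert_value_range(0, max_val=10)            # now passes without an explicit min
assert_value_range(5, user_min=4, max_val=7)
```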
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
Summary: Basically we generate `CustomOpsNativeFunctions.h` for registering custom ops into the PyTorch JIT runtime. This header needs to hook up with the C++ kernel implementations of all the custom ops. For this reason it should include ATen headers instead of Executorch headers. This PR changes that.
Test Plan: Rely on existing CI jobs
Differential Revision: D48282828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107064
Approved by: https://github.com/kirklandsign
This fixes a bug that could occur with python decompositions.
When an operation is intercepted in PyTorch's C++ code, the outputs are created as `ExclusivelyOwned<at::Tensor>`s. Later on, when it dispatches back to Python for the decomposition, these tensors have their ownership shared with Python. In the normal case, the exclusively owned tensor is released and its value is returned as a non-exclusively-owned tensor from the operation. However, if the Python decomposition throws an error, the `ExclusivelyOwned` wrapper destroys the `at::Tensor`, leaving Python with a reference to a tensor which isn't alive (and meaning PyTorch falls over in debug mode).
Note this will be a performance hit when handling errors.
Fixes #106790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106791
Approved by: https://github.com/ezyang
* Enables PIE807 + PIE810. PIE807 flags reimplementing the `list` builtin with a lambda, and PIE810 requires always fusing `startswith` / `endswith` calls (I applied the autofixes for this before we had ruff enabled).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106218
Approved by: https://github.com/albanD
Proposal of two float8 variants - e5m2 and e4m3 - based on https://arxiv.org/pdf/2209.05433.pdf
Hide all Float8 operator implementations behind `#if !defined(C10_MOBILE)` guard to keep Android build size almost unchanged
TODO:
- Refactor duplicated code
- Cleanup unbalanced pragma pop in dtype utils
- Add native implementation on the CUDA side
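A hedged usage example (dtype names assumed to be `torch.float8_e5m2` and `torch.float8_e4m3fn`; availability depends on the build since the implementations are gated on `C10_MOBILE`):
```python
import torch

x = torch.randn(4)
lo_precision = x.to(torch.float8_e5m2)    # 5 exponent bits, 2 mantissa bits
hi_precision = x.to(torch.float8_e4m3fn)  # 4 exponent bits, 3 mantissa bits

# Upcast back to float32 to inspect the quantization error.
print((x - lo_precision.float()).abs().max())
print((x - hi_precision.float()).abs().max())
```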
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104242
Approved by: https://github.com/albanD
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)
Those were reverted due to a conflict with the internal source repo.
Mostly fixes for PEP-484 violations (i.e., when a default arg is set to None, but the type is not annotated as Optional).
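For example, the pattern being fixed (the function and argument names here are illustrative):
```python
from typing import Optional

# PEP-484 violation: default is None but the annotation is not Optional.
def load_bad(path: str, device: str = None):
    ...

# The fix applied throughout this PR:
def load_fixed(path: str, device: Optional[str] = None):
    ...
```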
Plus few real fixes:
- Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
- Add missing return statement to `torch._export.deserialize_graph`
- Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
- Add assert in `torch/optim/optimizer.py` that an Optional list is not None
TODO (in followup PR):
- Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to `.ci/docker/install_conda.sh` to squash the older libstdc++ from the conda environment in favor of the one from the OS
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where that is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
Based on this [code search](https://fburl.com/code/gjcnw8ly) (*.yaml with `dispatch: CPU:`), update all files found to use
```
kernels:
  - arg_meta: None
    kernel_name:
```
instead of
```
dispatch:
  CPU:
```
---
## Code changes:
- `fbcode/executorch/codegen/tools/gen_oplist.py`
  - Strip ET-specific fields prior to calling `parse_native_yaml_struct`
---
## Files edited that are not `*functions.yaml` or `custom_ops.yaml`
- fbcode/executorch/kernels/optimized/optimized.yaml
- fbcode/executorch/kernels/quantized/quantized.yaml
- fbcode/executorch/kernels/test/custom_kernel_example/my_functions.yaml
---
## Found Files that were not edited
**Dispatched to more than just CPU**
- fbcode/caffe2/aten/src/ATen/native/native_functions.yaml
- xplat/caffe2/aten/src/ATen/native/native_functions.yaml
- xros/third-party/caffe2/caffe2/aten/src/ATen/native/native_functions.yaml
**Grouped ops.yaml path**
- fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/ops.yaml
---
**Design Doc:** https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS#heading=h.8raqyft9y50
Differential Revision: [D46952067](https://our.internmc.facebook.com/intern/diff/D46952067/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46952067/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104070
Approved by: https://github.com/larryliu0820
Summary: Currently we rely on the root operator list, but we also need to check `et_kernel_metadata` for the specialized kernels that are used.
Test Plan: contbuild & OSS CI
Reviewed By: Jack-Khuu
Differential Revision: D46882119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104005
Approved by: https://github.com/Jack-Khuu
Fixes https://github.com/pytorch/pytorch/issues/103132
This is kind of annoying: Functionalization (and also vmap, I think?) manually figures out which ops have C++ CompositeImplicit decomps, and directly registers them to the Functionalize key. This is a problem for the PyDispatcher: We normally want the PyDispatcher to take precedence over the regular dispatcher. But in this case, we have a python decomp registered to `CompositeImplicitAutograd`, and a C++ decomp registered *directly* to the `Functionalize` key, so the C++ decomp gets precedence over the python decomp.
The way this showed up was that a model was running `matmul()` under inference mode, so we never hit the autograd dispatch key, and go straight to the functionalize dispatch key. Matmul has both a python decomp and a c++ decomp, but we were running the C++ decomp. That C++ decomp isn't meant to be used with dynamic shapes, so we were failing with the "tried to call `.sizes()` on a tensor with dynamic shapes" error.
For now, I had the PyDispatcher mimic the behavior of functionalization codegen: when you register a python decomp to the `CompositeImplicitAutograd` key, this PR just automatically registers that decomp to the `Functionalize` key at the same time.
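Conceptually (a hedged sketch of the behavior, not the PR's actual code), the registration now works like this:
```python
# Sketch: a Python decomp registered for CompositeImplicitAutograd is also
# installed as the handler for the Functionalize key, mirroring what the
# C++ functionalization codegen does.
def register_python_decomp(table, op, key, fn):
    table.setdefault(op, {})[key] = fn
    if key == "CompositeImplicitAutograd":
        table[op].setdefault("Functionalize", fn)

decomp_table = {}
register_python_decomp(decomp_table, "aten::matmul", "CompositeImplicitAutograd",
                       lambda a, b: a @ b)
assert "Functionalize" in decomp_table["aten::matmul"]
```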
I'm trying to remember now why we didn't just add `Functionalize` (and all of the other functorch transform keys) directly to the `CompositeImplicitAutograd` alias keyset, but I couldn't remember (@zou3519 any chance you remember?).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103275
Approved by: https://github.com/ezyang, https://github.com/zou3519
At a high level, the current implementation of the constraint functions (`constrain_as_**`) will raise an exception for the following code snippet:
```
def f(x):
    a = x.item()
    constrain_as_size(a, 4, 7)
    return torch.empty((a, 4))

inp = torch.tensor([5])
ep = torch._export.export(f, (inp,))
```
The reason is that the current constraint logic:
1) Is purely Python, so it won't survive AOT export (the node is gone after AOT export, since AOT export only keeps aten-level ops).
2) Utilizes a side effect to add range constraints to the traced symbol's shape env ([code](9591e52880/torch/fx/experimental/symbolic_shapes.py (L370-L372))).
3) If runtime assertions are turned on (the default), [`_AddRuntimeAssertionsForConstraintsPass`](9591e52880/torch/_export/passes/add_runtime_assertions_for_constraints_pass.py (L98-L100)) will try to append assertion nodes based on the range constraints extracted from the symbol's shape env during another interpretation round.
4) However, because of 1), the range-constraint logic won't run for symbols generated during the AOT export round, so no range-constraint information is available for the assertion round, which causes the issue.
5) As a result of the above, it fails at `torch.empty((a, 4))` (there is no constraint saying that `a` must be positive).
The fix here is to implement the range-constraint logic as a native aten op (with a no-op CPU implementation) so that it survives AOT export.
**NOTE:**
[Logic](2d745b95d7/torch/fx/experimental/symbolic_shapes.py (L350-L365C15)) within [`constrain_range`](2d745b95d7/torch/fx/experimental/symbolic_shapes.py (LL313C74-L313C74)) is split out as `constrain_range_int` to handle the case when a non-`SymInt` is passed in, and is reused in the new `_constrain_range`. The reason is that when a non-`SymInt` is provided:
* If it directly calls `sym_constrain_range`, the C++ version will be called which will be no-op.
* So in this case it calls `constrain_range_int` instead, so we can still catch issues like the user providing an input whose tensor shape could be out of range during export, e.g. the following for the code example above:
```
...
inp = torch.tensor([10])
ep = torch._export.export(f, (inp,)) # immediately raise error
```
Differential Revision: [D46734204](https://our.internmc.facebook.com/intern/diff/D46734204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103346
Approved by: https://github.com/tugsbayasgalan
Summary:
This API is used by gen_executorch.py to check whether a kernel with a specified kernel key is used or not.
Test Plan:
```
buck test xplat/caffe2/tools:test_torchgen_executorch
buck run fbcode//executorch/codegen/tools:test_gen_oplist_real_model
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103184
Approved by: https://github.com/larryliu0820
Summary:
keys and change codegen to take ETKernelIndex
We are adding support for dtype and dim order specialized kernel registration. This requires us to reorganize `BackendIndex` (which is a `Dict[DispatchKey, Dict[OperatorName, BackendMetadata]]`) to be `Dict[OperatorName, Dict[ETKernelKey, BackendMetadata]]`. This PR adds new data structures in order to support this change:
* `ETKernelKey` to retrieve a certain kernel from the registry.
* `ETKernelIndex`, the dictionary from operator name to kernel key to kernel mapping.
Note that the codegen logic is not changed yet, we need subsequent diffs to actually generate code for different kernel keys.
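For illustration only, simplified stand-ins for these structures might look like this (names and fields are not the exact torchgen definitions):
```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass(frozen=True)
class ETKernelKey:
    # Encodes the dtype / dim-order specialization (or a default key).
    spec: str = "default"

@dataclass
class ETKernelIndex:
    # OperatorName -> ETKernelKey -> BackendMetadata (here just a kernel name).
    index: Dict[str, Dict[ETKernelKey, str]] = field(default_factory=dict)

    def kernel_for(self, op: str, key: ETKernelKey) -> str:
        return self.index[op][key]
```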
Test Plan: Added tests
Reviewed By: Jack-Khuu
Differential Revision: D46407096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102874
Approved by: https://github.com/Jack-Khuu, https://github.com/kirklandsign
keys and change codegen to take ETKernelIndex
We are adding support for dtype and dim order specialized kernel registration. This requires us to reorganize `BackendIndex` (which is a `Dict[DispatchKey, Dict[OperatorName, BackendMetadata]]`) to be `Dict[OperatorName, Dict[ETKernelKey, BackendMetadata]]`. This PR adds new data structures in order to support this change:
* `ETKernelKey` to retrieve a certain kernel from the registry.
* `ETKernelIndex`, the dictionary from operator name to kernel key to kernel mapping.
Note that the codegen logic is not changed yet, we need subsequent diffs to actually generate code for different kernel keys.
Differential Revision: [D46206339](https://our.internmc.facebook.com/intern/diff/D46206339/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102565
Approved by: https://github.com/Jack-Khuu
Summary:
For each op, we have a List[List[dtype;dim-order]]:
- the inner list contains the `dtype;dim-order` info for each arg if we have a Tensor/TensorList/OptionalTensorList
- the outer list contains the different occurrences of dtype/dim-order combinations for that op in the program
Example:
```
et_kernel_metadata:
  aten::add.out:
    # A list of different dtype/dim-order combinations used in model
    - # Each contains the list of args of Tensor dtype and dim order if applicable
      - FLOAT;0,1
      - FLOAT;0,1
      - NON_TENSOR_ARG
      - FLOAT;0,1
      - FLOAT;0,1
    -
      - INT;0,1
      - INT;0,1
      - NON_TENSOR_ARG
      - INT;0,1
      - INT;0,1
  aten::mul.out:
    - - FLOAT;0,1
      - FLOAT;0,1
      - FLOAT;0,1
      - FLOAT;0,1
```
We don't have the arg name so far; we need to parse the schema (functions.yaml) to get that info. We depend on the order of args from that file.
Test Plan: `buck run fbcode//executorch/codegen/tools:test_gen_oplist_real_model`
Differential Revision: D45551409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100665
Approved by: https://github.com/larryliu0820
The problem:
- The new CustomOp API depends on torchgen.model
- torchgen.model imports `yaml`
- `yaml` is not a PyTorch runtime dependency
To unblock myself, because I'm not sure how long it'll take to
convince people yaml should be a PyTorch runtime dependency
(unless one of you wants to approve #100166), this PR removes the
yaml dependency from torchgen.model.
It does so by splitting torchgen.utils (the offender) into
torchgen.utils (no yaml) and torchgen.yaml (which uses yaml).
Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100203
Approved by: https://github.com/ezyang, https://github.com/Skylion007
This PR introduces a new operator called aten._assert_async.msg, which allows passing a tensor value and an assertion message as inputs. As part of TorchDynamo, we're replacing the use of torch._assert with this new operator so that make_fx also knows how to handle assertions. This is a subset of https://github.com/pytorch/pytorch/pull/98878; refer there for historic reviews.
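A hedged usage sketch of the new overload:
```python
import torch

cond = torch.tensor(True)
# Passes silently for a True value; a False value triggers the assertion with
# this message (asynchronously on CUDA, synchronously on CPU).
torch.ops.aten._assert_async.msg(cond, "expected cond to be True")
```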
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100101
Approved by: https://github.com/jansel
I want to use torchgen to generate code, and my yaml file format is the same as `native_functions.yaml`.
I will use the PrivateUse1, but in my yaml file, I don't want to show PrivateUse1 to the user.
So I want to achieve the following result (e.g. my device is `YPU`):
```
>>>from torchgen.model import DispatchKey
>>>str(DispatchKey.PrivateUse1)
"YPU"
>>>DispatchKey.parse("YPU")
DispatchKey.PrivateUse1
```
I also thought that not everyone would need this feature, so I added a new function to handle this scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99406
Approved by: https://github.com/ezyang
Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor. Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case.
One major piece of this PR is implementing the content hash for tensors. For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times). Using a conventional cryptographic hash and hashing each snapshot would likely take on the order of minutes, which seemed too slow to me. So instead, I implemented a crappy hash function that can be run on GPU. It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor-sum universal hash family. The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs the 32-bit hash five times and cats them together. By the way, this sets the first precedent for a kernel in the PyTorch library which MUST be torch.compile'd to be run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen).
I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum. Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through. xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range.
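A CPU-only, eager-mode sketch of the hashing idea (not the actual torch.compile'd kernel; eager ATen has no xor-sum reduction, hence the explicit loop):
```python
import torch

def xor_sum_hash(data: torch.Tensor, rounds: int = 5, seed: int = 0) -> torch.Tensor:
    """Multiply-shift universal hash over 32-bit words, xor-reduced; run
    `rounds` times with independent random multipliers and stacked, mimicking
    the "run the 32-bit hash five times and cat them" trick described above."""
    words = data.contiguous().flatten().view(torch.int32).to(torch.int64)
    g = torch.Generator().manual_seed(seed)
    digests = []
    for _ in range(rounds):
        a = int(torch.randint(0, 2**30, (1,), generator=g)) * 2 + 1  # random odd multiplier
        h = ((a * words) >> 15) & 0xFFFFFFFF
        acc = torch.zeros((), dtype=torch.int64)
        for v in h:           # xor-sum reduction (a prim in the PR's compiled kernel)
            acc ^= v
        digests.append(acc)
    return torch.stack(digests)

print(xor_sum_hash(torch.randn(16)))
```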
Thanks to Jeff Johnson for letting me bounce ideas off of him on a Saturday morning lol.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809
Approved by: https://github.com/voznesenskym
Enable some sensible flake8-simplify rules. Mainly wanted to enable the SIM101, and `yield from` SIM103 checks. @kit1980 since you wanted to be tagged on this CI check.
Enabling this check also helped flag one logical bug so it's definitely beneficial (also fixed in this PR).
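For illustration, the kind of rewrite these checks enforce:
```python
# Before: redundant isinstance calls and a manual re-yield loop.
def flatten_before(x, items):
    if isinstance(x, int) or isinstance(x, float):   # flagged: merge the calls
        for item in items:
            yield item

# After: merged isinstance tuple and `yield from`.
def flatten_after(x, items):
    if isinstance(x, (int, float)):
        yield from items
```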
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97984
Approved by: https://github.com/ezyang
Summary:
In ATen mode, we add the RuntimeContext arg, so we have something like
```
TORCH_API inline at::Tensor & gelu_outf(torch::executor::RuntimeContext & context, const at::Tensor & self, c10::string_view approximate, at::Tensor & out) {
  return at::gelu_outf(self, approximate, out);
}
```
and the user can call `<namespace like aten>::gelu_outf`, and we will automatically dispatch to the registered function in the aten kernel using `at::gelu_outf` (dispatched via the ATen/Functions.h header).
In the optimized kernel tests, we can now automatically switch between the aten kernel and the optimized kernel.
The implication is that the tests now depend on the correctness of codegen; an error in codegen can break the kernel tests.
Test Plan: CI
Differential Revision: D43777848
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96084
Approved by: https://github.com/larryliu0820
follow-up https://github.com/pytorch/pytorch/pull/93901.
Unexpected numerical mismatches observed in some foreach functions' backward results seemed to be caused by the wrong order of `IndexRangeGenerator::range` calls.
This PR makes `args_with_derivatives` follow the same (or similar) order as `foreach_native_function.func.arguments.flat_non_out`.
---
what the current master generates for `_foreach_mul.List`:
```cpp
variable_list ForeachMulBackward0List::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);
  TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE);
  TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE);
  IndexRangeGenerator gen;
  auto other_ix = gen.range(other_size_);
  auto self_ix = gen.range(self_size_);
  variable_list grad_inputs(gen.size());
  auto other = unpack_list(other_);
  auto self = unpack_list(self_);
  if (task_should_compute_output({ other_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type()));
    }
    copy_range(grad_inputs, other_ix, grad_result);
  }
  if (task_should_compute_output({ self_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type()));
    }
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}
```
with this PR the generated backward is
```cpp
variable_list ForeachMulBackward0List::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);
  TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE);
  TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE);
  IndexRangeGenerator gen;
  auto self_ix = gen.range(self_size_);    // <----- diff
  auto other_ix = gen.range(other_size_);  // <----- diff
  variable_list grad_inputs(gen.size());
  auto self = unpack_list(self_);
  auto other = unpack_list(other_);
  if (task_should_compute_output({ other_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type()));
    }
    copy_range(grad_inputs, other_ix, grad_result);
  }
  if (task_should_compute_output({ self_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type()));
    }
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}
```
The change is to fix the order of `self_ix` and `other_ix`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95263
Approved by: https://github.com/soulitzer
Changes:
1. Use class inheritance for `torch/return_types.pyi`:
Before:
```python
max = NamedTuple("max", [("values", Tensor), ("indices", Tensor)])
```
After:
```python
class max(NamedTuple):
    values: Tensor
    indices: Tensor
```
------
2. Add missing spaces in generated type annotations.
1. Always has a space after `,`.
2. If an argument is annotated, then there should be spaces around `=` when it has a default value.
```diff
- def func(..., out: Optional[Tensor]=None, ...) -> Tensor:
+ def func(..., out: Optional[Tensor] = None, ...) -> Tensor:
```
3. If an argument is not annotated, then there should be no spaces around `=` when it has a default value.
```python
def contiguous(self, memory_format=torch.contiguous_format) -> Tensor: ...
```
------
3. ~Remove redundant import alias in `torch/nn/functional.pyi`:~ (Reverted)
UPDATE: `mypy` needs the alias to work.
Before:
```python
from .. import conv1d as conv1d
from .. import conv2d as conv2d
from .. import conv3d as conv3d
from .. import conv_transpose1d as conv_transpose1d
from .. import conv_transpose2d as conv_transpose2d
from .. import conv_transpose3d as conv_transpose3d
from .. import conv_tbc as conv_tbc
from .. import avg_pool1d as avg_pool1d
from .. import relu_ as relu_
from .. import selu_ as selu_
from .. import celu_ as celu_
from .. import rrelu_ as rrelu_
from .. import pixel_shuffle as pixel_shuffle
from .. import pixel_unshuffle as pixel_unshuffle
from .. import channel_shuffle as channel_shuffle
from .. import native_channel_shuffle as native_channel_shuffle
from .. import pdist as pdist
from .. import cosine_similarity as cosine_similarity
```
After:
```python
from .. import (
    conv1d,
    conv2d,
    conv3d,
    conv_transpose1d,
    conv_transpose2d,
    conv_transpose3d,
    conv_tbc,
    avg_pool1d,
    relu_,
    selu_,
    celu_,
    rrelu_,
    pixel_shuffle,
    pixel_unshuffle,
    channel_shuffle,
    native_channel_shuffle,
    pdist,
    cosine_similarity,
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95877
Approved by: https://github.com/ezyang
Fixes for PyTorch/XLA functionalization integration
---
Some notable changes include:
- More asserts in `FunctionalTensorWrapper`, so bugs show up more cleanly in cases where we e.g. forget to wrap an output
- Make the *_scatter ops `CompositeExplicitAutogradNonFunctional`, so we get a better error message and XLA doesn't accidentally try to use them
- Fix LTC/XLA codegen in core to handle multi-tensor out= ops with no returns
- Better erroring: Allow XLA to use the CPU fallback from core in a way so that it always errors on view ops, which XLA should no longer see.
- Update MetaConverter to exclude XLA tensors in raising NotImplemented…
- Add `_propagate_xla_data` op
- Add meta tensor support for some ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94537
Approved by: https://github.com/bdhirsh
Fixes #91694, fixes #92615
Several transpositions were missing in the backward graph in the case of `batch_first=True`. Issue #91694 does not reproduce with `batch_first=False`.
After fixing the transpose issue, I thought I could finally use LSTM freely in my project, and then I got horrific results during training. Seems related to #92615.
After that I decided to fix LSTM's backward step completely. I collected all my findings in this thread; it seems like I succeeded.
Funny enough, backward tests were completely disabled before and were not passing:
```python
@unittest.skipIf(True, "Backward of lstm returns wrong result")
def test_lstm_2(self, device="mps", dtype=torch.float32):
```
UPD: the forward pass of the multi-layer version was also wrong due to incorrect `initState, initCell` slices. Tests were passing because the states were initialized with zeros. *Accidentally* fixed this too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95137
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/soulitzer
Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the `set` call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
I applied some flake8 fixes and enabled checking for them in the linter. I also enabled some checks for my previous comprehensions PR.
This is a follow up to #94323 where I enable the flake8 checkers for the fixes I made and fix a few more of them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94601
Approved by: https://github.com/ezyang
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
    def __init__(...):
        if option_string.startswith('--'):
            option_string = '--no-' + option_string[2:]
            _option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift or Caps Lock key, unlike `-`.
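A small sketch of keeping both spellings (the argument name here is illustrative):
```python
import argparse

parser = argparse.ArgumentParser()
# Register the dash form as primary and keep the underscore form as an alias
# for backward compatibility.
parser.add_argument("--master-port", "--master_port", type=int, default=29500)

args = parser.parse_args(["--master_port", "1234"])
print(args.master_port)   # argparse stores the value under the underscore dest
```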
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
Changes:
1. `typing_extensions -> typing-extensions` in dependencies. Use a dash rather than an underscore to fit the [PEP 503: Normalized Names](https://peps.python.org/pep-0503/#normalized-names) convention.
```python
import re
def normalize(name):
    return re.sub(r"[-_.]+", "-", name).lower()
```
2. Import `Literal`, `Protocol`, and `Final` from the standard library as of Python 3.8+
3. Replace `Union[Literal[XXX], Literal[YYY]]` to `Literal[XXX, YYY]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94490
Approved by: https://github.com/ezyang, https://github.com/albanD
# Summary
This PR creates _flash_attention_backward and _scaled_dot_product_flash_attention_backward native functions and registers them to the respective derivatives.yaml.
The goal is to replicate the torch.autograd.Function defined in the FlashAttention repo [here](33e0860c9c/flash_attn/flash_attn_interface.py (L126)) natively in PyTorch. One thing we don't have access to in native PyTorch is ctx.save_for_backward, so in order to save these variables I extended the objects returned from the forward functions.
### MetaFunctions
I also updated the FlashAttention meta functions to mirror the real outputs. I also added a meta registration for backwards. I have an XLMR training script, and while eager training now works with FlashAttention, compiling this module fails with the inductor error below.
### Questions?
Performance issues vs mem efficient when using torch.nn.mha_forward
TorchCompile -> See proposed solution below.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917
Approved by: https://github.com/cpuhrsch
Summary: Handwritten out ops should have feature parity with the codegen'd ones. This means they should resize out to the appropriate size. Q. Why are these handwritten instead of codegen'd anyway? Q2. Where's a good spot to put the resize and copy helpers, since they are reused in the codegen'd out kernels?
Test Plan: ci.
Differential Revision: D42177051
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91194
Approved by: https://github.com/ezyang
Rely on CI.
Avoid issues such as:
```
Traceback (most recent call last):
File "<string>", line 38, in <module>
File "<string>", line 36, in __run
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 690, in <module>
main()
File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 626, in main
parsed_yaml, custom_ops_parsed_yaml = parse_yaml_files(
File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 505, in parse_yaml_files
translate_native_yaml(
File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 448, in translate_native_yaml
for e in native_es:
TypeError: 'NoneType' object is not iterable
```
Differential Revision: [D42729435](https://our.internmc.facebook.com/intern/diff/D42729435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92938
Approved by: https://github.com/JacobSzwejbka
As titled. To register a custom op into Executorch, we need:
* `custom_ops.yaml`, defines the operator schema and the corresponding native function.
* `custom_ops.cpp`, defines the kernel.
* `RegisterDispatchKeyCustomOps.cpp`, a template to register operator into PyTorch.
Added a new test for custom ops. The custom op `custom::add_3.out` takes 3 tensors and add them together. The test makes sure it is registered correctly and then verifies the outcome is correct.
Differential Revision: [D42204263](https://our.internmc.facebook.com/intern/diff/D42204263/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91291
Approved by: https://github.com/ezyang
It turns out that we *do* need to update the *_scatter ops to return the exact same strides as their inputs. I added a test to `test/test_functionalization.py`, which now trips thanks to Ed's functionalization stride debugging check. It only actually ends up causing silent correctness issues if you try to .backward() through that function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91029
Approved by: https://github.com/ezyang
## Job
Test running on most CI jobs.
## Test binary
* `test_main.cpp`: entry for gtest
* `test_operator_registration.cpp`: test cases for gtest
## Helper sources
* `operator_registry.h/cpp`: simple operator registry for testing purpose.
* `Evalue.h`: a boxed data type that wraps ATen types, for testing purpose.
* `selected_operators.yaml`: operators Executorch cares about so far; we should cover all of them.
## Templates
* `NativeFunctions.h`: for generating headers for native functions. (not compiled in the test, since we will be using `libtorch`)
* `RegisterCodegenUnboxedKernels.cpp`: for registering boxed operators.
* `Functions.h`: for declaring operator C++ APIs. Generated `Functions.h` merely wraps `ATen/Functions.h`.
## Build files
* `CMakeLists.txt`: generate code to register ops.
* `build.sh`: driver file, to be called by CI job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89596
Approved by: https://github.com/ezyang
## Logic to handle custom ops
We generate files for custom ops, so that they can be registered into PyTorch.
Generated files:
* `Register{dispatch_key}CustomOps.cpp` (dispatch_key = CPU), it's basically the same as vanilla PyTorch `RegisterCPU.cpp`. The only difference is that we bind to native functions directly.
* `Register{dispatch_key}Stub.cpp` (dispatch_key = CPU), register placeholder kernels for custom ops. Only used when there's no custom op kernel available.
As an example:
```cpp
namespace {
at::Tensor & wrapper_out_unsqueeze_out(const at::Tensor & self, int64_t dim, at::Tensor & out) {
  // No device check
  // DeviceGuard omitted
  return torch::executor::native::unsqueeze_out(self, dim, out);
}
} // anonymous namespace

TORCH_LIBRARY_IMPL(aten, CPU, m) {
  m.impl("unsqueeze.out",
         TORCH_FN(wrapper_out_unsqueeze_out));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90099
Approved by: https://github.com/ezyang
This PR adds `unboxing.py` which converts a `EValue` (similar to `IValue`) to its corresponding C++ type, based on the `ExecutorchCppSignature`.
Added unit tests to it in `test_executorch_unboxing.py`. Notice that this unboxing logic should work for both ATen types and Executorch types, hence the unit tests are parametrized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90098
Approved by: https://github.com/ezyang
This makes it easier to narrow down who is throwing the error,
instead of having to use gdb.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: [D42088781](https://our.internmc.facebook.com/intern/diff/D42088781)
Retry of #90591, which is a retry of #89595. Reverted due to dependency PR breaking internal fbcode.
## Forked BaseCppType
Created a module for Executorch: `torchgen.executorch`.
## In `torchgen.executorch.api.types.types`:
* Define `BaseCppType` with `torch::executor` namespace.
## In `torchgen.executorch.api.et_cpp`:
* Help generate `NamedCType` for `ExecutorchCppSignature` arguments.
## In `torchgen.executorch.api.types.signatures`:
* Define the signature using these types. (`ExecutorchCppSignature`)
## In `torchgen.executorch.api.types.__init__`:
* Suppress flake8 error for `import *`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90781
Approved by: https://github.com/ezyang
Retry of #90590, which is a retry of #89594. Original PR reverted due to internal breakage.
This PR fixes the breakage by adding a default value to the new argument.
This PR allows `get_native_function_declarations` API to take a function as argument. This function should take `NativeFunction` as input and emit code for native function declaration. By default it is `dest.compute_native_function_declaration`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90780
Approved by: https://github.com/ezyang
Retry of #89595. Accidentally closed.
## Forked `BaseCppType`
Created a module for Executorch: `torchgen.executorch`.
In `torchgen.executorch.api.types.types`:
* Define `BaseCppType` with `torch::executor` namespace.
In `torchgen.executorch.api.et_cpp`:
* Help generate `NamedCType` for `ExecutorchCppSignature` arguments.
In `torchgen.executorch.api.types.signatures`:
* Define the signature using these types. (`ExecutorchCppSignature`)
In `torchgen.executorch.api.types.__init__`:
* Suppress flake8 error for `import *`.
Differential Revision: [D41501836](https://our.internmc.facebook.com/intern/diff/D41501836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90591
Approved by: https://github.com/iseeyuan
A retry of #89487. Accidentally closed.
## Split `torchgen.api.types` into `types_base`, `types` and `signatures`.
In `types_base`:
* Created base class `CType`. `BaseCType` and `ConstRefCType` etc are inheriting `CType`.
* Only keep abstract type model definitions, such as `BaseCppType`.
In `types`:
* Define `BaseCppType` with `at` and `c10` namespaces.
* All the signatures using these types.
In `signatures`:
* Define all the signatures.
In `__init__`:
* `from ... import *`, suppress flake8 error.
Differential Revision: [D41455634](https://our.internmc.facebook.com/intern/diff/D41455634/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41455634/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90589
Approved by: https://github.com/iseeyuan
The idea is to add a custom handler to Functionalize key in Python
dispatcher that runs the functionalized version along side a non
functionalized version, and checks that their outputs agree in the
end. (Technically, for metadata mutation we should also check the
inputs, but for now we're relying on those functions returning self.)
I turned this on for test_functionalize.py (new TestCrossRefFunctionalize)
and found a bunch of failures that look legit.
This probably doesn't interact that nicely if you're also tracing at
the same time, probably need more special logic for that (directly,
just disabling tracing for when we create the nested fake tensor mode,
but IDK if there's a more principled way to organize this.)
There are some misc fixups which I can split if people really want.
- xfail_inherited_tests moved to test common_utils
- Bindings for _dispatch_tls_set_dispatch_key_included,
_dispatch_tls_is_dispatch_key_included and _functionalization_reapply_views_tls
- Type stubs for _enable_functionalization, _disable_functionalization
- all_known_overloads utility to let you iterate over all OpOverloads
in all namespaces. Iterator support on all torch._ops objects to let
you iterate over their members.
- suspend_functionalization lets you temporarily disable functionalization mode
in a context
- check_metadata_matches for easily comparing outputs of functions and see
if they match (TODO: there are a few copies of this logic, consolidate!)
- _fmt for easily printing the metadata of a tensor without its data
- _uncache_dispatch for removing a particular dispatch key from the cache,
so that we force it to regenerate
- check_significant_strides new kwarg only_cuda to let you also do stride
test even when inputs are not CUDA
- Functionalize in torch._C.DispatchKey
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89498
Approved by: https://github.com/malfet
# Summary
Creates a callable native function that can determine which implementation of scaled dot product will get called. This allows to bump re-order the runtime dispatch of SDP to enable autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89029
Approved by: https://github.com/cpuhrsch
This PR teaches PyDispatcher and PyOperator about functorch transforms.
It is important that PyDispatcher/PyOperator dispatch with functorch
transforms, because this is our plan for higher-order operators
(operators that accept functions as arguments). Examples of these
include:
- functorch transforms over the existing cond operator (control flow)
- autograd.Function support for functorch (which I am working towards),
- AOTDispatcher (should be a higher order operator)
Concretely, the problem with teaching PyDispatcher/PyOperator about
functorch is that the stack-based dispatching logic (DynamicLayerStack)
is hidden inside the fallbacks for two dispatch keys
(DynamicLayer{Front, Back}). PyDispatcher doesn't know about C++ boxed
fallbacks, our plan on record for that is that we need to reimplement
all of them in Python (but can call helper functions in C++ to make our
lives easier).
Instead of exposing all of what DynamicLayer{Front, Back} do to python,
this PR takes the approach of re-implementing part of the stack-based
dispatching in Python. The motivation is that this is more sane and
follows what the "ideal" implementation of functorch would have been:
- each transform should be a "mode"
- there should be no TLS dispatch key set hackery. functorch needs to do
this hackery today to re-use VariableType implementations.
This PR:
- exposes the DynamicLayerStack to Python
- The DynamicLayerStack is a stack of Interpreters.
These get exposed to Python as well.
- Interpreters can run operations (Interpreter.process) or lower them to
the next interpreter in the stack (Interpreter.lower)
- To use a PyOperator with functorch transforms, a developer needs to
register a rule for each transform (vmap, grad, jvp, ...).
- The PyOperator API is NOT user-facing. Things like autograd.Function
support for functorch will end up going through the autograd.Function
API.
Question for reviewers:
- Does this design make sense?
- I'm trying to split up the "functorch support for autograd.Function"
work into logical pieces. Would it be better if I didn't? (the full
thing is a bit long - 1000-2000 LOC).
Test Plan:
- new tests that construct PyOperator and compose them with functorch
transforms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88785
Approved by: https://github.com/samdow, https://github.com/soulitzer
As part of the ongoing LTC migration effort, PyTorch/XLA is updating its codegen to use `xla::Shape` instead of `torch::lazy::Shape`. To achieve this, this PR updates the codegen to make the `GenLazyNativeFuncDefinition` generator customizable.
The existing `GenLazyNativeFuncDefinition` is kept by using the initial default values, so this change should not introduce any new behaviors to the existing codegen in PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87823
Approved by: https://github.com/alanwaketan, https://github.com/wconstab
Summary:
Sometimes we want to extend an existing custom namespace library, instead of creating a new one,
but we don't have a namespace config right now, so we hardcode some custom libraries defined
in pytorch today, i.e. quantized and quantized_decomposed
Test Plan:
ci
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88229
Approved by: https://github.com/ezyang
This fixes an issue with mobile: The output of view_copy ops should always be contiguous.
Later, we can consider adding optional arguments to the `view_copy()` functions to let you explicitly say what the contiguity of the output can be (e.g. channels_last)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85747
Approved by: https://github.com/ezyang
The context is that historically, XLA/LTC tensors haven't had accurate stride information, and functionalization would run "reference" meta kernels for view ops on the side to properly compute strides.
This is more complicated in symint tracing world - we have a `FunctionalTensorWrapper()` that wraps the underlying tensor and has its own set of sizes/strides metadata, but we never create proxy objects for the sizes/strides of the wrapper.
In the symint tracing world with aot autograd, we're guaranteed that our underlying strides are accurate anyway, since aot autograd uses fake tensors to perform tracing. We encountered a few bugs with symints from the `FunctionalTensorWrapper` making their way into `__torch_dispatch__`. To side-step that area of bugs completely (and marginally improve perf), this PR disables the meta tensor tracing for non XLA/LTC use cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87108
Approved by: https://github.com/ezyang, https://github.com/wconstab
Our prevailing strategy for symbolic shapes in C++ is to only
write the SymInt version of the code, and pay a slight performance
tax from not knowing if it is symbolic or not. However, there are
some fastpath functions where this tax is unacceptable, and we want
to specialize for the int case. Sometimes, it is easy to template
the function; but when the function involves Tensors, it is not,
because the functions you may want to call are not templated,
e.g., t.view vs t.view_symint
This PR adds an at::symint:: namespace which contains templated
functions for all functions in PyTorch which you can use in this
way. To show this works, I refactored sum_to to stop incorrectly
reinterpret casting and instead use a template. Instead of
t.sizes(), we call at::symint::sizes<T>(t), and so forth.
The template functions are SFINAE'd using a template argument that
is not otherwise used. As such, deduction is impossible. Typically, deduction
is hard anyway, because many of the constructors are ambiguous (this
is why we split foo and foo_symint in the first place). So you must pass
a template argument to these functions.
These functions are codegened into Functions.h so they are subject
to per-operator headers. This matters most for methods, which likely
didn't include the per-operator header, so you will have to add an
include in that case. We never generate method variants for these.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86329
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
Summary:
The test is causing issues:
```
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%A: Tensor, %driver: str?):
%bias: None = prim::Constant()
%ret = aten::linalg_svdvals(%A, %driver)
~~~~ <--- HERE
%cloned = aten::clone(%ret, %bias)
return (%cloned)
RuntimeError: torch.linalg.svd: keyword argument `driver=` is only supported on CUDA inputs with cuSOLVER backend.
```
Just block the op and re-run the codegen script to remove everything and update the generated ops.
Test Plan: Existing tests
Differential Revision: D39973860
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85983
Approved by: https://github.com/xuzhao9, https://github.com/tenpercent
Now, we also avoid translating SymInt to valueT if you haven't asked
for a SymInt implementation. This makes embedding_dense_backward
work without changes to LTC.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86043
Approved by: https://github.com/wconstab
- Make toIValue accept SymIntNode and SymFloatNode where number (aka Scalar) is
expected
- Binding for symintlistOptional in python arg parser
- Teach translate to convert from IntArrayRef to ArrayRef<int64_t>
- Don't query _symint function for meta info in LTC unless LTC is
code generating a symint function
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86042
Approved by: https://github.com/Chillee
Partially fixes: #66328
This PR:
- adds support for `ITensorList` to the dispatcher for:
- computing the dispatch key
- boxing and unboxing `ITensorList`
- modified the codegen for structured kernels:
- codegen APIs use `ITensorList` instead of `ArrayRef<Tensor>`
**Changes summary:**
- Signature changes due to the different APIs:
- dispatcher API (e.g. `BatchingRegistrations.cpp`)
- C++ API (e.g. `TensorShape.cpp`)
- Miscellaneous functions used by codegen'd functions (e.g. `FunctionalTensorWrapper.*`)
- Dispatcher changes for handling `ITensorList` correctly (e.g. `DispatchKeyExtractor.h`)
- Signature changes of `at::cat` due to the need for `const` inside `TensorBody.h` (see the sketch after this list)
- Forward declarations of `ITensorList` (e.g. `MethodOperators.h`)
- Codegen changes, special casing structured kernels (e.g. `gen.py`)
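As a hedged sketch of the kind of signature change summarized above (the function name, argument list, and the `at::` qualification of `ITensorList` are assumptions for illustration, not copied from the real headers):
```cpp
#include <ATen/ATen.h>

// Before: lists of tensors were passed as ArrayRef<Tensor>.
at::Tensor concat_like_before(at::ArrayRef<at::Tensor> tensors, int64_t dim);

// After: codegen'd APIs take the ITensorList type named in this summary,
// passed as const& (the `const` requirement mentioned for TensorBody.h).
at::Tensor concat_like_after(const at::ITensorList& tensors, int64_t dim);
```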
**Short description of structured kernels special casing:**
I introduced, mainly, 5 types of changes to the codegen for generating code depending on
whether the kernel is structured or not:
1. Added a `structured_type_override` flag to the `argument_type` function definition of
the affected APIs (mainly the dispatcher and C++ APIs).
- `api/cpp.py`, `api/dispatcher.py`, `api/native.py`
2. Added a `structured_type_override` member to the signature
classes (e.g. `CppSignature`), since `FunctionSchema` doesn't really know whether the
function is structured or not
- `api/types.py`
3. Added a `part_of_structured_group` function to the `NativeFunction` class, which is just a
convenient way to forward to `structured_type_override` wherever needed
- `model.py`
4. Appropriately changed the rest of the codegen, whenever it used either the signature
classes or the `arguments` function directly
5. Added a check for `const ITensorList&` type wherever there was a check for `TensorList`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73350
Approved by: https://github.com/bdhirsh
Instead of calling into the Python dispatcher for EVERY dispatcher
call, we now have a two step process. First, we
getattr(op: OpOverload, dispatch_key) to "load" the handler for the
function. This can either be a conventional function (in which
case we will call it, in the same way the old Python dispatcher
worked), or it can be a DispatchKey, in which case we will directly
call that DispatchKey in C++, bypassing marshalling between Python
and C++ entirely. OpOverload.__getattr__ is carefully written so
that it caches the resolved handler, meaning the lookup only happens
once per overload and dispatch key.
A further optimization would be to define __slots__ on OpOverload,
and to ensure that the DispatchKey strings are interned.
The resulting Python dispatcher is less flexible: after the first
lookup, the handler is cached and we won't recompute it. Furthermore,
by default, dispatches will not go into Python, and so you won't
get stack frames for the Python dispatcher by default. But we get
a huge performance improvement: on the following microbenchmark
we go from 2.5s to 1.9s.
```
import time
import torch
from functorch import make_fx
def f(x):
    for i in range(1000):
        x = x * x
    return x
begin = time.time()
res = make_fx(f, tracing_mode="symbolic")(torch.randn(10, 20))
print(time.time()-begin)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85133
Approved by: https://github.com/wconstab
Summary:
Like it says in the title. Currently, the codegen embeds the absolute path of the template in the generated output (see the concrete example below).
In Buck1, that's OK, because Buck1's caching doesn't really care too much about it.
However, in Buck2, this is a disaster, because caching is based exclusively
on inputs and outputs, and that absolute path changes from build to build.
The diff here proposes making the path relative to the codegen script itself,
which should carry about as much info, but avoid cache misses.
Concretely, this:
```
// generated from /dev/shm/uid-34135/cfbc5712-seed-nspid4026533424_cgpid2794673-ns-4026533443/tools/autograd/templates/python_functions.h
```
Becomes, this:
```
// generated from ../tools/autograd/templates/python_functions.h
```
So, we keep the useful part, and we get caching. This matters because those
headers are used in actions like:
```
fbcode//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops -- action (cxx_compile gen_embedding_backward_adam_split_unweighted_cuda.cu (pic))
```
Those actions take upwards of 5 minutes to finish, so by allowing a cache hit,
we are a) saving our users a lot of time and b) saving some RE capacity as
well.
This actually matters a lot because right now those targets are produced by
`//caffe2:generate-code`, which itself doesn't get cache hits from RE because
`generate_code.par` is non-deterministic (this is, unfortunately, true of PARs
in general), so that rule introduces non-determinism that the codegen
propagates and we get zero caching.
This diff doesn't fix `//caffe2:generate-code`'s inputs being
non-deterministic, but it does fix its *outputs* being non-deterministic, which
means the non-determinism stops there, and we get back to cache hits.
Test Plan:
- CI
```
buck2 build fbcode//caffe2:generate-code
buck2 build fbcode//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops
```
Reviewed By: ndmitchell
Differential Revision: D39348565
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84695
Approved by: https://github.com/soulitzer
Since we separated at::foo and at::foo_symint, there is no benefit
to trying to make initializer lists work in both cases, so we can
get rid of the separate special-case struct.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84837
Approved by: https://github.com/kit1980
This fixes two problems:
- First, shape signature didn't respect the symint property (so it
would always mark the operator as symint). This was relatively
easy to fix.
- Second, the call to fallback goes directly through at::_ops, so
it must always be SymInt-aware, even if SymInt is disabled externally.
This was a bit more difficult, because the current LTC codegen
is poorly factored. First, I needed to make it so individual
arguments knew if they were going to be SymInt in LTC or not; second,
I need to plumb enough information about the enclosing bindings so
that I could use translate to do the expressions (previously, it was
just assumed the signatures matched.)
The LTC codegen would do well to have a complete rewrite, but this will
have to do for now, I suppose.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84832
Approved by: https://github.com/wconstab
Summary: Context: currently `static_dispatch` assumes that given a native function `f`, we always want to map from its `DispatchSignature` to its `CppSignature`. This assumption may not hold true for some use cases, where the source bindings may not come from its `DispatchSignature`. Here I'm changing the argument `sig: DispatcherSignature` to `sig: Union[CppSignature, DispatcherSignature]`, and also removing the unused `f`.
Test Plan: Rely on added unit test.
Differential Revision: D39192969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84384
Approved by: https://github.com/iseeyuan
Something people found confusing was that whether a native::
signature would get SymInt in its type was based on the dispatch
key. This changes it so that SymInt in the type is based on whether
or not the kernel name contains _symint. This means
that even when we make operators support SymInt, you no longer have to
go and update all the preexisting definitions; instead, you now
selectively write _symint to opt individual kernels into SymInt support.
I then go and update a bunch of kernels that don't have proper SymInt
support to make use of this convention. There is some hacking around
for view generation code.
I also add support for external backends to specify 'symint' operators, for which we generate SymInt signatures instead of regular signatures.
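As a hedged illustration of the convention (the operator and kernel names below are hypothetical), the `_symint` suffix on the kernel name, rather than the dispatch key, is what opts a `native::` declaration into SymInt types:
```cpp
#include <ATen/ATen.h>

namespace at {
namespace native {

// Kernel registered in native_functions.yaml without a _symint suffix:
// the generated declaration keeps plain int64_t arguments.
at::Tensor my_op_cpu(const at::Tensor& self, int64_t start, int64_t length);

// Kernel registered with a _symint suffix: the same operator is declared
// with c10::SymInt arguments instead.
at::Tensor my_op_cpu_symint(const at::Tensor& self, c10::SymInt start, c10::SymInt length);

} // namespace native
} // namespace at
```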
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: [D39310060](https://our.internmc.facebook.com/intern/diff/D39310060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84579
Approved by: https://github.com/wconstab
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"
Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9
Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159
Also update Metal registrations for C++ registration changes.
Also update NNPI registration to account for tightened schema checking
Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.
This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.
This is BC-breaking in the following ways:
* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.
This is not BC-breaking in the following ways:
* The user-facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints (e.g., at::empty(IntArrayRef, ...)). To call with SymInts, you must call at::empty_symint instead (see the sketch after this list). This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints, so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types); as long as you're not doing string equality (which you shouldn't be), these parse to the same underlying type.
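A hedged sketch of what the two C++ entry points look like from a caller's perspective (shapes and options here are purely illustrative):
```cpp
#include <ATen/ATen.h>

int main() {
  // Default binding: unchanged, still takes plain ints (IntArrayRef).
  at::Tensor a = at::empty({2, 3}, at::kFloat);

  // SymInt binding: the separate *_symint entry point. With concrete ints
  // wrapped into SymInts this behaves the same as the call above; under
  // symbolic tracing the sizes could instead be backed by symbolic values.
  c10::SymInt m{2};
  c10::SymInt n{3};
  at::Tensor b = at::empty_symint({m, n}, at::TensorOptions().dtype(at::kFloat));
  return 0;
}
```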
Structure of the PR:
* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
* The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
* When we do schema validation of C++ operator registration, we must compare against the true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`). This is handled with cloneWithRealTypes before we check for schema differences.
* In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
* In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where there is work to do. Finally, because the signature of the `native::` API changed from int to SymInt, I needed to find alternative APIs for people who were directly calling these functions. Typically, I insert a new dispatch call when perf doesn't matter, or use the `at::compositeexplicitautograd` namespace to handle other cases.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
Addresses `Wc++98-compat-extra-semi` warning from https://github.com/llvm/torch-mlir/issues/1264 by removing extraneous semicolon after autogen LTC native function definitions.
```
/home/runner/work/torch-mlir/torch-mlir/build/tools/torch-mlir/python/torch_mlir/csrc/base_lazy_backend/generated/LazyNativeFunctions.cpp:4241:6: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
};
^
```
cc: @wconstab @desertfire @ke1337 @antoniojkim
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83955
Approved by: https://github.com/wconstab
Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718
Approved by: https://github.com/albanD
This avoids extra round trips, and avoids the confusion of having to manually pull in the latest copy of the shape_functions.py file.
This also fixes the cases where people pull in the wrong version of the file, which can happen when developers run `python setup.py install` instead of `python setup.py develop` to generate their current copy of PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83629
Approved by: https://github.com/davidberard98
Add is_symint_like, by way of is_base_ty_like which generalizes
the pattern for is_tensor_like and is_generator_like. Now that
we can query if a signature contains a SymInt, we can enforce that
you must name the overload with SymInt if the signature contains
SymInt.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83668
Approved by: https://github.com/bdhirsh, https://github.com/larryliu0820
- nondeterministic_seeded was not applied to enough functions. I added
some heuristics to codegen for identifying functions that are likely
to be random and added a bunch of these tags to functions. Not sure
I got all of them.
- Don't constant propagate through nondeterministic functions in FX
tracing.
It would be better to do some testing for the tag but this would be quite an effort.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83650
Approved by: https://github.com/bdhirsh, https://github.com/eellison
Summary:
We started to see use cases where more than one custom namespace needs to live within the same yaml file. Hence we are relaxing the restriction that one yaml file can only have one custom namespace other than `aten`. Updated the unit test as well.
Differential Revision: D38775685
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83580
Approved by: https://github.com/JacobSzwejbka
Extending the current regex in `model.py` to support the annotation alias set. See issue #83214.
Ideally we should have a full-fledged lexer similar to `schema_type_parser.cpp`, since the regex becomes more and more difficult to read as we add more support to it.
Adding this to unblock the issue for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83501
Approved by: https://github.com/SherlockNoMad
Summary:
Previously we didn't generate an out variant (both schema and kernel) for an operator with only a functional variant. This adds support for that and adds tests.
## Changes on `native_function_generation.py`
We are generating out variants for all functional variants when possible. This PR introduces a lot of newly generated out variants, and `native_functions.yaml` needs to incorporate the changes by adding `autogen` keywords.
The logic for determining what operators we should generate an out variant for is the following:
1. No existing out variant for this `NativeFunction`
2. Contains an existing in place, mutable or functional variant
3. Contains at least one tensor-like return
For operators matching the first two conditions but failing the third, I listed them in `FUNCTIONAL_OPS_THAT_CANNOT_GET_AN_OUT_VARIANT`.
## Special handling
The following operators satisfy all 3 criteria above but we chose to not autogen them, with some reasons.
* `mkldnn_adaptive_avg_pool2d`: the generated out variant `mkldnn_adaptive_avg_pool2d.out` collides with the `mkldnn_adaptive_avg_pool2d_out` kernel of the `adaptive_avg_pool2d.out` operator. I manually created `mkldnn_adaptive_avg_pool2d.out` and renamed `mkldnn_adaptive_avg_pool2d_out` to `mkldnn_adaptive_avg_pool2d_out_stub`.
* `min`, `max` and `mean`. There already exist `min.out`, `max.out` and `mean.out`, but they have different semantics from the functional ones. I manually created `min.unary_out`, `max.unary_out` and `mean.dtype_out` to disambiguate.
## Autograd Changes
We introduced logic to not match derivatives info in `derivatives.yaml` to the out variant, since we are generating `NOT_IMPLEMENTED` kernels for those out variants anyway. The issue we are seeing with the original logic is that it doesn't handle `TensorOptions` arguments really well. For example, we have these two operators:
* `_to_copy(Tensor self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, bool non_blocking=False, MemoryFormat? memory_format=None) -> Tensor`
* `_to_copy.out(Tensor self, *, bool non_blocking=False, MemoryFormat? memory_format=None, Tensor(a!) out) -> Tensor(a!)`
If we use `_to_copy`'s derivative info, there will be a compilation error since `dtype` is missing from the `_to_copy.out` signature.
Test Plan: Rely on unit test
Differential Revision: D37832342
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81437
Approved by: https://github.com/iseeyuan, https://github.com/bdhirsh
Summary: Currently `SelectiveBuilder` hardcodes the `aten` namespace for operators. This no longer works since operators started to have custom namespaces. This fixes it.
Test Plan: Rely on newly added unit test
Differential Revision: D38565527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83141
Approved by: https://github.com/JacobSzwejbka
`derivatives.yaml` can now take a `dispatch` entry which registers per-autograd dispatch key derivatives such as
```
name: foo(Tensor self, Tensor y) -> Tensor
dispatch:
  Default:
    x: grad
    y: grad.expand(y.sizes())
  AutogradNestedTensor:
    x: grad
    y: NestedTensor_foo_backward(grad, y)
output_differentiability: [True]
```
However the old schema where there is no `dispatch` entry is still supported.
Would greatly appreciate feedback on *how to improve the testing strategy* of this PR. Currently I have registered an aten test op in TestOps.cpp with dummy gradients in derivatives.yaml and added some tests in test_autograd.py:TestAutogradMultipleDispatch, but I am not sure whether these are sufficiently rigorous.
Additionally, this PR also makes the assumption that sets like [VIEW_FUNCTIONS](ff5399e528/tools/autograd/gen_inplace_or_view_type.py (L60)) are per-native-function and not per-native-function-and-dispatch-key. I'm not sure whether this is necessarily the case: *would there ever be a situation where, e.g., a nested_tensor op is a view op but the aten function is not, or vice versa?*
* __->__ #82801
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82801
Approved by: https://github.com/bhosmer, https://github.com/albanD
Somehow even with clang-format off, it was unhappy with this line
```
>>> Lint for torch/csrc/jit/runtime/serialized_shape_function_registry.cpp:
Warning (CLANGFORMAT) format
See https://clang.llvm.org/docs/ClangFormat.html.
Run `lintrunner -a` to apply this patch.
You can run `lintrunner -a` to apply this patch.
2855 2855 |   return shape_mappings;
2856 2856 | }
2857 2857 |
     2857 |-
2859 2858 | // clang-format on
2860 2859 |
2861 2860 | } // namespace jit
```
Note that there are no changes to `serialized_shape_function_registry.cpp` in this diff, because I had to manually run `lintrunner` to make it format the code correctly in the previous diff so that we could land it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79571
Approved by: https://github.com/eellison
Differential Revision: [D38480514](https://our.internmc.facebook.com/intern/diff/D38480514/)
The torchgen schema parser does not support parsing function schemas that use custom classes so far. Here is an example:
```
quantized::conv2d_relu.new(Tensor qx, __torch__.torch.classes.quantized.Conv2dPackedParamsBase packed_weight, float output_scale, int output_zero_point) -> (Tensor)
```
This PR parses the custom class name and encapsulates it into an object of CustomClassType. The only thing we need right now is to store the class name string and return it in the `__str__` method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82925
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
The functional variant of one of the `arange` overloads has a schema mismatch with the out variant. The functional one has `Scalar step`, but the corresponding out variant has `Scalar step=1`. This isn't allowed, so it had to be special-cased in the python codegen and manually bound. This adds the default `step` value to the functional overload and removes the special-casing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81380
Approved by: https://github.com/ngimel
Fixes #81774
`TensorOptions` arguments in the JIT schema are optional, but in the Python API these were being translated to non-optional but with a default value. This change makes the arguments accept `None` for consistency with the JIT schema. However, it also means that `dtype=c10::nullopt` was previously completely untested so this also fixes several related bugs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82241
Approved by: https://github.com/ngimel
Summary:
A follow-up to #81581. Before these 2 PRs, if an operator with a custom kernel namespace is added to `native_functions.yaml` (or any other yaml consumed by `torchgen`), although we are able to recognize the custom kernel in files such as `NativeFunctions.h` and `RegisterCPU.cpp`, we still generate backend-specific wrappers under the hardcoded `at` namespace. This changes the behavior by generating wrapper functions under custom namespaces.
For example, if the entries in yaml file looks like:
```
- func: op_1(Tensor(a) self) -> Tensor(a)
  dispatch:
    CPU: at::op_1_kernel # ATen kernel
- func: op_2(Tensor(a) self) -> Tensor(a)
  dispatch:
    CPU: custom::op_2_kernel # custom kernel
```
We generate the following code for `CPUFunctions_inl.h` and `RegisterCPU.cpp`:
`CPUFunctions_inl.h`:
```
namespace at {
namespace cpu {
TORCH_API at::Tensor & op_1(const at::Tensor & self);
} // namespace cpu
} // namespace at
namespace custom {
namespace cpu {
TORCH_API at::Tensor & op_2(const at::Tensor & self);
} // namespace cpu
} // namespace custom
```
Notice the difference between `at::cpu` and `custom::cpu`.
Then the definition for these can be found in `RegisterCPU.cpp`.
`RegisterCPU.cpp`:
```
#include "CPUFunctions.h"
namespace at {
namespace {
at::Tensor & wrapper_op_1(const at::Tensor & self) {
// No device check
// DeviceGuard omitted
return at::native::op_1_kernel(self);
}
} // anonymous namespace
TORCH_LIBRARY_IMPL(aten, CPU, m) {
m.impl("op_1", TORCH_FN(wrapper_op_1));
}
namespace cpu {
at::Tensor & op_1(at::Tensor & self) {
return wrapper_op_1(self);
}
} // namespace cpu
} // namespace at
namespace custom {
namespace {
at::Tensor & wrapper_op_2(const at::Tensor & self) {
// No device check
// DeviceGuard omitted
return at::native::op_2_kernel(self);
}
} // anonymous namespace
TORCH_LIBRARY_IMPL(aten, CPU, m) {
m.impl("op_2", TORCH_FN(wrapper_op_2));
}
namespace cpu {
at::Tensor & op_2(at::Tensor & self) {
return wrapper_op_2(self);
}
} // namespace cpu
} // namespace custom
```
The benefit of this change is that it unifies all the namespaces derived from custom ops. In the example above, there are:
1. `custom::native` for kernels
2. `custom::<dispatch_key>` e.g., `custom::cpu` for wrappers
This customized operator will have nothing to do with `at::native`, `at::cpu` etc.
Test Plan: This is very hard to test. I will refactor this logic and abstract out some layers so it's testable. Will do it in coming PRs.
Differential Revision: D37972772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81744
Approved by: https://github.com/bdhirsh
`resize_()` is annoying because it needs special casing for functionalization. It's technically an inplace-view op, but it can't really have a pure view variant, since calling resize_() might bust the old storage. I gave it an `inplace_view` tag so that stuff like `FakeTensor` that relies on tags will pick it up properly, which required jumping through some codegen hoops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82667
Approved by: https://github.com/eellison
Closes #82320
The iteration order of a `set` can change from run to run, resulting
in real content changes to generated files and therefore unnecessary
rebuilding.
The fix is to use a sort to give a repeatable iteration order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82536
Approved by: https://github.com/ezyang
Summary:
Some quantized operators need the `QuantizedCPU` backend. Due to an issue in namespace checking, currently if we have two backends as well as a custom namespace in a native function, codegen will hit an assertion error. This PR fixes this issue.
The root cause is that codegen right now asserts that a native function should only have one namespace. The current behavior is that if a native function is not found in a `BackendIndex`, we will use the default namespace for that backend (for fallback kernels). However, that default namespace may not be listed in the yaml file, and it should not be counted when checking if we have two different namespaces for that backend. In our error case, we have 2 `BackendIndex`es, one for `QuantizedCPU` and one for `CPU`. The native function doesn't have a kernel in `QuantizedCPU`, but we still use a default namespace (`at::native`) for it. Since we have a custom namespace for the dispatch key `CPU`, we ran into the assertion error.
This PR changes the assertion criteria. We only error out if a backend has two or more kernels and they live in two or more different namespaces.
Test Plan: rely on newly added unit test
Differential Revision: D38101345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82133
Approved by: https://github.com/iseeyuan
Deprecated signatures are currently "parsed" manually to find the
relative order of the argument names, and all other information is
inferred from the aten schema for the non-deprecated overload.
However, this leads to problems if the argument names don't match or
if there are multiple candidates that match the ATen function call.
Instead, this makes the deprecated function a full FunctionSchema and
so the entire python signature comes solely from the deprecated
schema, with the `aten:` clause only used for the dispatch lambda call.
I have confirmed locally that there is no change to
`python_torch_functionsEverything.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82179
Approved by: https://github.com/albanD
Done via
```
git grep -l 'SymbolicIntNode' | xargs sed -i 's/SymbolicIntNode/SymIntNodeImpl/g'
```
Reasoning for the change:
* Sym is shorter than Symbolic, and consistent with SymInt
* You usually will deal in shared_ptr<...>, so we're going to
reserve the shorter name (SymIntNode) for the shared pointer.
But I don't want to update the Python name, so afterwards I ran
```
git grep -l _C.SymIntNodeImpl | xargs sed -i 's/_C.SymIntNodeImpl/_C.SymIntNode/'
```
and manually fixed up the binding code
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82350
Approved by: https://github.com/Krovatkin
This also makes them not decompose when we switch Python key.
Note that CompositeExplicitAutogradNonFunctional may be overly
conservative for some implementations (which actually call into
other functional ops), but for now I just uniformly apply this
everywhere to avoid errors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82251
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
Motivation
- The initial motivation for the allowlist is that we were checking in
VmapGeneratedPlumbing.h to pytorch/functorch but people were changing
schemas of operators in pytorch/pytorch. The allowlist helped reduce the
number of collisions (because people change schemas of more operators
than we had in the allowlist). This is no longer a problem because
functorch is in the pytorch/pytorch repo
- Avoid merge conflicts. Multiple people editing the allowlist leads to
merge conflicts; getting rid of that alleviates it.
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82352
Approved by: https://github.com/ezyang
This PR changes VmapGeneratedPlumbing.h to be generated by torchgen. The
output file is ATen/VmapGeneratedPlumbing.h.
Why generate this file inside PyTorch codegen instead of a separate step
in functorch?
- I can't figure out how to get functorch's fbcode target to generate the file.
- functorch's build system will, in the mid-term, be absorbed into
pytorch's build system, so I don't want to do the extra work of adding
a step to the functorch build process.
Test Plan:
- build pytorch, build functorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82351
Approved by: https://github.com/ezyang
Once CompositeImplicitAutograd gets registered to Python key, this will
ensure that tensor subclasses can interpose on these functions directly
rather than getting decomposed. We prefer not decomposing as these
functions are functional, but their implementations use inplace
operations (and are thus more difficult to deal with, unless you use
functionalization.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82238
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
This is to get a conversation started.
* @JackCaoG we could add attributes to items in `ir_codegen` section to customize IR generation logic (e.g. not generating `::Lower`). Though it could be a bit tricky to thread it through.
* Adding an extra argument to `map_codegen` to filter native functions out seems like a step in the right direction. Otherwise, it's a bit confusing how we go from a full list to a codegen list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81847
Approved by: https://github.com/JackCaoG, https://github.com/wconstab, https://github.com/bdhirsh
Currently any function with a default dtype other than None has to be
manually entered into this function. Instead, this reads the default
directly from `native_functions.yaml`. In order to do this, I also
change `PythonSignatureGroup` to take `tensor_options_args` from the
functional variant since the out variant doesn't actually have tensor
options arguments to take the default values from.
Also note that we need to use `default_init` instead of `default`
because the out argument version doesn't have a `tensor_options`
argument to extract the default value from and so the PythonSignature
objects wouldn't match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81479
Approved by: https://github.com/albanD
The composite kernel for `view_copy` that we generate is special-cased a bit for efficiency to avoid having to do extra clones in some cases.
That logic was slightly wrong though, and is fixed here (it needs to mirror the logic in `reshape()`).
It manifested as a debug assert firing for Lazy Tensor, which I confirmed no longer fires when running this script:
```
# ran with "python test_ltc_only_torch.py --device=lazy --sync=1 --nvtx=1"
import torch
import torch._lazy
from torch._lazy.ts_backend import init as init_ts_backend
init_ts_backend()
torch.manual_seed(42)
from transformers import BertForSequenceClassification
def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description='')
    parser.add_argument('--device', type=str, default='cuda')
    parser.add_argument('--sync', type=bool, default=False)
    parser.add_argument('--nvtx', type=bool, default=False)
    return parser.parse_args()
args = parse_args()
device = args.device
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', return_dict=True)
from transformers import AdamW
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_batch = ["I love Pixar.", "I don't care for Pixar."]
encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True)
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)
model = model.to(device)
model.train()
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
labels = torch.tensor([1,0]).unsqueeze(0).to(device)
for _ in range(6):
    torch.cuda.nvtx.range_push(f'Iter{_}')
    torch.cuda.nvtx.range_push('F')
    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
    if args.sync:
        torch._lazy.mark_step()
        torch._lazy.wait_device_ops()
    torch.cuda.nvtx.range_pop()
    loss = outputs.loss
    torch.cuda.nvtx.range_push('B')
    optimizer.zero_grad()
    loss.backward()
    if args.sync:
        torch._lazy.mark_step()
        torch._lazy.wait_device_ops()
    torch.cuda.nvtx.range_pop()
    torch.cuda.nvtx.range_push('O')
    optimizer.step()
    if args.sync:
        torch._lazy.mark_step()
        torch._lazy.wait_device_ops()
    torch.cuda.nvtx.range_pop()
    torch.cuda.nvtx.range_pop()
    torch._lazy.mark_step()
    torch._lazy.wait_device_ops()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81553
Approved by: https://github.com/ezyang
Summary:
In #77710 I introduced some hacks to allow static dispatch to take namespaces. Now that we have introduced namespaces into ops and kernels, we don't have to pass a namespace into `static_dispatch()`; instead we will generate ops with the kernel namespace for `Functions.h`. After this diff:
If we have a yaml file looking like this:
```
- func: op_1(Tensor(a) self) -> Tensor(a)
  dispatch:
    CPU: at::op_1_kernel # ATen kernel
- func: op_2(Tensor(a) self) -> Tensor(a)
  dispatch:
    CPU: custom::op_2_kernel # custom kernel
```
`Functions.h` will contain the following C++ APIs:
```
TORCH_API inline at::Tensor & op_1(at::Tensor & self) {
  return at::cpu::op_1_kernel(self);
}
TORCH_API inline at::Tensor & op_2(at::Tensor & self) {
  return custom::cpu::op_2_kernel(self);
}
```
Test Plan: Rely on CI
Differential Revision: D37900753
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81581
Approved by: https://github.com/iseeyuan
This PR is doing a few interrelated things, all of which are necessary to get correctness. Read the comment in torch/fx/experimental/proxy_tensor.py for the high level overview.
Let's break down the parts of this PR:
* Bug fix where `enable_torch_dispatch_mode` with `None` doesn't work. This makes `enable_torch_dispatch_mode(current_mode.inner)` work, which is the basis for how we temporarily disable fake tensor mode.
* Bug fix for when fake tensor mode is combined with a non-mode tensor subclass. This actually could be ablated from this PR but it affects where the logic for allowing non fake tensor inputs with lift goes, so it's all in here in one go. There are some relevant tests for the fix in fake tensor, but it turns out I didn't need this because I'm always using proxy tensors as a mode (which ensures the ordering is right.)
* New `lift_fresh` view operator. Note that like lift, we have to manually write the functionalize kernel for these functions.
* The actual change, which is to save constants when we see them in the proxy tensor mode, and then propagate them as we go (because otherwise you'll handle mutations on constants incorrectly--see test.)
This is mildly BC-breaking if anyone was previously interposing on
at::lift, but this operator was relatively new and I checked
functorch which has no explicit reference to lift. So I think it
should not be too disruptive.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81192
Approved by: https://github.com/samdow, https://github.com/bdhirsh
There are small typos in:
- caffe2/python/recurrent.py
- test/distributed/test_c10d_nccl.py
- test/test_fx.py
- torch/csrc/jit/runtime/autodiff.cpp
- torchgen/gen.py
Fixes:
- Should read `propagation` rather than `propogation`.
- Should read `multiplied` rather than `multuplied`.
- Should read `eliminate` rather than `elminate`.
- Should read `dispatcher` rather than `disaptcher`.
Semi-automated pull request generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81435
Approved by: https://github.com/ngimel
Summary:
A followup to #78015 and #79733. In those PRs I introduced custom namespace support into:
* `Register<DispatchKey>.cpp`
* `RegisterSchema.cpp`
* `NativeFunctions.h`
This PR extracts out logic that generates schema registration code (used in `RegisterSchema.cpp`) into a function so that it can be easily tested and reused. Added unit test to cover the logic as well.
Test Plan: Rely on newly added unit tests.
Differential Revision: D37581186
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80780
Approved by: https://github.com/iseeyuan
`ProxyTorchDispatchMode` was added recently as part of `make_fx`, which was secretly causing the meta tensor calls used inside of functionalization to get baked into the graph. It also wasn't caught because the functionalization tests in core don't use `make_fx`, and the tests in functorch aren't as comprehensive.
Now that `make_fx` is in core, I also ported the functionalization test infra over to use it, which would have caught the regression. This also makes the tests cleaner, since mode-based tracing lets us pick up factory functions in the trace output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80416
Approved by: https://github.com/ezyang, https://github.com/albanD
This should fix the last issue that @anijain2305 hit when running ResNet with TorchDynamo <> functionalization.
Today if you try to call an `OpOverloadPacket` from python with some arguments, we will use the types of those arguments to perform overload resolution. With some functional variants of ops, this can be ambiguous.
Today this affects just one op: `_fused_moving_avg_obs_fq_helper`, although it would potentially affect e.g. `native_batch_norm` in the future.
Example:
```
# There are technically two overloads:
# torch.ops.aten._fused_moving_avg_obs_fq_helper.default (returns 2 arguments, mutates 4 of its inputs inplace)
# torch.ops.aten._fused_moving_avg_obs_fq_helper.functional (returns 6 arguments, mutates none of its inputs)
# We pick the wrong one - no way to know that we should pick the functional one, just from the call site.
outs = torch.ops.aten._fused_moving_avg_obs_fq_helper(a, a, a, a, a, a, a, 1.0, 0, 1, 0)
# raises an error - tries to call the overload with only 2 returns
return _fused_moving_avg_obs_fq_helper_functional[5]
```
Specifically, functionalization will bake `_fused_moving_avg_obs_fq_helper.functional` into the graph, but when AOTAutograd tries to compile with TorchScript, it needs to remove the overload name (TS doesn't know how to parse overload names directly, so we need to remove the overload name and let it infer the right overload at runtime later), and it picks the wrong one.
The situation is pretty similar to inplace: `ops.aten.add` and `ops.aten.add_` represent two different `OverloadPacket` objects; they can't be overloads of the same op, because their schemas would be ambiguous (the alias annotations are different, but that isn't enough to disambiguate).
In this PR, I try to fix the situation in a pretty similar way to how we handle `inplace` in the data model: `inplace` ops get their own base operator name, but they are represented as a flag inside of `BaseOperatorName` in the data model.
Two other important changes that I made as part of this PR:
(1) Originally, there were ~100 different `*_functional` operators: e.g. we had operators named `resize.functional` and `zero.functional`. The `_functional` bit isn't actually necessary in most cases: it's only necessary for operators that **also** have a `SchemaKind.mutable` variant, where `_fused_moving_avg_obs_fq_helper` is the only op that fits that description today. So I removed the unnecessary notion of "functional" from those other ops. I also added a bunch of assertions to force this restriction.
I think that makes more sense in the long run, because it eliminates an unnecessary difference in the model. E.g. we don't have `add_.Tensor` and `add.Tensor_functional`. We just have `add_.Tensor` and `add.Tensor`.
(2) I noticed that we actually still weren't pairing up a bunch of `_foreach` operators correctly, because their input arguments were different (`self` vs. `tensors`). Since they're private API's, I went ahead and changed the argument names directly so they get matched up. Before this PR, we were generating a separate `_foreach_add` and `_foreach_add.functional` variant in a bunch of cases, that really did the same thing (but happened to have a different name for the first argument).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80556
Approved by: https://github.com/ezyang, https://github.com/albanD