The bug: to move a mode to the autograd key, we need to use its "functionality" key (AutogradFunctionality). When we do that, we also need to clear the PythonDispatcher caches for every op under **every** autograd key, since autograd ops may have run with both CPU and CUDA tensors underneath the mode, and both may have been cached.
I didn't add a test, since this ends up being tested indirectly by export in the PR. If someone would prefer a direct test, I can add one.
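As a rough illustration of why every autograd key has to be invalidated, here is a toy sketch. `ToyDispatcher`, `AUTOGRAD_KEYS`, and the `"mode"`/`"kernel"` handler names are hypothetical and are not the real PythonDispatcher; the sketch only assumes that handlers are cached per (op, key) pair.

```python
from typing import Dict, Tuple

# Hypothetical per-backend autograd keys; the real dispatcher has more.
AUTOGRAD_KEYS = ("AutogradCPU", "AutogradCUDA", "AutogradOther")


class ToyDispatcher:
    """Toy stand-in for a Python-side dispatch cache (an assumption, not PyTorch's)."""

    def __init__(self) -> None:
        self.mode_on_autograd = False
        self._cache: Dict[Tuple[str, str], str] = {}

    def resolve(self, op: str, key: str) -> str:
        # Cache which handler serves (op, key) so later calls skip the lookup.
        if (op, key) not in self._cache:
            use_mode = self.mode_on_autograd and key in AUTOGRAD_KEYS
            self._cache[(op, key)] = "mode" if use_mode else "kernel"
        return self._cache[(op, key)]

    def move_mode_to_autograd(self) -> None:
        self.mode_on_autograd = True
        # Drop cached entries for *every* autograd key, for every op.
        for cached in list(self._cache):
            if cached[1] in AUTOGRAD_KEYS:
                del self._cache[cached]
```

If only one autograd key were cleared here, an op cached earlier under the other key would keep resolving to the stale handler.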
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98030
Approved by: https://github.com/ezyang
This PR improves the list/tuple handling by merging the logic into
`wrap_with_proxy` directly, and calls set_meta when the current
proxy is an fx.Proxy. This also fixes the problem that, even though `fused_adam`
has `val`, some of the `getitem` calls that follow `fused_adam` don't have `val`.
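The recursive handling can be pictured with a toy sketch. `ToyNode`, `ToyProxy`, and `toy_wrap_with_proxy` are hypothetical stand-ins written for illustration; they are not the fx classes or the function changed in this PR.

```python
from typing import Any, Dict, List


class ToyNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.meta: Dict[str, Any] = {}


class ToyProxy:
    """Hypothetical stand-in for fx.Proxy; indexing creates a getitem node."""

    def __init__(self, name: str, created: List[ToyNode]) -> None:
        self.node = ToyNode(name)
        self._created = created
        created.append(self.node)

    def __getitem__(self, idx: int) -> "ToyProxy":
        return ToyProxy(f"{self.node.name}[{idx}]", self._created)


def toy_wrap_with_proxy(value: Any, proxy: Any) -> None:
    # Only proxies carry a node, so only they get metadata.
    if isinstance(proxy, ToyProxy):
        proxy.node.meta["val"] = value
    # Lists and tuples are handled in the same function: each element is
    # wrapped through its own getitem proxy, so that node gets `val` too.
    if isinstance(value, (list, tuple)):
        for idx, item in enumerate(value):
            toy_wrap_with_proxy(item, proxy[idx])
```

Because the container case recurses through the same function, every `getitem` node created along the way receives `val` alongside the node that produced the container.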
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99897
Approved by: https://github.com/ezyang
This PR introduces CustomOp, a wrapper around a dispatcher operator that allows
users to define custom operators. It adds the skeleton for CustomOp and
some very simple behavior. As of this PR, one can:
- create a CustomOp for an operator that has no in-place or aliasing behavior
- give it CPU/CUDA and Meta implementations
- trace it into a graph via make_fx.
The design follows
https://docs.google.com/document/d/19Uc5OUCA187q9BZggJb70RT2ZoSTDoG5QQkJkZwd25M/edit
Concretely, we implement the following things mentioned in the doc in this PR:
- Entrypoint 1 (CustomOp.define, creating a new custom operator)
- impl (to define device-specific code) and impl_meta (to define meta
formulas)
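To make the shape of these entrypoints concrete, here is a toy sketch of a registry with per-device implementations and a meta formula. `ToyCustomOp`, its methods, and the schema check are hypothetical; this is not the private CustomOp API added in this PR, and the signatures are assumptions.

```python
import math
from typing import Callable, Dict


class ToyCustomOp:
    """Hypothetical sketch of an op with per-device and meta implementations."""

    def __init__(self, name: str, schema: str) -> None:
        # Assumption: alias annotations such as "Tensor(a)" or "Tensor(a!)"
        # are how a schema marks views or mutation, so reject them here.
        if "Tensor(" in schema:
            raise ValueError(f"{name}: in-place or aliasing schemas are not supported")
        self.name = name
        self.schema = schema
        self._impls: Dict[str, Callable] = {}

    def impl(self, *device_types: str) -> Callable:
        def register(fn: Callable) -> Callable:
            for device in device_types:
                self._impls[device] = fn
            return fn

        return register

    def impl_meta(self) -> Callable:
        # A meta formula computes output metadata (here, a shape) only.
        return self.impl("meta")

    def __call__(self, device: str, *args):
        if device not in self._impls:
            raise NotImplementedError(f"{self.name}: no implementation for {device!r}")
        return self._impls[device](*args)


toy_sin = ToyCustomOp("toy_sin", "(Tensor x) -> Tensor")


@toy_sin.impl("cpu", "cuda")
def toy_sin_impl(values):
    return [math.sin(v) for v in values]


@toy_sin.impl_meta()
def toy_sin_meta(shape):
    return tuple(shape)
```

Here plain lists stand in for tensors and a tuple stands in for a shape; the point is only that the device implementations and the meta formula are registered separately under one operator.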
The goal for the short term is to get the code to a state where it can be trialed
by the export folks. On top of this PR, the blockers are:
- adding Entrypoint 3 (CustomOp.from_existing)
- adding a way to do data-dependent shape formulas
These will come in future PRs since this one is getting long.
Things that will come in the longer near term (before 2.1):
- adding the other entrypoints mentioned in the doc (2 & 3)
- more safety checks and better error messages
- support for views and mutation
- support for defining autograd formulas
- support for functionalization
- making this API public (it's private right now).
Test Plan:
- added a new test case, TestCustomOp. It mostly tests a bunch of error
cases.
- added OpInfos for custom operators and hooked these up to
test_proxy_tensor to test that they work with make_fx. These custom
operators were based off of the ones in the autograd_function_db.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98440
Approved by: https://github.com/ezyang
The purpose of this API is to execute a few large components of work:
1) Refactor all the internals of plumbing dynamic dimension information after dynamo to be stateless
2) Decouple allocation controls around dynamic dimensions from verification
3) For (2), for allocation, create an enum that dictates whether we are in DUCK (default today), STATIC (aka assume_static_default in the past), or DYNAMIC (aka user constrained, do not duck shape)
4) For (2), for verification, we separate out the list of dynamic ranges entirely from allocation. This means shape_env does not track what we verify on; instead, it is the caller's job to invoke produce_guards() with the various things they want verified, specifically the valid ranges. We do use constrain ranges to refine value ranges when doing analysis.
5) We have therefore decided, as an extension of (4), to double down on "late" checks versus "eager" checks, primarily because the mechanisms for gathering what actually matters happen during guards, and should be the purview of the caller seeking guards, not the shape env. However, for dynamo, these structures are essentially one and the same.
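The split between allocation and verification in (3) and (4) can be sketched as follows. `DimPolicy`, `ToyShapeEnv`, and the `produce_guards` signature shown here are hypothetical and simplified; only the DUCK/STATIC/DYNAMIC names and the idea of passing ranges in late come from the description above.

```python
from enum import Enum, auto
from typing import Dict, List, Tuple, Union


class DimPolicy(Enum):  # hypothetical name for the allocation enum
    DUCK = auto()     # equal sizes share one symbol
    STATIC = auto()   # keep the concrete integer
    DYNAMIC = auto()  # always allocate a fresh symbol, never duck shape


class ToyShapeEnv:
    def __init__(self) -> None:
        self.hints: Dict[str, int] = {}  # symbol -> observed size
        self._duck: Dict[int, str] = {}  # size -> shared symbol

    def _fresh(self, size: int) -> str:
        sym = f"s{len(self.hints)}"
        self.hints[sym] = size
        return sym

    def allocate(self, size: int, policy: DimPolicy) -> Union[int, str]:
        if policy is DimPolicy.STATIC:
            return size
        if policy is DimPolicy.DUCK:
            if size not in self._duck:
                self._duck[size] = self._fresh(size)
            return self._duck[size]
        return self._fresh(size)

    def produce_guards(self, ranges: Dict[str, Tuple[int, int]]) -> List[str]:
        # The ranges come from the caller; the env stores none of them.
        guards = []
        for sym, (lo, hi) in ranges.items():
            hint = self.hints[sym]
            if not lo <= hint <= hi:
                raise ValueError(f"{sym}={hint} violates [{lo}, {hi}]")
            guards.append(f"{lo} <= {sym} <= {hi}")
        return guards
```

In this sketch the environment keeps no constraint state between calls: the same allocated symbols can be verified against different ranges simply by calling `produce_guards` again with other arguments.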
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96699
Approved by: https://github.com/avikchaudhuri, https://github.com/ezyang
This removes the need to explicitly constrain_unify `x[mask]` and `y[mask]` when mask is a boolean tensor. It's very narrow but it seems to work in practice.
To invalidate the nonzero call when mutation occurs, I use the version counter. I know there are ways to bypass this, but I think it's good enough for now.
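The version-counter idea can be sketched with a toy mask object. `ToyMask`, `nonzero_size`, and the `u<N>` symbol names are hypothetical; the sketch only assumes that the size symbol for the nonzero result is memoized together with the version at which it was computed.

```python
import itertools

# Hypothetical source of fresh unbacked symbol names.
_fresh_ids = itertools.count()


class ToyMask:
    """Toy boolean mask with a version counter bumped on every mutation."""

    def __init__(self, values) -> None:
        self._values = [bool(v) for v in values]
        self._version = 0
        self._memo = None  # (version, symbol) from the last nonzero_size call

    def __setitem__(self, idx: int, value) -> None:
        self._values[idx] = bool(value)
        self._version += 1

    def nonzero_size(self) -> str:
        # Reuse the memoized symbol only if no mutation happened since.
        if self._memo is not None and self._memo[0] == self._version:
            return self._memo[1]
        symbol = f"u{next(_fresh_ids)}"
        self._memo = (self._version, symbol)
        return symbol
```

Two indexing operations with the same unmutated mask would see the same symbol, which is what removes the need to unify them by hand; any write to the mask forces a fresh one.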
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95399
Approved by: https://github.com/eellison
This takes the strategy described in https://docs.google.com/document/d/1lFRYAJo5nrfxRhwIzGnfi2pbLpU6T4ytSRSuLJ5qebI/edit#
It is essentially https://github.com/pytorch/pytorch/pull/95222 but squashed and with changes that are unnecessary given that we assume nonzero returns > 1.
What's in the PR:
* nonzero now supports meta propagation. When `capture_dynamic_output_shape_ops` is enabled, it returns a tensor with an unbacked SymInt representing the size in question.
* The unbacked SymInt is UNSOUNDLY assumed to be not equal to 0/1. We will still error if you guard otherwise.
* PrimTorch pointwise operators are updated to use empty_permuted, to avoid guarding on unbacked SymInt from empty_strided (tested in `test_dynamic_pointwise_scalar`)
* Convolution is updated to skip backend selection if batch is unbacked, to avoid guarding on unbacked SymInt (tested in `test_unbacked_batch_resnet`)
* I kept the helper utilities like `definitely_true` for working with possibly unbacked SymInts. They're not used right now but maybe someone will find them useful.
* Added `constrain_unify` to let you specify that two unbacked SymInts must have the same value
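One way to picture what unifying two unbacked symbols means is a small union-find over symbol names. `ToyUnifier` and its methods are hypothetical and say nothing about how `constrain_unify` is implemented in this PR.

```python
from typing import Dict


class ToyUnifier:
    """Toy union-find recording which unbacked symbols must be equal."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def find(self, sym: str) -> str:
        self._parent.setdefault(sym, sym)
        while self._parent[sym] != sym:
            # Path halving keeps later lookups short.
            self._parent[sym] = self._parent[self._parent[sym]]
            sym = self._parent[sym]
        return sym

    def constrain_unify(self, a: str, b: str) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, a: str, b: str) -> bool:
        return self.find(a) == self.find(b)
```

Unification is transitive in this sketch: after unifying `u0` with `u1` and `u1` with `u2`, all three names resolve to the same representative.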
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95387
Approved by: https://github.com/voznesenskym