We override the `__call__` method and register fake, functional, proxy default dispatch mode implementation in its python_key_mode_table.
The idea is:
1. when inputs contains FakeScriptObject, we dispatch it through _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor.
2. when inputs are not fakified, we dispatch through the original c++ dispatcher.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367
Approved by: https://github.com/zou3519
A kernel has "dispatcher convention" if there is an additional keyset
arg at the beginning of the argument list. This PR:
- adds a way to register kernels with dispatcher_convention using
Library.impl (pass dispatcher_convention = True)
- adds OpOverload.redispatch
We use both of the above in the new custom ops API: we register the
autograd kernel in dispatcher convention so that we can actually call
redispatch like how pytorch built-in ops do it.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.
This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123578
Approved by: https://github.com/albanD
ghstack dependencies: #123453
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
When we enter map_autograd, we try to trace through fwd/bwd of a map operator that is wrapped in ctx.functionalize wrapper. This forces us to go through PreDispatch functionalization again (only the python part). As a result, it revealed our previous bug where pre-dispatch mode handling doesn't actually manage the local dispatch key set. (If there is no active mode, we need to turn off PreDispatch key). This PR fixes that. Also I shuffled some APIs around so that there is less code duplication as the setting/unsetting logic is quite hard to get it right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121444
Approved by: https://github.com/bdhirsh
In case no keyword arguments are passed, `**kwargs` would expand just fine without the need for extra overhead of `or {}`. In addition to reducing boilerplate, this also comes with a small perf improvement:
```
In [1]: def null(*args, **kwargs):
...: pass
...:
In [2]: def call1(*args, **kwargs):
...: return null(*args, **(kwargs or {}))
...:
In [3]: def call2(*args, **kwargs):
...: return null(*args, **kwargs)
...:
In [4]: %timeit call1()
145 ns ± 2.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [5]: %timeit call2()
118 ns ± 2.14 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [6]: %timeit call1()
147 ns ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [7]: %timeit call2()
117 ns ± 0.846 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117880
Approved by: https://github.com/Skylion007
In this PR, we are implementing Functionalization on pre-dispatch graph. Today, every dispatch key except for Dispatchkey.Python has a dedicated mode stack in python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to Dispatchkey.PreDispatch mode stack and handle the dispatching logic in python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode on DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode. (this is very similar to how post-dispatch tracing work). Here are some design decisions we made for this flow to work:
1. FunctionalTensorMode internally calls C++ functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful, we will keep re-entering into PreDispatch key. We solve this by directly dispatching to C++ Functionalize key.
2. We delete mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch and it is in general not safe to have a plain list because FunctionalTensorMode and ProxyTorchDispatchMode ordering matter and it is hard to enforce it on plain list. Instead, now we have a private class that tracks PreDispatch mode stack.
3. We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.
Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
Summary:
This is important for writing aten IR based graph transformation.
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']
In [8]: torch.ops.aten.reshape.default(torch.rand(1,2), shape=[2])
Out[8]: tensor([0.7584, 0.4834])
# === CANNOT CALL `self` BY KWARGS ===
In [7]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
TypeError: OpOverload.__call__() got multiple values for argument 'self'
```
# Where's the problem?
1. the aten ops first arg is usually named `self` (aten/src/ATen/native/native_functions.yaml)
2. Unfortunately, in `torch._ops.{OpOverload, OpOverloadPacket}.__call__()`, the first arg is (by python convention) named `self` too.
So when call `self` by kwargs, `OpOverloadPacket.__call__` received:
```
OpOverloadPacket.__call__(self, {"self": ...})
```
It is Python that does not allow some argument named "arg" to appear twice. and hence
> TypeError: OpOverload.__call__() got multiple values for argument 'self'
# How to fix?
**Note that**, in above, `self` is an instance of `OpOverloadPacket`, and the "self" kwarg is the input tensor to the aten op. To fix, we only need to differentiate the two `self`s.
In Python, first arg of a method does not need to be named `self`. So we change the `__call__` definition to:
```
def __call__(_self, ...):
```
Now the call becomes:
```
OpOverloadPacket.__call__(_self, {"self": ...})
```
where:
* `_self` is the instance to the `OpOverloadPacket`
* `"self"` is the input tensor to the aten op.
Test Plan:
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']
In [3]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
Out[3]: tensor([0.5127, 0.3051])
```
Differential Revision: D51731996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114920
Approved by: https://github.com/houseroad
Users may wish to torch.compile custom ops that mutate their inputs
and return nothing (this is a common class of operators).
torch.compile will automatically support this op without anyone needing
to provide a functionalization kernel for it. Here's how.
Let's say we have a hypothetical mylib::sin_(Tensor(a!) x) -> ()
op. First, when FakeTensor sees this op, it can just return None.
This is the case because custom ops are not allowed to mutate input
metadata, so the FakeTensor rule for one that returns nothing is trivial.
Next, when Python FunctionalTensor sees the op, it will functionalize
it by emitting a call to an auto_functionalize(op, ["x"], {"x": ...})
HOP and replacing the mutated inputs with the outputs of this HOP.
This HOP effectively runs the functional version of the op when
called: it clones inputs that will be mutated, runs the op, and
then returns Tensors with the new values.
In the future we can teach Inductor how to do re-inplacing when it sees
this HOP (like how triton kernels do it) but this isn't urgent (and is
more of a performance problem).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114955
Approved by: https://github.com/bdhirsh
This is a reland of #112757. Cannot land original one internally because internal diff is not in sync with OSS due to issues in dealing with two export repos (executorch and pytorch) using the ghimport-ghexport approach.
Will try the web UI of import and export instead of ghimport and ghexport flow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113813
Approved by: https://github.com/angelayi
Summary:
We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
Library.define in Python appends the op to a global set, which is analogous
to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
we require that it has an `impl_abstract_pystub` specified and we also check
that the module in the `impl_abstract_pystub` is the same as the module where
the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
buck-based systems) because buck sits above us.
bypass-github-export-checks
Test Plan: - existing tests
Differential Revision: D51080493
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113182
Approved by: https://github.com/ezyang
Summary:
We've made the following changes:
- The new way to use the API is `m.impl_abstract_pystub(module, context)`.
Every subsequent m.def of an op inside the TORCH_LIBRARY block gives
the op the `impl_abstract_pystub`.
- Added a mechanism to determine if an operator was defined in Python or C++.
Library.define in Python appends the op to a global set, which is analogous
to what we do for tracking Library.impl.
- If someone does `torch.library.impl_abstract` in Python for an operator, then
we require that it has an `impl_abstract_pystub` specified and we also check
that the module in the `impl_abstract_pystub` is the same as the module where
the call to `torch.library.impl_abstract` exists.
- Unfortunately we can't check the "context" (which is the buck target on
buck-based systems) because buck sits above us.
Test Plan: - existing tests
Differential Revision: D50972148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112851
Approved by: https://github.com/ezyang
Summary:
When raising an exception here this causes pybind11's dispatcher to kick in, which causes aiplatform's logic to kick in (aiplatform::error_reporting::util::printAddressesWithBestEffortLocationInfo), which ultimately uses `folly::symbolizer::Symbolizer::symbolize` for building up the stack trace. In 3.8 this uses about 3.62% of the CPU time per pyperf (https://fburl.com/scuba/pyperf_experimental/on_demand/oi554uvy). In Cinder 3.8 for some reason this is worse - using 5.94% of the CPU.
This exception is happening when doing a hasattr() on `prims` for things like `bitwise_left_shift` which don't exist: https://www.internalfb.com/code/fbsource/[2d695f650d00]/fbcode/caffe2/torch/_inductor/lowering.py?lines=590
That exception is ultimately going to be swallowed anyway, and the stack trace has no meaningful value. Furthermore because this is kind of an expected outcome in the code versus some random C++ exception the stack trace is less valuable as well.
This changes this to return a (None, None) on the failure case instead of returning a valid op/overload list, avoiding the exception, and reclaiming the 3.62%-5.94% of time.
Test Plan: Existing CI and perf run: https://fburl.com/scuba/pyperf_experimental/on_demand/oi554uvy
Differential Revision: D50018789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111438
Approved by: https://github.com/davidberard98
Generally, to extend PyTorch with custom operators, a user will
create a Python module whose import triggers registration of
the custom operators via a torch.ops.load_library call or a call
to one or more torch.library.* APIs.
It is unexpected for Python modules to have side effects, so some
linters and formatters will complain. Use torch.ops.import_module to
import the module without a linter or formatter complaining.
NB: A more robust API would actually check if a custom op was registered
or modified, but this is technically challenging to do. In the future we
can add a warning if a custom op wasn't registered or modified.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110090
Approved by: https://github.com/ezyang
We now have two types of functionalization, C++ Functionalization (through the `Functionalize` dispatch key), and python functionalization (through the `FunctionalTensorMode` torch_dispatch mode).
This means that all higher order ops need custom functionalization rules for the python variant too. I added them here, as well as a helper function `dispatch_functionalize()` - equivalent to `torch.func.functionalize()`, except that it uses `FunctionalTensorMode`.
In theory we could have secretly switched `torch.func.functionalize` to use `FunctionalTensorMode`. This would be BC-breaking, though, since `FunctionalTensorMode` isn't composable with the other functorch transforms (the functorch layer-mode stack doesn't know how to re-order torch_dispatch modes arbitrarily).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108656
Approved by: https://github.com/zou3519
ghstack dependencies: #109024, #109248
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
Remove _deprecated_global_ns from cond following #104105.
We change the module attribute of HigherOrderOperator instances in the constructor from torch.ops to torch.ops.higher_order when self.namespace is "higher_order". For subclasses (e.g. customized higher order operator), we leave their \_\_module\_\_ unchanged.
Will import this PR to fix internal tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104380
Approved by: https://github.com/zhxchen17, https://github.com/zou3519
Summary:
Trying to get the `__self__` attribute on any `_OpNamespace` object should be an invalid operation. The `__self__` attribute only exists on instance method object and not on class objects.
In [dynamo](a152b3e3b8/torch/_dynamo/variables/torch.py (L164)) there is code that tries to access the `__self__` attribute on `TorchVariable`, this currently results in an expensive call to `torch._C._jit_get_operation` [here](a152b3e3b8/torch/_ops.py (L740)) which ultimately fails and throws an exception. For cases where it fails the operation turns out to be quite expensive on the order of ~0.03s.
For edge use cases when exporting large models with quantized ops this exception is thrown 100's of times resulting in a lot of time wasted. By preventing the call to `torch._C._jit_get_operation` we can quickly return from this function and significantly reduce export times. On a large ASR model for example export currently takes **~405** seconds. With this change we can reduce it to **~340s**.
Overall this should also be a harmless change as no one should mostly ever try to access the `__self__` attribute on any `_OpNamespace` object.
Test Plan: Added test case.
Differential Revision: D46959879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104096
Approved by: https://github.com/larryliu0820, https://github.com/ezyang, https://github.com/zou3519
This PR changes the default namespace for higher order operators from the
global namespace (e.g. torch.ops.cond) to `higher_order` (e.g.
torch.ops.higher_order.cond). We don't actually change the namespace
for existing HigherOrderOperators.
The motivation is to stem the bleeding; exposing operators into the global
namespace is a bad idea due to name collision with other user-defined
namespaces.
We will go in and fix the `_deprecated_global_ns` as necessary after this diff.
Differential Revision: [D46809738](https://our.internmc.facebook.com/intern/diff/D46809738/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103870
Approved by: https://github.com/ydwu4