We want users to be able to define custom ops in C++ but put the
abstract impl in Python (since it is easier to write them in Python and
the abstract impl better models device semantics and data-dependent
operators).
`m.impl_abstract_pystub(opname, python_module, context)` declares the
abstract_impl of the operator to exist in the given python module.
When the abstract_impl needs to be accessed (either via FakeTensor or
Meta) and it does not exist, the PyTorch Dispatcher will raise
a descriptive error message.
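A minimal sketch of the declaration side, following the signature above (the operator name, Python module, and context string here are all hypothetical):
```
TORCH_LIBRARY(mylib, m) {
  m.def("my_op(Tensor x) -> Tensor");
  // Declare that the abstract impl for mylib::my_op lives in the Python
  // module "mylib.abstract_impls"; importing that module is what
  // actually registers the impl.
  m.impl_abstract_pystub("my_op", "mylib.abstract_impls", "my_context");
}
```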
Some details:
- We construct a new global AbstractImplPyStub mapping in
Dispatcher.cpp. Reads and writes to this map are protected by the
Dispatcher lock.
- We add a new Meta Tensor fallback kernel. The fallback errors out if there is
no meta kernel, but also offers a nicer error message if we see that there is
a pystub.
- We create a `torch._utils_internal.throw_abstract_impl_not_imported_error`
helper function to throw errors. This way, we can throw different error
messages in OSS PyTorch vs internal PyTorch. To invoke this from C++, we
added a PyInterpreter::throw_abstract_impl_not_imported_error.
Differential Revision: [D49464753](https://our.internmc.facebook.com/intern/diff/D49464753/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109529
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
# Motivation
Add the XPU device type to the CppFunction dispatch overload function.
We previously omitted it.
# Solution
Add XPU device type.
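A minimal sketch of what this enables, assuming a hypothetical `xpu_kernel`; the DeviceType-based `torch::dispatch` overload can now name XPU:
```
TORCH_LIBRARY(myns, m) {
  m.def("my_op(Tensor x) -> Tensor");
  // Previously this overload had no XPU case; xpu_kernel is hypothetical.
  m.impl("my_op", torch::dispatch(c10::DeviceType::XPU, TORCH_FN(xpu_kernel)));
}
```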
# Additional
This list is kept in sync with the k-constants in c10/core/DeviceType.h.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96849
Approved by: https://github.com/ezyang
See the strategy in PythonOpRegistrationTrampoline.cpp for the
big picture.
Along the way, I made OperatorHandle support == and hashing,
and slightly changed the low-level python_dispatch impl API
to disallow empty strings for the dispatch key, which had the knock-on
effect of requiring us to explicitly make sure we pass in
CompositeImplicitAutograd where we would have passed in "" (I didn't apply
this to the rest of the file because I'm lazy.)
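As a sketch of what the OperatorHandle change enables (illustrative only; the trampoline's real bookkeeping lives in PythonOpRegistrationTrampoline.cpp):
```
#include <ATen/core/dispatch/Dispatcher.h>
#include <unordered_map>

// OperatorHandle can now key unordered containers directly.
std::unordered_map<c10::OperatorHandle, int> seen;

void record(const c10::OperatorHandle& op) {
  ++seen[op];  // relies on operator== and std::hash<OperatorHandle>
}
```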
The test strategy is to delete the logic that prevented Python op
registrations in torch from being skipped in a torchdeploy context,
and show that CI still works.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87162
Approved by: https://github.com/anjali411, https://github.com/bdhirsh
Summary:
This change causes Messenger Desktop to crash on M1 devices when the user enables a background during a call. The change apparently causes the compiler to emit AVX instructions that are not supported by Rosetta.
This is a surgical backout that only backs out the changes on the C++ side,
not the Python bindings, which I believe are not shipped with Workplace Chat.
Test Plan:
Run the application and make sure that it doesn't crash when the background is enabled
https://pxl.cl/23VSH
Reviewed By: ezyang
Differential Revision: D36358832
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77414
Approved by: https://github.com/bigfootjon
Summary: The new PrivateUse1 DeviceType is associated with the PrivateUse1 DispatchKey, which can be used for non-public devices without introducing a new device type. Note that the stringified name of the PrivateUse1 device is "privateuseone".
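A minimal sketch of what this allows (the kernel below is hypothetical):
```
// An out-of-tree backend can register kernels against PrivateUse1
// without PyTorch having to know about the device; my_backend_add is
// a hypothetical backend function.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("add.Tensor", my_backend_add);
}
```
Tensors produced by such a backend report their device as `privateuseone`.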
Test Plan: All CI should pass.
Differential Revision: D35859437
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77208
Approved by: https://github.com/bdhirsh
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68693
Generation of Python bindings for native functions is split over 8
different files: one for each namespace, with the torch namespace
split into 3 shards, and methods in their own file as well. This
change ensures that editing any single (non-method) operator only
causes one of these files to be rebuilt.
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D32596270
Pulled By: albanD
fbshipit-source-id: 0570ec69e7476b8f1bc21138ba18fe8f95ebbe3f
(cherry picked from commit ba0fc71a3a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63688
CppFunction is used for function registration, so it's not performance-sensitive. Outlining the destructor should reduce code size.
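The outlining itself is just the usual pattern; a sketch, assuming the definition moves into the .cpp file:
```
// Header: declare the destructor without defining it, so registration
// sites don't each instantiate an inline destructor body.
class TORCH_API CppFunction final {
 public:
  ~CppFunction();
  // ...
};

// .cpp: one out-of-line definition shared by all call sites.
CppFunction::~CppFunction() = default;
```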
ghstack-source-id: 146648927
Test Plan: Mobile buildsizebot
Reviewed By: dhruvbird
Differential Revision: D30462640
fbshipit-source-id: de410f933bf936c16769a10a52092469007c8487
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67340
Currently, Torchbind classes aren't selective. This change adds a rough-granularity pass that removes entire classes if they aren't selected. If we need finer granularity in the future, we can make individual methods within classes selective, though instrumenting that will be significantly more involved, I think. On a Linux build, only __torch__.torch.classes._nnapi.Compilation remains unselective; I can't find where it's registered :P (there are a couple of Android-only ones and presumably some Metal-only ones as well).
Many of the class-registration functions returned a reference to the class that was created. I talked with dreiss about it, and we decided that this seemingly didn't serve any purpose, and leaving it like that would make the return value difficult (but possible) to create with selectivity. Since it seems useless anyway, I just changed them to return an int so that they can still be called from global scope without any issues with the return type.
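A sketch of the before/after pattern described above (the class and names are illustrative):
```
#include <torch/custom_class.h>

struct MyClass : torch::CustomClassHolder {
  int64_t x = 0;
};

// Before: this helper returned the torch::class_ object, which is hard
// to construct selectively. After: it returns an int, so it can still
// be invoked from a global initializer.
int register_my_class() {
  static auto cls = torch::class_<MyClass>("my_ns", "MyClass")
      .def(torch::init<>());
  return 0;
}

static int registered = register_my_class();
```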
ghstack-source-id: 141690776
Test Plan: CI, model unit tests, test models in prod apps
Reviewed By: dhruvbird
Differential Revision: D31092564
fbshipit-source-id: 657f7eb83490292436c15cf134ceca9b72fb9e1a
Summary:
This PR implements the necessary hooks/stubs/enums/etc for complete ONNX Runtime (ORT) Eager Mode integration. The actual extension will live out of tree at https://github.com/pytorch/ort.
We have been [working on this at Microsoft](https://github.com/microsoft/onnxruntime-pytorch/tree/eager-ort/torch_onnxruntime) for the last few months, and are finally ready to contribute the PyTorch core changes upstream (nothing major or exciting, just the usual boilerplate for adding new backends).
The ORT backend will allow us to ferry [almost] all torch ops into granular ONNX kernels that ORT will eagerly execute against any devices it supports (therefore, we only need a single ORT backend from a PyTorch perspective).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58248
Reviewed By: astaff
Differential Revision: D30344992
Pulled By: albanD
fbshipit-source-id: 69082b32121246340d686e16653626114b7714b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62658
Boxed functors, like their unboxed brethren, support operators that
aren't just a function pointer, but a function pointer with some
associated global state that is allocated at registration time.
The use case I have in mind for this implementation is the "dispatcher
API from Python", where the extra state a kernel registration needs is
the PyObject callable we will invoke to do the actual invocation.
See next PR in this stack.
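A minimal sketch of a boxed functor, assuming the `CppFunction::makeFromBoxedFunctor` entry point this stack adds (treat the exact operator() signature as illustrative):
```
#include <torch/library.h>
#include <memory>

// The kernel carries state allocated at registration time; in the
// Python-dispatcher use case, the state is the PyObject callable.
class MyBoxedKernel final : public c10::OperatorKernel {
 public:
  explicit MyBoxedKernel(std::string state) : state_(std::move(state)) {}
  void operator()(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
    // Inspect/transform the boxed arguments on `stack`, using state_.
  }
 private:
  std::string state_;
};

TORCH_LIBRARY_IMPL(myns, CPU, m) {
  m.impl("my_op",
         torch::CppFunction::makeFromBoxedFunctor(
             std::make_unique<MyBoxedKernel>("some state")));
}
```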
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D30074925
Pulled By: ezyang
fbshipit-source-id: ee040edbbec1e607486d338d1ea78bb5c6b2ece9
Summary:
The function name and the return type are both called `class_`, so they are ambiguous; this is UB and does not work on NVCC. See the tests for the failure case.
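A stripped-down illustration of the clash, simplified from torch/library.h (not the actual torch code):
```
namespace torch {
template <class T>
struct class_ {};  // the class template

struct Library {
  // A member function with the same name as the class template it
  // returns. Unqualified `class_` inside Library now hides the type,
  // and NVCC rejects even the qualified spelling as ambiguous.
  template <class T>
  torch::class_<T> class_(const char* name);
};
}  // namespace torch
```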
Thanks for the help of Thibaut Lutz from NVIDIA's compiler team.
cc: yueyericardo ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57962
Reviewed By: mruberry
Differential Revision: D28359400
Pulled By: ezyang
fbshipit-source-id: c64ec89203f99f656611aba34f7424eed7bc9e7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53143
Meta is now an honest-to-goodness device type, like cpu, so you can use
device='meta' to trigger allocation of meta tensors. This is way better
than empty_meta, since we now have a working API for most factory functions
(they don't necessarily work yet, though, because we still need to register
Meta versions of those functions.)
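A quick sketch of the factory path (the Python equivalent is `torch.empty(2, 3, device='meta')`):
```
#include <ATen/ATen.h>

// Sizes, strides, and dtype exist, but no real storage is allocated.
at::Tensor t =
    at::empty({2, 3}, at::TensorOptions().device(c10::DeviceType::Meta));
// t.device() reports the meta device; t.sizes() == {2, 3}.
```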
Some subtleties:
- I decided to drop the concept of CPU versus CUDA meta tensors; meta
tensors are device agnostic. It's hard to say exactly what the
correct level of abstraction here is, but in this particular case
implementation considerations trump semantic considerations: it
is way easier to have just a meta device, than to have a meta device
AND a cpu device AND a cuda device. This may limit the applicability
of meta tensors for tracing models that do explicit cpu()/cuda()
conversions (unless, perhaps, we make those operations no-ops on meta
tensors).
- I noticed that the DeviceType uppercase strings are kind of weird.
Are they really supposed to be all caps?
- I moved the Meta dispatch key to live with the rest of the "device"
dispatch keys.
- I intentionally did NOT add a Backend for Meta. For now, I'm going to
hope meta tensors never exercise any of the Backend conversion code;
even if they do, it's better to fix the code to just stop converting to
and from Backend.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: samestep
Differential Revision: D26763552
Pulled By: ezyang
fbshipit-source-id: 14633b6ca738e60b921db66a763155d01795480d
Summary:
Apple recently announced ML Compute, a new framework available in macOS Big Sur that enables users to accelerate the training of neural networks on Mac hardware. This PR is the first in a series of PRs that will enable the integration with ML Compute. Most of the integration code will live in a separate subrepo named `mlc`.
The integration with `mlc` (ML Compute) will be very similar to that of xla. We rely on registering our ops through:
```
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl_UNBOXED(<op_schema_name>, &customized_op_kernel)
  ...
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50634
Reviewed By: malfet
Differential Revision: D26614213
Pulled By: smessmer
fbshipit-source-id: 3b492b346c61cc3950ac880ac01a82fbdddbc07b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52150
Renames "whitelist" to "allowlist" to conform to company use standards, prevent critical errors raised by linters which detect the old usage, and to move toward more self-descriptive terminology.
Test Plan: Sandcastle
Reviewed By: suo
Differential Revision: D26405520
fbshipit-source-id: 9c3a41591d4e29c0197de9a8f5858c9c29271e26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51315
The TODOs said to remove this wrapper, and it seems that it can be removed easily.
ghstack-source-id: 121363465
Test Plan: CI
Reviewed By: ezyang
Differential Revision: D26137147
fbshipit-source-id: f1e5971dca071f37400d77cc823214527e4231bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50457
The code to infer a function schema from a C++ function relies on templates and code expansion, which costs valuable binary size. We can avoid inferring the schema from the C++ function type (arguments, name, return value) when the function implementation is being added to the dispatcher via `m.impl`; in this case, it is assumed that a schema is registered already. Adding an implementation via `m.def` still triggers schema inference.
In addition, we don't do schema checks on mobile, so the schema is not needed there in the first place.
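A sketch of the two paths (the op and kernel are hypothetical): schema inference can be skipped for `m.impl` because `m.def` has already registered the schema.
```
#include <torch/library.h>

at::Tensor my_op_cpu(const at::Tensor& x);  // hypothetical kernel

TORCH_LIBRARY(myns, m) {
  m.def("my_op(Tensor x) -> Tensor");  // the schema is registered here
}

TORCH_LIBRARY_IMPL(myns, CPU, m) {
  // impl() can assume the schema above exists, so inferring one from
  // my_op_cpu's C++ type is unnecessary and is now skipped.
  m.impl("my_op", my_op_cpu);
}
```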
ghstack-source-id: 120915259
Test Plan:
Auto-unit tests succeed.
### Size test: igios
```
D25853094-V1 (https://www.internalfb.com/intern/diff/D25853094/?dest_number=119632217)
igios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -21.8 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -45.5 KiB
Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:261049318687117@base/bsb:261049318687117@diff/
```
### Size test: fbios
```
D25853094-V1 (https://www.internalfb.com/intern/diff/D25853094/?dest_number=119632217)
fbios: Succeeded
Change in Download Size for arm64 + 3x assets variation: -27.2 KiB
Change in Uncompressed Size for arm64 + 3x assets variation: -80.1 KiB
Mbex Comparison: https://our.intern.facebook.com/intern/mbex/bsb:454289062251865@base/bsb:454289062251865@diff/
```
Reviewed By: smessmer
Differential Revision: D25853094
fbshipit-source-id: e138d9dff7561d424bfb732f3a5898466f018f60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49220
Since all ops are c10-full, we can remove .impl_UNBOXED now.
This also removes the ability of KernelFunction or CppFunction to store unboxedOnly kernels.
ghstack-source-id: 119450489
Test Plan: waitforsandcastle
Reviewed By: ezyang
Differential Revision: D25490225
fbshipit-source-id: 32de9d591e6a842fe18abc82541580647e9cfdad
Summary:
Since caffe2 and torch have been consolidated, CAFFE2_API should be merged with TORCH_API. Addresses a TODO.
Manually edited some references of the removed `CAFFE2_API`:
* `CONTRIBUTING.md`
* `caffe2/proto/CMakeLists.txt`
* `cmake/ProtoBuf.cmake`
* `c10/macros/Export.h`
* `torch/csrc/WindowsTorchApiMacro.h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49496
Reviewed By: malfet, samestep
Differential Revision: D25600726
Pulled By: janeyx99
fbshipit-source-id: 7e068d959e397ac183c097d7e9a9afeca5ddd782
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48308
The original regex that I added didn't correctly match namespaces that started with an underscore (e.g. `_test`), which caused a master-only test to fail.
The only change from the previous commit is that I updated the regex like so:
before: `^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$`
after: `^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$`
I added a `[_]*` at the beginning of the namespace capture, and did the same for the `_FRAGMENT` regex.
Verified that running `ANALYZE_TEST=1 tools/code_analyzer/build.sh` (as the master-only test does) produces no diff in the output.
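A quick sanity check of the new pattern against an underscore-prefixed namespace (the symbol below is illustrative, not a real mangled name):
```
#include <iostream>
#include <regex>
#include <string>

int main() {
  std::regex pat(
      R"(^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$)");
  std::smatch m;
  std::string sym = "TORCH_LIBRARY_IMPL_init__test_CPU_1";
  if (std::regex_match(sym, m, pat)) {
    std::cout << m[1] << " " << m[2] << "\n";  // prints: _test CPU
  }
  return 0;
}
```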
This reverts commit 3c936ecd3c.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D25123295
Pulled By: bdhirsh
fbshipit-source-id: 54bd1e3f0c8e28145e736142ad62a18806bb9672
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47322
Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API.
I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call sites by running `fbgs RegisterOperators()`. This touched several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out:
For simple ops that only had one registered kernel without a dispatch key, I replaced them with:
```
TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName", fn_name);
}
```
For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block.
```
// cpu file
TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName(schema_inputs) -> schema_outputs");
  m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel)));
}

// cuda file
TORCH_LIBRARY_IMPL(ns, CUDA, m) {
  m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel)));
}
```
Special cases:
I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. I updated those to use CPU/CUDA; this seems safe because the keys are aliased to one another in `DispatchKey.h`.
There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell, we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class, I wrote a wrapper function on top of it, which I passed to the new API (see the sketch after this list).
There were a handful of ops that were registered only to a CUDA dispatch key. I put them inside a TORCH_LIBRARY_FRAGMENT block, and used a `def()` and `impl()` call like in case two above.
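A sketch of that functor-wrapping pattern (all names hypothetical):
```
// The legacy functor class is kept as-is; a thin free function forwards
// to it, and that function is what gets registered with the new API.
struct LegacyKernel {
  at::Tensor operator()(const at::Tensor& x) const { return x.clone(); }
};

at::Tensor legacy_kernel_wrapper(const at::Tensor& x) {
  static const LegacyKernel kernel;
  return kernel(x);
}

TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName(Tensor x) -> Tensor");
  m.impl("opName", TORCH_FN(legacy_kernel_wrapper));
}
```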
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D24714803
Pulled By: bdhirsh
fbshipit-source-id: c809aad8a698db3fd0d832f117f833e997b159e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42845
- In server builds, always allow the compiler to inline the kernel into the unboxing wrapper, i.e. optimize for perf.
- In mobile builds, never inline the kernel into the unboxing wrapper, i.e. optimize for binary size.
Note that this only applies to registration API calls where we can actually inline the kernel, i.e. calls with `TORCH_FN` or some of the old API calls.
Registrations that give the registration API a runtime function pointer can't inline it, and won't do so on server either.
Note also that in server builds, all we do is **allow** the compiler to inline. We don't force inlining.
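Concretely (a sketch; `my_kernel` is hypothetical): `TORCH_FN` passes the kernel as a compile-time template argument, which is what makes inlining into the unboxing wrapper possible at all.
```
at::Tensor my_kernel(const at::Tensor& x);  // hypothetical

TORCH_LIBRARY_IMPL(myns, CPU, m) {
  // Compile-time function reference: the unboxing wrapper may inline it
  // (and on server builds the compiler is allowed to).
  m.impl("my_op", TORCH_FN(my_kernel));
  // By contrast, a runtime function pointer can never be inlined:
  // m.impl("my_op", &my_kernel);
}
```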
ghstack-source-id: 114177591
Test Plan:
waitforsandcastle
https://www.internalfb.com/intern/fblearner/details/225217260/
Reviewed By: ezyang
Differential Revision: D23045772
fbshipit-source-id: f74fd600eaa3f5cfdf0da47ea080801a03db7917
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43985
Added
```
def(detail::SelectiveStr<true>, ...)
impl(detail::SelectiveStr<true>, ...)
```
in torch/library, which can also be used for other templated selective registration.
Size savings for this diff:
fbios-pika: 78 KB
igios: 87 KB
Test Plan: Imported from OSS
Reviewed By: ljk53, smessmer
Differential Revision: D23459774
Pulled By: iseeyuan
fbshipit-source-id: 86d34cfe8e3f852602f203db06f23fa99af2c018
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39401
This uses the technique proposed by smessmer in D16451848 to selectively
register operators without codegen. See the Note inside for more
details.
This PR has feature parity with the old selective build apparatus:
it can whitelist schema def()s and impl()s, including on a
per-dispatch-key basis. It has expanded dispatch key whitelisting,
whereas previously manually written registrations were not whitelisted
at all. (This means we may be dropping dispatch keys where we weren't
previously!)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: pbelevich
Differential Revision: D21905593
Pulled By: ezyang
fbshipit-source-id: d4870f800c66be5ce57ec173c9b6e14a52c4a48b