This PR adds support for `SymInt`s in python. Namely,
* `THPVariable_size` now returns `sym_sizes()`
* python arg parser is modified to parse PyObjects into ints and `SymbolicIntNode`s
* pybind11 bindings for `SymbolicIntNode` are added, so size expressions can be traced (a rough sketch of such a binding is below)
* a large number of tests are added to demonstrate how to implement python symints.
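For context, here is a minimal, hedged pybind11 sketch of how a symbolic-int node could be exposed so that size arithmetic is recorded instead of evaluated eagerly. The `SymNode` type, its `add` method, and the module name are illustrative stand-ins, not the actual PyTorch `SymbolicIntNode` bindings:

```cpp
// Hedged sketch (not the actual PyTorch binding): a C++ symbolic-int node
// exposed to Python with pybind11 so size arithmetic builds an expression
// instead of computing a concrete value.
#include <pybind11/pybind11.h>
#include <memory>
#include <string>

namespace py = pybind11;

struct SymNode {
  explicit SymNode(std::string expr) : expr_(std::move(expr)) {}
  // Record "lhs + rhs" as a new node rather than evaluating it.
  std::shared_ptr<SymNode> add(const std::shared_ptr<SymNode>& other) const {
    return std::make_shared<SymNode>("(" + expr_ + " + " + other->expr_ + ")");
  }
  std::string str() const { return expr_; }
  std::string expr_;
};

PYBIND11_MODULE(symint_sketch, m) {
  py::class_<SymNode, std::shared_ptr<SymNode>>(m, "SymNode")
      .def(py::init<std::string>())
      .def("__add__", &SymNode::add)  // x.size(0) + 1 extends the expression
      .def("__str__", &SymNode::str);
}
```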
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78135
Approved by: https://github.com/ezyang
Change our representation of sizes and strides to contain SymInts
instead of int64_t.
Right now it's not actually possible to create a Tensor with symbolic
shape, so this change is intended to be a no-op.
But the intended behavior is:
- If you create a Tensor with symbolic shape, a `CustomSizes` policy
will be set, and the `has_symbolic_sizes_strides_` bit will be set. (not
currently implemented)
- Calling any TensorImpl function that naively interacts with sizes and
strides will throw. For hot-path functions (`sizes()`, `strides()`), we
make use of the existing policy check to throw. For others, we just have
a regular `TORCH_CHECK(!has_symbolic_sizes_strides_)`.
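A standalone sketch of that guard pattern, assuming a simplified impl type (`ImplSketch` and its `size()` accessor are illustrative, not the real TensorImpl API); the real code uses `TORCH_CHECK` rather than a plain throw:

```cpp
// Sketch of the intended guard: non-hot-path accessors get a plain
// "no symbolic sizes/strides" check before touching the int64_t fields;
// hot paths (sizes()/strides()) reuse the existing policy dispatch instead.
#include <cstdint>
#include <stdexcept>
#include <vector>

struct ImplSketch {
  bool has_symbolic_sizes_strides_ = false;
  std::vector<int64_t> sizes_;

  // Non-hot-path function: analogous to
  // TORCH_CHECK(!has_symbolic_sizes_strides_, ...).
  int64_t size(int64_t dim) const {
    if (has_symbolic_sizes_strides_) {
      throw std::runtime_error("size(): tensor has symbolic sizes/strides");
    }
    return sizes_.at(static_cast<size_t>(dim));
  }
};
```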
This also undoes the explicit constructor I made in
https://github.com/pytorch/pytorch/pull/77666; it ended up being more
annoying than useful when making these changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78272
Approved by: https://github.com/Krovatkin, https://github.com/Chillee
Change our representation of sizes and strides to contain SymInts
instead of int64_t.
Right now it's not actually possible to create a Tensor with symbolic
shape, so this change is intended to be a no-op.
But the intended behavior is:
- If you create a Tensor with symbolic shape, a `CustomSizes` policy
will be set, and the `has_symbolic_sizes_strides_` bit will be set. (not
currently implemented)
- Calling any TensorImpl function that naively interacts with sizes and
strides will throw. For hot-path functions (`sizes()`, `strides()`), we
make use of the existing policy check to throw. For others, we just have
a regular `TORCH_CHECK(!has_symbolic_sizes_strides_)`.
This also undoes the explicit constructor I made in
https://github.com/pytorch/pytorch/pull/77666; it ended up being more
annoying than useful when making these changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77994
Approved by: https://github.com/Krovatkin
Prior to this PR, we had a mish-mash of ways of getting unconventional
sizes/strides behavior:
- In OSS (but not in fbcode), some methods are virtual and you
can override them directly
- There is an is_contiguous policy which is a bitfield tag that lets
you toggle is_contiguous to error or hit a virtual method
is_contiguous_custom if it is set. Ordinarily is_contiguous()
is virtual and you can just override it, but this works EVEN IF
is_contiguous() is non-virtual (e.g., in fbcode)
- There is also a sizes policy which is the same idea but for sizes
This PR unifies these mechanisms, and in doing so, eliminates the
maybe virtual/not-virtualness of the methods in question. The primary
downside of this change is that it is BC-breaking (but the BC break is
very easy to fix!)
The new scheme works like this: we have three levels of policy for
sizes/strides (order matters).
- The Default policy is a conventional dense tensor, where we use
all of the built-in fields to directly represent the
sizes/strides/numel/contiguity of the tensor, and it is possible
to bypass virtual call entirely.
- The CustomStrides policy represents tensors which have a custom
notion of strides (most typically, that they don't support them),
shunting strides() and is_contiguous() to virtual methods
strides_custom() and is_contiguous_custom(). This INCLUDES handling
for contiguity, since they typically go hand-in-hand (although
the situation is murky with batched tensors). The default
implementations of these functions raise errors saying the tensor
doesn't support them.
- The CustomSizes policy represents tensors which have a custom
notion of sizes (the two notable examples are nested tensor, which
doesn't have a representation of sizes in the conventional form, and
XLA/LTC tensor, which synchronizes its sizes with an underlying
compiler backend). This shunts sizes(), numel() and dim() (along
with everything from strides) to _custom() variants.
There is no special policy for erroring; instead, we just do a vcall
and expect the virtual method to raise an exception (the performance
hit from the vcall doesn't matter because you're about to raise a C++
exception anyway). The default implementations of all overridable
functions are available at _default() which is helpful in some
situations when you just want to do a "sync" and then run the
conventional semantics.
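To make the layering concrete, here is a hedged, self-contained sketch of the idea; the enum values, `PolicyImplSketch`, and method names are illustrative rather than the exact TensorImpl API:

```cpp
// Minimal sketch of the three-level policy. The policy values are ordered so a
// single comparison decides whether the fast inline path is safe or a virtual
// *_custom() hook must be taken.
#include <cstdint>
#include <stdexcept>
#include <vector>

enum class SizesStridesPolicy : uint8_t {
  Default = 0,        // dense tensor: read the inline fields directly
  CustomStrides = 1,  // strides()/is_contiguous() go through virtual hooks
  CustomSizes = 2,    // sizes()/numel()/dim() (and strides) go through hooks
};

struct PolicyImplSketch {
  std::vector<int64_t> sizes_;
  SizesStridesPolicy policy_ = SizesStridesPolicy::Default;

  const std::vector<int64_t>& sizes() const {
    if (policy_ >= SizesStridesPolicy::CustomSizes) {
      return sizes_custom();  // vcall only for unconventional tensors
    }
    return sizes_default();
  }
  // _default() stays callable for subclasses that "sync" state and then fall
  // back to the conventional behavior.
  const std::vector<int64_t>& sizes_default() const { return sizes_; }
  // No dedicated "error" policy: the default _custom() simply throws, and the
  // vcall cost doesn't matter when an exception is about to be raised anyway.
  virtual const std::vector<int64_t>& sizes_custom() const {
    throw std::runtime_error("this tensor does not support sizes()");
  }
  virtual ~PolicyImplSketch() = default;
};
```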
This PR could be extended further in two ways but I did not do them
due to time constraints:
- Ideally, all TENSORIMPL_MAYBE_VIRTUAL would be eliminated from
TensorImpl, by using the same policy trick.
- set_size and set_stride are still virtual; it's not entirely clear
the same trick should be used here though as these methods are
deprecated.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77036
Approved by: https://github.com/bdhirsh
Whether or not this is a reasonable operation to do in the presence of
subclasses is a good question in and of itself, but this fixes an
obvious invariant violation, which is that if a Tensor reports that
it is a tensor subclass, it had better have the Python dispatch key.
Previously, the dispatch key would have gotten unconditionally cleared;
now we preserve whatever the original bit was.
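As a toy illustration of the invariant (bit positions and names are made up; this is not the real DispatchKeySet code), the fix amounts to carrying the source's Python bit over instead of clearing it:

```cpp
// Illustration only: keep "is a tensor subclass" and "has the Python dispatch
// key" in agreement by preserving the source's Python bit during the copy.
#include <bitset>
#include <cassert>

int main() {
  constexpr size_t kPythonBit = 0;  // position chosen for this example only
  std::bitset<64> dest("0110");     // destination keys, Python bit currently 0
  std::bitset<64> src("0011");      // source keys, Python bit set

  // Previously the bit was unconditionally cleared; now carry it over.
  dest[kPythonBit] = src[kPythonBit];
  assert(dest[kPythonBit] == src[kPythonBit]);
  return 0;
}
```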
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75644
Approved by: https://github.com/albanD
The pattern of a PyObject* bundled with a PyInterpreter* is pretty
useful in many contexts (e.g., TorchDispatchTypeObject) so I have turned
it into a dedicated class SafePyObject. In the process I fixed a
bug with the old TorchDispatchTypeObject (copy constructor/assignment
was not deleted), made the API safer (retrieving the PyObject*
pointer requires verification that the PyInterpreter* matches) and
fixed some minor inefficiencies in C++ code.
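A hedged sketch of the pattern (simplified; the interpreter type, class name, and `ptr()` check are illustrative, not the exact SafePyObject API):

```cpp
// Sketch: a PyObject* bundled with the interpreter it belongs to,
// non-copyable, and only handed back after an interpreter check.
#include <Python.h>
#include <stdexcept>

struct Interpreter;  // stand-in for the real PyInterpreter

class SafePyObjectSketch {
 public:
  SafePyObjectSketch(PyObject* data, Interpreter* interp)
      : data_(data), interp_(interp) {}
  // Copying would double-manage the refcount, so forbid it (the bug in the old
  // TorchDispatchTypeObject was exactly a non-deleted copy ctor/assignment).
  SafePyObjectSketch(const SafePyObjectSketch&) = delete;
  SafePyObjectSketch& operator=(const SafePyObjectSketch&) = delete;
  ~SafePyObjectSketch() { Py_XDECREF(data_); }

  // Callers must prove they are on the right interpreter to get the pointer.
  PyObject* ptr(const Interpreter* interp) const {
    if (interp != interp_) {
      throw std::runtime_error("accessed PyObject from the wrong interpreter");
    }
    return data_;
  }

 private:
  PyObject* data_;
  Interpreter* interp_;
};
```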
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75142
Approved by: https://github.com/zou3519
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963
This is a re-land of D35192346 (9872a06d77) and D35192317 (a9216cde6c), which together are a diff that changes the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.
The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.
**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here cc23725e89/c10/core/DispatchKey.h (L382) and here cc23725e89/aten/src/ATen/core/dispatch/OperatorEntry.h (L214)), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: cc23725e89/aten/src/ATen/core/dispatch/Dispatcher.h (L294).
The mobile optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally, vs. a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
**The Bug**
This PR actually made it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and it incidentally shrank the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
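A standalone illustration of that failure mode under made-up sizes and names (the table here is just a plain array in this sketch, not the real `backendFallbackKernels_` member):

```cpp
// Sketch: the fallback table was sized with the shrunk mobile key count, but
// fallback registration still indexed it with the full (unshrunk) key value.
#include <array>
#include <cstddef>
#include <iostream>

using KernelFn = void (*)();
constexpr size_t kFullNumKeys = 64;   // full (non-mobile) key numbering
constexpr size_t kMobileNumKeys = 8;  // shrunk mobile table size
static_assert(kMobileNumKeys < kFullNumKeys, "mobile table is the smaller one");

// After the change, the fallback table ended up sized for mobile as well...
std::array<KernelFn, kMobileNumKeys> fallbackTable{};

void registerFallback(size_t fullKeyIndex, KernelFn fn) {
  // ...but startup still registers fallbacks using indices from the full
  // 64-key numbering, so any index >= 8 writes past the end of the array.
  if (fullKeyIndex >= fallbackTable.size()) {
    std::cerr << "would write out of bounds at index " << fullKeyIndex << "\n";
    return;  // the real code had no such guard, hence the memory corruption
  }
  fallbackTable[fullKeyIndex] = fn;
}
```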
**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**
Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?
**The debugging experience was pretty difficult**
Debugging the Milan-specific failure was made difficult by the following:
(1) lack of CI
- the original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.
(2) It's difficult to get a repro.
- my work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space)
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)
(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542
Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.
Reviewed By: phding, albanD
Differential Revision: D35222806
fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74858
Original commit changeset: c7695e16dba3
Original Phabricator Diff: D34227615 (c0491c9179)
The Milan Assistant is crashing, see P489979944
After backing out the changes D34227615 (c0491c9179) and D34227616 (2cbddc0e9b), it works fine now
ghstack-source-id: 152380988
(Note: this ignores all push blocking failures!)
Test Plan:
Test on Milan with "get weather utterance"
buck build fbsource//fbandroid/mode/opt fbsource//fbandroid/mode/milan_build_rdk //fbandroid/apps/wearable/system/speechservice:speechservice_target30_xhdpi_armv7_release_debug_keystore -c pt.has_backtaces=1
Reviewed By: phding
Differential Revision: D35192317
fbshipit-source-id: e38081810a569b45ca037e019ec1c8773971534d
(cherry picked from commit 78833ac6997fbc8e20bd0f3ee0e0fe55a075054c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73817
NestedTensorImpl doesn't have sizes(). Silently getting wrong results back from it is not conducive to efficient software development. Make it throw while allowing sizes() to be inlined in the common case anyway, just like is_contiguous(). Thanks ezyang for the reminder that we could do this.
ghstack-source-id: 151302903
Test Plan: Updated test_nestedtensor.py
Reviewed By: ezyang
Differential Revision: D34660829
fbshipit-source-id: 1289f21127d6a8359893f9174f3c430a290f2c7f
(cherry picked from commit 7098b9fcfbd25a03bac19e1148426ff073810edd)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/72623, which was reverted; for the reland, the tls cleanup was removed.
From close inspection of how the number of available keys is counted, I think there is one more than expected, since the guard is actually one past the last usable key. With this updated assert, the last usable key will still be <= 63, which will fit just fine.
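A small sketch of that counting argument (the enumerator names are illustrative, not the actual DispatchKey enum):

```cpp
// Sketch: the guard enumerator sits one past the last usable key, so asserting
// guard <= 64 still guarantees every usable key index is <= 63 and fits in a
// 64-bit key set.
#include <cstdint>

enum class Key : uint8_t {
  CPU = 0,
  // ... more keys ...
  LastUsableKey = 63,
  NumDispatchKeys,  // guard value: one past the last usable key, i.e. 64
};

static_assert(static_cast<uint8_t>(Key::NumDispatchKeys) <= 64,
              "every usable key index is <= 63, so it fits in a 64-bit mask");
```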
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72832
Reviewed By: H-Huang
Differential Revision: D34228571
Pulled By: albanD
fbshipit-source-id: ce5e10a841ea87386727346cfc8d9327252574c4
(cherry picked from commit 59d3b86353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72402
The original PR had an array-out-of-bounds access in `DispatchKeyExtractor.cpp` that wasn't caught by ASAN and appeared to manifest only in a subset of android internal tests. After fixing the OOB access (and adding more asserts), I confirmed that the android internal test passes.
Reland of D33255193 (20b8653dfa)
ghstack-source-id: 148830728
Test Plan:
Steps to test:
(1) connect to a mobile OD
(2) run `one_world android emulator android-29` in a terminal to start the android emulator
(3) In a separate terminal, run the test: `buck test //fbandroid/instrumentation_tests/com/facebook/pytorch/bi_xray:instrumentation_test -c test.external_runner=tpx -- --regex 'testBIXRayModel.*PyTorchBIXRayInstrumentationTest' --force-remote-execution --run-disabled`
I also ran `buck test fbandroid/mode/dbg //fbandroid/instrumentation_tests/com/facebook/pytorch/bi_xray:instrumentation_test`, which failed before and passed after the PR.
Reviewed By: albanD
Differential Revision: D34034848
fbshipit-source-id: 9677ee2c0a1afd1183896f7055009445712523c5
(cherry picked from commit 9ab9b12d35)
Summary: I think this diff stack broke all the related tasks below.
Test Plan:
For our failing tests:
buck test //fbandroid/instrumentation_tests/com/facebook/pytorch/bi_xray:instrumentation_test -c test.external_runner=tpx -- --regex 'testBIXRayModel.*PyTorchBIXRayInstrumentationTest' --force-remote-execution --run-disabled
For the ubn:
Not really sure what to do, trying to build the app and see if I can use an effect?
Reviewed By: shoumikhin
Differential Revision: D34018849
fbshipit-source-id: 3571718cb6621931af931b494e0a70d6e0164e65
(cherry picked from commit 3cc63cb2ea)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70364
A bunch of optimizations I made while staring at callgrind, after the DispatchKeySet changes further down in this stack.
There are basically three optimizations in this PR:
- Making `DispatchKeySet`s constexpr (where previously they weren't)
- Condensing multiple keyset membership calls into a single function call (sketched below)
- Making `TensorImpl::layout()` a fastpath. The common case is to return `kStrided`, but we were doing a bunch of checks before returning it in most cases.
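As a hedged illustration of the first two points (the bit layout, key names, and `KeySetSketch` type are made up, not the real `DispatchKeySet`):

```cpp
// Sketch: instead of asking has(A) || has(B) || has(C) with three calls,
// intersect once with a constexpr union of the keys of interest.
#include <cstdint>

struct KeySetSketch {
  uint64_t repr_ = 0;
  constexpr KeySetSketch() = default;
  constexpr explicit KeySetSketch(uint64_t repr) : repr_(repr) {}
  constexpr KeySetSketch operator|(KeySetSketch other) const {
    return KeySetSketch(repr_ | other.repr_);
  }
  constexpr bool has_any(KeySetSketch ks) const {
    return (repr_ & ks.repr_) != 0;
  }
};

// constexpr key sets, built entirely at compile time (one of the optimizations).
constexpr KeySetSketch kAutograd(1ull << 10);
constexpr KeySetSketch kSparse(1ull << 5);
constexpr KeySetSketch kQuantized(1ull << 6);
constexpr KeySetSketch kNeedsSpecialHandling = kAutograd | kSparse | kQuantized;

bool needsSpecialHandling(KeySetSketch tensorKeys) {
  // Single membership test instead of three separate has() calls.
  return tensorKeys.has_any(kNeedsSpecialHandling);
}
```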
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D33301590
Pulled By: bdhirsh
fbshipit-source-id: 6ec28e66e7fe21f9decae317e8a4013dcf44e2fb
(cherry picked from commit 5defa1676e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68953
This PR consolidates the almost identical lvalue and rvalue implementations of shallow_copy_and_detach into a single templated function.
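A hedged sketch of that consolidation with simplified types (`TensorImplSketch`, `VariableVersionSketch`, and the `_core` helper name are illustrative):

```cpp
// Sketch: a forwarding reference lets one implementation serve both the
// const-lvalue overload (copies) and the rvalue overload (moves).
#include <memory>
#include <utility>

struct VariableVersionSketch {
  int version = 0;
};

struct TensorImplSketch {
  VariableVersionSketch version_counter_;

  template <typename VariableVersionT>
  std::unique_ptr<TensorImplSketch> shallow_copy_and_detach_core(
      VariableVersionT&& version_counter) const {
    auto copy = std::make_unique<TensorImplSketch>(*this);
    // Forwarding preserves the caller's value category: copy an lvalue,
    // move an rvalue.
    copy->version_counter_ = std::forward<VariableVersionT>(version_counter);
    return copy;
  }

  // The two public overloads become thin wrappers around the template.
  std::unique_ptr<TensorImplSketch> shallow_copy_and_detach(
      const VariableVersionSketch& v) const {
    return shallow_copy_and_detach_core(v);
  }
  std::unique_ptr<TensorImplSketch> shallow_copy_and_detach(
      VariableVersionSketch&& v) const {
    return shallow_copy_and_detach_core(std::move(v));
  }
};
```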
ghstack-source-id: 147238376
Test Plan: Run existing unit tests.
Reviewed By: fduwjj
Differential Revision: D32679741
fbshipit-source-id: 89a870335d2e09ffd005c943733a787d20d352f9
(cherry picked from commit 750344c860)