Meta tensor does a lot of work to make sure tensors "look" similar
to the original parts; e.g., if the original was a non-leaf, meta
converter ensures the meta tensor is a non-leaf too. Fake tensor
destroyed some of these properties when it wraps it in a FakeTensor.
This patch pushes the FakeTensor constructor into the meta converter
itself, so that we first create a fake tensor, and then we do various
convertibility bits to it to make it look right.
The two tricky bits:
- We need to have no_dispatch enabled when we allocate the initial meta
tensor, or fake tensor gets mad at us for making a meta fake tensor.
This necessitates the double-callback structure of the callback
arguments: the meta construction happens *inside* the function so
it is covered by no_dispatch
- I can't store tensors for the storages anymore, as that will result
in a leak. But we have untyped storage now, so I just store untyped
storages instead.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87943
Approved by: https://github.com/eellison, https://github.com/albanD
python decomp for `native_group_norm` is correct in more cases than the C++ composite. Updating the tests to fail properly in this case was more annoying than just fixing the C++ decomp, so I fixed it here.
When the input tensor had a dtype with less precision than float32, the C++ decomp would unconditionally set the mean/variance to float32, which was wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86607
Approved by: https://github.com/albanD
Follow up:
- ~Remove non-float dtypes from allow-list for gradients~
- ~Map dtypes to short-hand so there aren't so many lines, i.e. float16 should be f16.~
- ~There were a lot of linting issues that flake8 wouldn't format for me, so I reformatted with black. This makes the diff a little trickier to parse.~
Observations:
- there are entries in the allow-list that weren't there before
- some forward that we previously passing now fail with requires_grad=True
- Because the allow list does not know about variants, a special skip was added for that in the block list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84242
Approved by: https://github.com/kulinseth, https://github.com/malfet
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"
Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9
Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159
Also update Metal registrations for C++ registration changes.
Also update NNPI registration to account for tightened schema checking
Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.
This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.
This is BC-breaking in the following ways:
* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.
This is not BC-breaking in the following ways:
* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints. (e.g., at::empty(IntArrayRef, ...). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.
Structure of the PR:
* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
* The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
* When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
* In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
* In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
Previously, we introduced new SymInt overloads for every function we wanted. This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.
This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.
This is BC-breaking in the following ways:
* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change. Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually. This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.
This is not BC-breaking in the following ways:
* The user facing C++ API remains compatible. Even if a function changes from int to SymInt, the default C++ binding still takes only ints. (e.g., at::empty(IntArrayRef, ...). To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.
Structure of the PR:
* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
* The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
* When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
* In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
* In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
Ref: #69991
Make `allclose` CompositeExplicit as it calls `item` (we can't get away from it) which makes it non Composite Compliant.
`linalg_eig` backward passes CompositeCompliance as it calls on `allclose`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82437
Approved by: https://github.com/zou3519
Implements linspace with arange, and logspace with linspace.
- Implements a more precise path in linspace's ref when dtype is integral to avoid off-by-one issues when output of computation is casted to int. The trade off is that there's an increased chance of overflow.
- Files several issues #82242, #82230, #81996, on preexisting issues with the linspace and logspace. These mainly concern when dtype is integral - the affect tests are xfailed in this PR.
- Fixes the check that the reference implementation is closer to precise implementation than torch implementation to also update the dtype kwarg to the precise dtype.
TODO:
- ~support negative bases~ (not in this PR)
- ~support complex. Since arange does not support complex, but linspace does, one solution is to just call linspace separately on the real and imag components and sum the results in the end~ (not in this PR)
- ~default dtypes need to be explicitly handled since computation is done in a different dtype than result~ (done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81826
Approved by: https://github.com/ngimel
Fixes#79512
This PR adds support for convolutional meta modules and computes the output shape correctly for some meta input tensor.
Currently in progress, no tests written so far.
**Feature implementations**:
- [x] `Conv1d`
- [x] `Conv2d`
- [x] `Conv3d`
**Tests**:
- [x] `Conv1d`
- [x] `Conv2d`
- [x] `Conv3d`
cc @albanD @anjali411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79834
Approved by: https://github.com/ezyang, https://github.com/albanD
Currently we have 2 ways of doing the same thing for torch dispatch and function modes:
`with push_torch_dispatch_mode(X)` or `with X.push(...)`
is now the equivalent of doing
`with X()`
This removes the first API (which is older and private so we don't need to go through a deprecation cycle)
There is some risk here that this might land race with a PR that uses the old API but in general it seems like most are using the `with X()` API or `enable_torch_dispatch_mode(X())` which isn't getting removed.
EDIT: left the `with X.push(...)` API since there were ~3 land races with that over the past day or so. But made it give a warning and ask users to use the other API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78215
Approved by: https://github.com/ezyang
Context: For a while slow gradcheck CI was skipping nearly all tests and this hid the fact that it should've been failing and timing out (10+h runtime for TestGradients). The CI configuration has since been fixed to correct this, revealing the test failures. This PR reenables slow gradcheck CI and makes it pass again.
This PR:
- makes slow and failing tests run in fast gradcheck mode only
- reduce the input size for slow gradcheck only for unary/binary ufuncs (alternatively, skip the test entirely)
- skip entire test files on slow gradcheck runner if they don't use gradcheck (test_ops, test_meta, test_decomp, test_ops_jit)
- reduces the input size for some ops
Follow ups:
1. Investigate slow mode failures https://github.com/pytorch/pytorch/issues/80411
2. See if we can re-enable slow gradcheck tests for some of the slow tests by reducing the sizes of their inputs
The following are failing in slow mode, they are now running in fast mode only.
```
test_fn_fwgrad_bwgrad___rmod___cuda_float64
test_fn_fwgrad_bwgrad_linalg_householder_product_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_complex128
test_fn_fwgrad_bwgrad__masked_prod_cuda_float64
test_fn_fwgrad_bwgrad_linalg_matrix_power_cuda_complex128
test_fn_fwgrad_bwgrad_cat_cuda_complex128
test_fn_fwgrad_bwgrad_linalg_lu_factor_ex_cuda_float64
test_fn_fwgrad_bwgrad_copysign_cuda_float64
test_fn_fwgrad_bwgrad_cholesky_inverse_cuda_complex128
test_fn_fwgrad_bwgrad_float_power_cuda_complex128
test_fn_fwgrad_bwgrad_fmod_cuda_float64
test_fn_fwgrad_bwgrad_float_power_cuda_float64
test_fn_fwgrad_bwgrad_linalg_lu_cuda_float64
test_fn_fwgrad_bwgrad_remainder_cuda_float64
test_fn_fwgrad_bwgrad_repeat_cuda_complex128
test_fn_fwgrad_bwgrad_prod_cuda_complex128
test_fn_fwgrad_bwgrad_slice_scatter_cuda_float64
test_fn_fwgrad_bwgrad_tile_cuda_complex128
test_fn_fwgrad_bwgrad_pow_cuda_float64
test_fn_fwgrad_bwgrad_pow_cuda_complex128
test_fn_fwgrad_bwgrad_fft_*
test_fn_fwgrad_bwgrad_zero__cuda_complex128
test_fn_gradgrad_linalg_lu_factor_cuda_float64
test_fn_grad_div_trunc_rounding_cuda_float64
test_fn_grad_div_floor_rounding_cuda_float64
```
Marks the OpInfos for the following ops that run slowly in slow gradcheck as `fast_gradcheck` only (the left column represents runtime in seconds):
```
0 918.722 test_fn_fwgrad_bwgrad_nn_functional_conv_transpose3d_cuda_float64
1 795.042 test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_complex128
2 583.63 test_fn_fwgrad_bwgrad_nn_functional_max_pool3d_cuda_float64
3 516.946 test_fn_fwgrad_bwgrad_svd_cuda_complex128
4 503.179 test_fn_fwgrad_bwgrad_linalg_svd_cuda_complex128
5 460.985 test_fn_fwgrad_bwgrad_linalg_lu_cuda_complex128
6 401.04 test_fn_fwgrad_bwgrad_linalg_lstsq_grad_oriented_cuda_complex128
7 353.671 test_fn_fwgrad_bwgrad_nn_functional_max_pool2d_cuda_float64
8 321.903 test_fn_fwgrad_bwgrad_nn_functional_gaussian_nll_loss_cuda_float64
9 307.951 test_fn_fwgrad_bwgrad_stft_cuda_complex128
10 266.104 test_fn_fwgrad_bwgrad_svd_lowrank_cuda_float64
11 221.032 test_fn_fwgrad_bwgrad_istft_cuda_complex128
12 183.741 test_fn_fwgrad_bwgrad_lu_unpack_cuda_complex128
13 132.019 test_fn_fwgrad_bwgrad_nn_functional_unfold_cuda_float64
14 125.343 test_fn_fwgrad_bwgrad_nn_functional_pad_constant_cuda_complex128
15 124.2 test_fn_fwgrad_bwgrad_kron_cuda_complex128
16 123.721 test_fn_fwgrad_bwgrad_pca_lowrank_cuda_float64
17 121.074 test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
18 119.387 test_fn_fwgrad_bwgrad_rot90_cuda_complex128
19 112.889 test_fn_fwgrad_bwgrad__masked_normalize_cuda_complex128
20 107.541 test_fn_fwgrad_bwgrad_dist_cuda_complex128
21 106.727 test_fn_fwgrad_bwgrad_diff_cuda_complex128
22 104.588 test_fn_fwgrad_bwgrad__masked_cumprod_cuda_complex128
23 100.135 test_fn_fwgrad_bwgrad_nn_functional_feature_alpha_dropout_with_train_cuda_float64
24 88.359 test_fn_fwgrad_bwgrad_mH_cuda_complex128
25 86.214 test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
26 83.037 test_fn_fwgrad_bwgrad_nn_functional_bilinear_cuda_float64
27 79.987 test_fn_fwgrad_bwgrad__masked_cumsum_cuda_complex128
28 77.822 test_fn_fwgrad_bwgrad_diag_embed_cuda_complex128
29 76.256 test_fn_fwgrad_bwgrad_mT_cuda_complex128
30 74.039 test_fn_fwgrad_bwgrad_linalg_lu_solve_cuda_complex128
```
```
0 334.142 test_fn_fwgrad_bwgrad_unfold_cuda_complex128
1 312.791 test_fn_fwgrad_bwgrad_linalg_lu_factor_cuda_complex128
2 121.963 test_fn_fwgrad_bwgrad_nn_functional_max_unpool3d_cuda_float64
3 108.085 test_fn_fwgrad_bwgrad_diff_cuda_complex128
4 89.418 test_fn_fwgrad_bwgrad_nn_functional_max_unpool2d_cuda_float64
5 72.231 test_fn_fwgrad_bwgrad___rdiv___cuda_complex128
6 69.433 test_fn_fwgrad_bwgrad___getitem___cuda_complex128
7 68.582 test_fn_fwgrad_bwgrad_ldexp_cuda_complex128
8 68.572 test_fn_fwgrad_bwgrad_linalg_pinv_cuda_complex128
9 67.585 test_fn_fwgrad_bwgrad_nn_functional_glu_cuda_float64
10 66.567 test_fn_fwgrad_bwgrad_lu_cuda_float64
```
```
0 630.13 test_fn_gradgrad_nn_functional_conv2d_cuda_complex128
1 81.086 test_fn_gradgrad_linalg_solve_triangular_cuda_complex128
2 71.332 test_fn_gradgrad_norm_cuda_complex128
3 64.308 test_fn_gradgrad__masked_std_cuda_complex128
4 59.519 test_fn_gradgrad_div_no_rounding_mode_cuda_complex128
5 58.836 test_fn_gradgrad_nn_functional_adaptive_avg_pool3
```
Reduces the sizes of the inputs for:
- diff
- diag_embed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80514
Approved by: https://github.com/albanD
This PR also adds complex support for logdet, and makes all these
functions support out= and be composite depending on one function. We
also extend the support of `logdet` to complex numbers and improve the
docs of all these functions.
We also use `linalg_lu_factor_ex` in these functions, so we remove the
synchronisation present before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79742
Approved by: https://github.com/IvanYashchuk, https://github.com/albanD
Make `MetaConverter` and `FakeTensorConverter` hold weak references to their memoized tensors, and also have `MetaConverter` hold weak reference to Tensor storage. Otherwise it can be tricky for users to make sure all existing FakeTensors or FakeTensorModes are deleted which otherwise will leak memory.
I ran into https://github.com/pytorch/pytorch/issues/7733 which I was able to get around with the following (see comment from code):
```
# torch.Tensors cannot be used as a key in a dictionary
# because they define a custom __eq__ function which when used
# to resolve hash collisions will throw when comparing tensors:
# "RuntimeError: bool value of Tensor with more than one value is ambiguous."
# To avoid that, we use an object which will hold a Tensor and use
# its id for both hashing and equality.
# In order to use this as a weak key reference, we cannot
# simply use weakref.WeakKeyDictionary because the newly constructed
# WeakTensorRefKey only use would be a dictionary so it would have no strong
# references.
# To get around this issue, we can use it as a normal key, and then set
# `weakref.finalize` to delete the key when its contained tensor dies.
```
While for the tensor memo we can set a `weakref.finalize` callback that will remove the corresponding `WeakTensorRefKey` from the tensor memo, at the point that this callback is invoked the tensor storage is not yet deallocated.. See comment from code:
```
# [expired-storages]
# NB: even though the tensor has died,
# the deallocation of its storage can take longer,
# even when the storage has no other uses/views.
# In this case, the StorageWeakRef object will be kept alive
# longer than it needs to be, however the storage itself
# will be deallocated. We retain the possibly dead storages
# and periodically check if any of them are expired and
# can be freed.
```
partial fix for https://github.com/pytorch/torchdynamo/issues/468
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80544
Approved by: https://github.com/ezyang
This PR heavily simplifies the code of `linalg.solve`. At the same time,
this implementation saves quite a few copies of the input data in some
cases (e.g. A is contiguous)
We also implement it in such a way that the derivative goes from
computing two LU decompositions and two LU solves to no LU
decompositions and one LU solves. It also avoids a number of unnecessary
copies the derivative was unnecessarily performing (at least the copy of
two matrices).
On top of this, we add a `left` kw-only arg that allows the user to
solve `XA = B` rather concisely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74046
Approved by: https://github.com/nikitaved, https://github.com/IvanYashchuk, https://github.com/mruberry
This PR adds `linalg.lu_solve`. While doing so, I found a bug in MAGMA
when calling the batched MAGMA backend with trans=True. We work around
that by solving the system solving two triangular systems.
We also update the heuristics for this function, as they were fairly
updated. We found that cuSolver is king, so luckily we do not need to
rely on the buggy backend from magma for this function.
We added tests testing this function left and right. We also added tests
for the different backends. We also activated the tests for AMD, as
those should work as well.
Fixes https://github.com/pytorch/pytorch/issues/61657
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77634
Approved by: https://github.com/malfet
When I run against a list of DPER ops I get this list:
```
aten.count_nonzero.dim_IntList
aten.count_nonzero.default
aten.empty.memory_format # SKIP
aten.repeat_interleave.Tensor
aten.relu.default
aten.nonzero.out
aten.nonzero.default
```
Signed-off-by: Edward Z. Yang <ezyangfb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78526
Approved by: https://github.com/zou3519
Seems like it should be one. This will make it possible to register
meta implementations even when there is a CompositeImplicitAutograd
registration already. It also paves the way for sparse meta, etc.
Signed-off-by: Edward Z. Yang <ezyangfb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78469
Approved by: https://github.com/ngimel
This PR is a result of collaboration with @rdspring1 and @mruberry on primTorch.
It adds the following prims:
- `fmax`
- `fmin`
- `fmod`
And adds the following refs:
- `fmax`
- `fmin`
- `fmod`
- `logical_xor`
The work is in progress as there are some tests that fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78023
Approved by: https://github.com/mruberry
This PR...
**Issues Found**
- https://github.com/pytorch/pytorch/issues/78058
- https://github.com/pytorch/pytorch/issues/78054
- https://github.com/pytorch/pytorch/issues/78053
- https://github.com/pytorch/pytorch/issues/78050
- https://github.com/pytorch/pytorch/issues/77932
**Testing**
- disables stride consistency checks in test_ops and test_meta pending resolution of https://github.com/pytorch/pytorch/issues/78050
- skips chalf in reference tests (addressing https://github.com/pytorch/pytorch/issues/78054)
- splits test test_python_reference_consistency in one test for the ctx where torch.foo is torch.foo, and another for when torch.foo is refs.foo
- updates test names to be more natural and consistent:
- test_python_reference_errors -> test_python_ref_errors
- test_python_reference_consistency -> test_python_ref and test_python_ref_torch_fallback
- test_python_reference_meta_functions -> test_python_ref_meta
- test_reference_testing -> test_numpy_ref
- updates test_python_ref and test_python_ref_torch_fallback to check that the reference is more accurate than the torch op if the reference and torch op results are not close, a warning is raised when this occurs (addressing https://github.com/pytorch/pytorch/issues/77687)
- adds reference inputs for broadcast_tensors
- Updates the "fill_" OpInfo to "fill", adding a NumPy reference and making it an elementwise unary operator
- Adds 1D no element sample inputs to the cat OpInfo and updates the NumPy reference to handle them and type promotion correctly
- Adds reference inputs for elementwise ternary operations, like clamp
- Adds a NumPy reference for clamp
- Adds reference inputs to where's OpInfo
- Makes softplus an elementwise unary OpInfo
- Removes the great majority of Python reference OpInfo skips and xfails due to the above test changes
- Adds Python reference OpInfos for fill, dropout, clamp, broadcast_tensors, and where
**Prims**
- adds the fill, empty_strided, and uniform prims
- removes the empty, empty_like, full, and full_like prims -- these are now references that use empty_strided and fill
- renames the "concatenate" and "select" prims to "cat" and "where", respectively, to be consistent with PyTorch
- extends the `_elementwise_meta` operation to accepts tensors that don't participate in type promotion, like the `cond` tensor in `where`
- fixes a bug in the stride propagation of broadcast_in_dim
- moves some error checks from prims.cat to prims.where to refs.cat and refs.where, respectively, consistent with our new policy of doing as much error checking in the ref as possible
**Utils**
- adds the canoicalize_device, extract_shape, and extract_shape_from_varargs helpers
- adds the elementwise_unary_scalar_wrapper -- this allows elementwise unary operators to take and return scalar values (ex. refs.sin(1) will return .84...)
**Refs**
- adds the fill, broadcast_tensors, clamp, empty_strided, ones, zeros, and uniform references
- adds the nn.functional.dropout reference
- fixes refs.cat to handle 1D tensors with no inputs consistent with eager mode
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78026
Approved by: https://github.com/ngimel
This PR...
**Issues Found**
- https://github.com/pytorch/pytorch/issues/78058
- https://github.com/pytorch/pytorch/issues/78054
- https://github.com/pytorch/pytorch/issues/78053
- https://github.com/pytorch/pytorch/issues/78050
- https://github.com/pytorch/pytorch/issues/77932
**Testing**
- disables stride consistency checks in test_ops and test_meta pending resolution of https://github.com/pytorch/pytorch/issues/78050
- skips chalf in reference tests (addressing https://github.com/pytorch/pytorch/issues/78054)
- splits test test_python_reference_consistency in one test for the ctx where torch.foo is torch.foo, and another for when torch.foo is refs.foo
- updates test names to be more natural and consistent:
- test_python_reference_errors -> test_python_ref_errors
- test_python_reference_consistency -> test_python_ref and test_python_ref_torch_fallback
- test_python_reference_meta_functions -> test_python_ref_meta
- test_reference_testing -> test_numpy_ref
- updates test_python_ref and test_python_ref_torch_fallback to check that the reference is more accurate than the torch op if the reference and torch op results are not close, a warning is raised when this occurs (addressing https://github.com/pytorch/pytorch/issues/77687)
- adds reference inputs for broadcast_tensors
- Updates the "fill_" OpInfo to "fill", adding a NumPy reference and making it an elementwise unary operator
- Adds 1D no element sample inputs to the cat OpInfo and updates the NumPy reference to handle them and type promotion correctly
- Adds reference inputs for elementwise ternary operations, like clamp
- Adds a NumPy reference for clamp
- Adds reference inputs to where's OpInfo
- Makes softplus an elementwise unary OpInfo
- Removes the great majority of Python reference OpInfo skips and xfails due to the above test changes
- Adds Python reference OpInfos for fill, dropout, clamp, broadcast_tensors, and where
**Prims**
- adds the fill, empty_strided, and uniform prims
- removes the empty, empty_like, full, and full_like prims -- these are now references that use empty_strided and fill
- renames the "concatenate" and "select" prims to "cat" and "where", respectively, to be consistent with PyTorch
- extends the `_elementwise_meta` operation to accepts tensors that don't participate in type promotion, like the `cond` tensor in `where`
- fixes a bug in the stride propagation of broadcast_in_dim
- moves some error checks from prims.cat to prims.where to refs.cat and refs.where, respectively, consistent with our new policy of doing as much error checking in the ref as possible
**Utils**
- adds the canoicalize_device, extract_shape, and extract_shape_from_varargs helpers
- adds the elementwise_unary_scalar_wrapper -- this allows elementwise unary operators to take and return scalar values (ex. refs.sin(1) will return .84...)
**Refs**
- adds the fill, broadcast_tensors, clamp, empty_strided, ones, zeros, and uniform references
- adds the nn.functional.dropout reference
- fixes refs.cat to handle 1D tensors with no inputs consistent with eager mode
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78026
Approved by: https://github.com/ngimel
This PR...
**Filed the Following Issues**
- https://github.com/pytorch/pytorch/issues/77553
- https://github.com/pytorch/pytorch/issues/77526
- https://github.com/pytorch/pytorch/issues/77600
**Testing**
- Updates test_dtypes to longer attempt to test the backward of sample inputs where no inputs require grad
- Adds a new test_python_reference_errors; it ensures the meta operations for references throw errors as expected
- Updates compare_tensor_meta to better handle CUDA devices, and (temporarily) restricts stride checking to the CUDA device type
- Elementwise unary and elementwise binary operators now have arbitrarily strided reference inputs
- Reference inputs for _like functions are added
- An OpInfo for torch.empty is added
- Reference inputs for torch.clone are added
- A NumPy reference for clone is added
- Adds OpInfos for refs.empty and refs.empty_like
**Prims**
- Renames the "max" and "min" prims have been renamed to "maximum" and "minimum," respectively, to better conform to their ATen names
- Adds the empty, empty_like, full, and full_like prims
- Fixes the elementwise meta function's stride propagation
- Fixes clone's meta function's stride propagation
- Fixes convert_element_type's meta's stride propagation
- Adds a (temporary) _to_dtype pprivate prim that casts a tensor while preserving its stride permutation
- Removes the _set prim comment
- Adds utils.compute_elementwise_output_strides, which computes the correct output strides for elementwise operations
- Corrects an issue where utils.make_contiguous_strides_for was creating the incorrect strides for tensors with no elements
**References**
- Adds the empty, empty_like, full, full_like, and ones_like refs
- Extends make_elementwise_unary_reference to accept an additional callable to perform extra input validation
- Adds an extra validation function to handle refs.neg(BoolTensor)
- Updates the isfinite ref to call ones_like when appropriate
- Models Python scalar handling for elementwise binary operations
- Added a 64 dim check for the amin and amax references
- opmath is now a flag that can be set separately for cpu and CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77542
Approved by: https://github.com/ezyang
This PR does a number of things:
- Move linalg.vector_norm to structured kernels and simplify the logic
- Fixes a number of prexisting issues with the dtype kwarg of these ops
- Heavily simplifies and corrects the logic of `linalg.matrix_norm` and `linalg.norm` to be consistent with the docs
- Before the `_out` versions of these functions were incorrect
- Their implementation is now as efficient as expected, as it avoids reimplementing these operations whenever possible.
- Deprecates `torch.frobenius_norm` and `torch.nuclear_norm`, as they were exposed in the API and they are apparently being used in mobile (??!!) even though they were not documented and their implementation was slow.
- I'd love to get rid of these functions already, but I guess we have to go through their deprecation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76547
Approved by: https://github.com/mruberry
We don't have any coverage for meta tensor correctness for backwards
because torch function mode can only allow us to interpose on
Python torch API calls, but backwards invocations happen from C++.
To make this possible, I add torch_dispatch_meta test which runs the
tests with __torch_dispatch__
While doing this, I needed to generate fresh expected failure / skip
lists for the new test suite, and I discovered that my original
scaffolding for this purpose was woefully insufficient. So I rewrote
how the test framework worked, and at the same time rewrote the
__torch_function__ code to also use the new logic. Here's whats
new:
- Expected failure / skip is now done on a per function call basis,
rather than the entire test. This means that separate OpInfo
samples for a function don't affect each other.
- There are now only two lists: expect failure list (where the test
consistently fails on all runs) and skip list (where the test
sometimes passes and fails.
- We explicitly notate the dtype that failed. I considered detecting
when something failed on all dtypes, but this was complicated and
listing everything out seemed to be nice and simple. To keep the
dtypes short, I introduce a shorthand notation for dtypes.
- Conversion to meta tensors is factored into its own class
MetaConverter
- To regenerate the expected failure / skip lists, just run with
PYTORCH_COLLECT_EXPECT and filter on a specific test type
(test_meta or test_dispatch_meta) for whichever you want to update.
Other misc fixes:
- Fix max_pool1d to work with BFloat16 in all circumstances, by making
it dispatch and then fixing a minor compile error (constexpr doesn't
work with BFloat16)
- Add resolve_name for turning random torch API functions into string
names
- Add push classmethod to the Mode classes, so that you can more easily
push a mode onto the mode stack
- Add some more skips for missing LAPACK
- Added an API to let you query if there's already a registration for
a function, added a test to check that we register_meta for all
decompositions (except detach, that decomp is wrong lol), and then
update all the necessary sites to make the test pass.
Signed-off-by: Edward Z. Yang <ezyangfb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77477
Approved by: https://github.com/zou3519
Decompositions can be used to fill in meta support where necessary,
assuming the operations they decompose to support meta key.
This PR adds register_meta kwarg to register_decomposition that
optionally lets you register the meta to the C++ dispatch table
for meta tensors. I use this to then get the meta function for
where and huber_loss for free.
Signed-off-by: Edward Z. Yang <ezyangfb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77353
Approved by: https://github.com/mruberry