Summary:
Apple recently announced ML Compute, a new framework available in macOS Big Sur, which enables users to accelerate the training of neural networks on Mac hardware. This PR is the first in a series of PRs that will enable the integration with ML Compute. Most of the integration code will live in a separate subrepo named `mlc`.
The integration with `mlc` (ML Compute) will be very similar to that of XLA. We rely on registering our ops through:
```cpp
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl_UNBOXED(<op_schema_name>, &customized_op_kernel);
  ...
}
```
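As a rough sketch (not part of this PR), once the backend is registered under the `PrivateUse1` key and exposed to Python, usage would look something like the following; the `mlc` device string is an assumption for illustration only:
```python
import torch

# Hypothetical: assumes the ML Compute backend registers itself under a device
# named "mlc"; the name is illustrative and not defined in this PR.
x = torch.randn(64, 128, device="mlc")
w = torch.randn(128, 32, device="mlc")
y = x.matmul(w)  # dispatches to the kernel registered for this op under PrivateUse1
```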
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50634
Reviewed By: malfet
Differential Revision: D26614213
Pulled By: smessmer
fbshipit-source-id: 3b492b346c61cc3950ac880ac01a82fbdddbc07b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51807
Implemented torch.linalg.multi_dot similar to [numpy.linalg.multi_dot](https://numpy.org/doc/stable/reference/generated/numpy.linalg.multi_dot.html).
This function does not support broadcasting or batched inputs at the moment.
**NOTE**
numpy.linalg.multi_dot allows the first and last tensors to have more than 2 dimensions despite its docs stating they must be either 1D or 2D. This PR diverges from NumPy by enforcing the documented restriction.
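A quick usage sketch (shapes are arbitrary):
```python
import torch

A = torch.randn(10, 100)
B = torch.randn(100, 5)
C = torch.randn(5, 50)

# Chains the matrix products in the cost-optimal multiplication order.
out = torch.linalg.multi_dot([A, B, C])
assert out.shape == (10, 50)

# The first and last arguments may be 1D; they are treated as a row and a
# column vector respectively, so the chain below produces a 0-dim tensor.
v = torch.randn(10)
w = torch.randn(50)
s = torch.linalg.multi_dot([v, A, B, C, w])
```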
**TODO**
- [ ] Benchmark against NumPy
- [x] Add OpInfo testing
- [x] Remove unnecessary copy for out= argument
Test Plan: Imported from OSS
Reviewed By: nikithamalgifb
Differential Revision: D26375734
Pulled By: heitorschueroff
fbshipit-source-id: 839642692424c4b1783606c76dd5b34455368f0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51878
`fake_quantize_per_tensor_affine_cachemask` and
`fake_quantize_per_channel_affine_cachemask` are implementation
details of `fake_quantize_per_tensor_affine` and
`fake_quantize_per_channel_affine`, so this PR removes the
Python bindings for them since there is no need to
expose them.
Test Plan:
```
python test/test_quantization.py TestFakeQuantize
```
Imported from OSS
Reviewed By: albanD, bugra
Differential Revision: D26314173
fbshipit-source-id: 733c93a3951453e739b6ed46b72fbad2244f6e97
Summary:
Toward fixing https://github.com/pytorch/pytorch/issues/47624
~Step 1: add `TORCH_WARN_MAYBE` which can either warn once or every time in c++, and add a c++ function to toggle the value.
Step 2 will be to expose this to python for tests. Should I continue in this PR or should we take a different approach: add the python level exposure without changing any c++ code and then over a series of PRs change each call site to use the new macro and change the tests to make sure it is being checked?~
Step 1: add a python and c++ toggle to convert TORCH_WARN_ONCE into TORCH_WARN so the warnings can be caught in tests
Step 2: add a python-level decorator to use this toggle in tests
Step 3: (in future PRs): use the decorator to catch the warnings instead of `maybeWarnsRegex`
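A rough sketch of the intended test flow, assuming the Python toggle lands as `torch.set_warn_always` (the decorator from step 2 would wrap this pattern):
```python
import warnings
import torch

# Force TORCH_WARN_ONCE warnings to fire every time so tests can assert on them.
torch.set_warn_always(True)
try:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # ... call an op that emits a TORCH_WARN_ONCE warning ...
    # assert any("expected substring" in str(w.message) for w in caught)
finally:
    torch.set_warn_always(False)  # restore the warn-once default
```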
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48560
Reviewed By: ngimel
Differential Revision: D26171175
Pulled By: mruberry
fbshipit-source-id: d83c18f131d282474a24c50f70a6eee82687158f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50280
As mentioned in gh-43874, this adds a `rounding_mode={'true', 'trunc', 'floor'}`
argument so `torch.div` can be used as a replacement for `floor_divide` during
the transitional period.
I've included dedicated kernels for truncated and floor division which
aren't strictly necessary for float, but do perform significantly better (~2x) than
doing true division followed by a separate rounding kernel.
Note: I introduce new overloads for `aten::div` instead of just adding a default
`rounding_mode` because various JIT passes rely on the exact operator schema.
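Example of the new argument:
```python
import torch

a = torch.tensor([ 7., -7.])
b = torch.tensor([ 2.,  2.])

torch.div(a, b)                         # true division:      tensor([ 3.5000, -3.5000])
torch.div(a, b, rounding_mode='trunc')  # round toward zero:  tensor([ 3., -3.])
torch.div(a, b, rounding_mode='floor')  # round toward -inf:  tensor([ 3., -4.])
```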
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26123271
Pulled By: mruberry
fbshipit-source-id: 51a83717602114597ec9c4d946e35a392eb01d46
Summary:
Implements `np.diff` for single order differences only:
- method and function variants for `diff` and function variant for `diff_out`
- supports out variant, but not in-place since shape changes
- adds OpInfo entry, and test in `test_torch`
- automatic autograd because we are using the `Math` dispatch
_Update: we only support Tensors for prepend and append in this PR. See discussion below and comments for more details._
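A usage sketch of the Python API as implemented in this PR (single-order differences, Tensor-only prepend/append):
```python
import torch

x = torch.tensor([1, 3, 6, 10])
torch.diff(x)                              # tensor([2, 3, 4])

pre = torch.tensor([0])
torch.diff(x, prepend=pre)                 # tensor([1, 2, 3, 4])

m = torch.tensor([[1, 2, 4],
                  [1, 3, 9]])
torch.diff(m, dim=1)                       # tensor([[1, 2], [2, 6]])
```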
Currently there is a quirk in the c++ API based on how this is implemented: it is not possible to specify scalar prepend and appends without also specifying all 4 arguments.
That is because the goal is to match NumPy's diff signature of `diff(int n=1, int dim=-1, Union[Scalar, Tensor] prepend=None, Union[Scalar, Tensor] append=None)` where all arguments are optional, positional and in the correct order.
There are a couple of blockers. One is C++ overload ambiguity. This prevents us from simply doing `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)` etc. for all combinations of {Tensor, Scalar} x {Tensor, Scalar}.
One might ask: why not drop the default args for prepend and append and write out the whole power set of {Tensor, Scalar, omitted} x {Tensor, Scalar, omitted}? Aside from having to write 18 overloads, this is actually illegal because arguments with defaults must come after arguments without defaults. This would mean having to write `diff(prepend, append, n, dim)`, which is not desired. Finally, writing out the entire power set of all arguments n, dim, prepend, append is out of the question because that would involve 2 * 2 * 3 * 3 = 36 combinations. And if we include the out variant, that would be 72 overloads!
With this in mind, the current implementation actually still does `diff(int n=1, int dim=-1, Scalar? prepend=None, Tensor? append=None)`, but also makes use of `cpp_no_default_args`. The idea is to have only one of the 4 {Tensor, Scalar} x {Tensor, Scalar} overloads provide default arguments for the C++ API, and add `cpp_no_default_args` for the remaining 3 overloads. With this, the Python API works as expected, but some calls such as `diff(prepend=1)` won't work in the C++ API.
We can optionally add 18 more overloads that cover the {dim, n, no-args} x {scalar-tensor, tensor-scalar, scalar-scalar} x {out, non-out} cases for the C++ API. _[edit: counting is hard - just realized this number is still wrong. We should try to count the cases we do cover instead and subtract that from the total: (2 * 2 * 3 * 3) - (3 + 2^4) = 17. 3 comes from the 3 of 4 combinations of {tensor, scalar}^2 that we declare to be `cpp_no_default_args`, and the one remaining case that has default arguments covers 2^4 cases. So the actual count is 34 additional overloads to support all possible calls]_
_[edit: thanks to https://github.com/pytorch/pytorch/issues/50767 hacky_wrapper is no longer necessary; it is removed in the latest commit]_
hacky_wrapper was also necessary here because `Tensor?` will cause dispatch to look for the `const optional<Tensor>&` schema but also generate a `const Tensor&` declaration in Functions.h. hacky_wrapper allows us to define our function as `const Tensor&` but wraps it in optional for us, which avoids errors at both link and load time.
_[edit: rewrote the above to improve clarity and correct the fact that we actually need 18 more overloads (26 total), not 18 in total to complete the c++ api]_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50569
Reviewed By: H-Huang
Differential Revision: D26176105
Pulled By: soulitzer
fbshipit-source-id: cd8e77cc2de1117c876cd71c29b312887daca33f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51255
This is the same as #50561, but for per-channel fake_quant.
TODO before land: write this up better
Memory and performance impact (MobileNetV2): TODO
Performance impact (microbenchmarks): https://gist.github.com/vkuzo/fbe1968d2bbb79b3f6dd776309fbcffc
* forward pass on cpu: 512ms -> 750ms (+46%)
* forward pass on cuda: 99ms -> 128ms (+30%)
* note: the overall performance impact to training jobs should be minimal, because this is used for weights, and the overall fake-quantization cost is dominated by fake quantizing the activations
* note: we can optimize the perf in a future PR by reading once and writing twice
Test Plan:
```
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_forward_per_channel_cachemask_cuda
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cpu
python test/test_quantization.py TestFakeQuantize.test_backward_per_channel_cachemask_cuda
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D26117721
fbshipit-source-id: 798b59316dff8188a1d0948e69adf9e5509e414c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50561
Not for review yet, a bunch of TODOs need finalizing.
tl;dr; add an alternative implementation of `fake_quantize` which saves
a mask during the forward pass and uses it to calculate the backward.
There are two benefits:
1. the backward function no longer needs the input Tensor, and it can be
gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead
by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor
to pass around, but its size is 4x smaller than the input tensor. A
future optimization would be to pack the mask bitwise and unpack in the
backward.
2. the computation of `qval` can be done only once in the forward and
reused in the backward. No perf change observed, TODO verify with better
metrics.
TODO: describe in more detail
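A minimal sketch of the idea in plain autograd terms (illustrative only, not the actual `cachemask` kernels):
```python
import torch

class _FakeQuantCachedMask(torch.autograd.Function):
    """Forward saves only a boolean mask of the values that fell inside the
    quantization range, so the backward never needs the original input."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, quant_min, quant_max):
        q = torch.round(x / scale) + zero_point
        mask = (q >= quant_min) & (q <= quant_max)   # 1 byte/elem vs 4 for fp32 input
        y = (q.clamp(quant_min, quant_max) - zero_point) * scale
        ctx.save_for_backward(mask)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        # Straight-through estimator: pass gradient only where x was in range.
        return grad_out * mask, None, None, None, None

# y = _FakeQuantCachedMask.apply(torch.randn(4), 0.1, 0, -128, 127)
```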
Test Plan:
OSS / torchvision / MobileNetV2
```
python references/classification/train_quantization.py
--print-freq 1
--data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/
--output-dir ~/nfs/pytorch_vision_tests/
--backend qnnpack
--epochs 5
TODO paste results here
```
TODO more
Imported from OSS
Reviewed By: ngimel
Differential Revision: D25918519
fbshipit-source-id: ec544ca063f984de0f765bf833f205c99d6c18b6
Summary:
Add a new device type 'XPU' ('xpu' for lower case) to PyTorch. Changes are needed for code related to the device model and kernel dispatch, e.g. DeviceType, Backend, and DispatchKey.
https://github.com/pytorch/pytorch/issues/48246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49786
Reviewed By: mrshenli
Differential Revision: D25893962
Pulled By: ezyang
fbshipit-source-id: 7ff0a316ee34cf0ed6fc7ead08ecdeb7df4b0052
Summary:
This PR adds `torch.linalg.slogdet`.
Changes compared to the original torch.slogdet:
- Complex input now works as in NumPy
- Added out= variant (allocates temporary and makes a copy for now)
- Updated `slogdet_backward` to work with complex input
Ref. https://github.com/pytorch/pytorch/issues/42666
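Example:
```python
import torch

a = torch.randn(3, 3, dtype=torch.complex128)
sign, logabsdet = torch.linalg.slogdet(a)
# For complex input, sign is complex and logabsdet is real, matching NumPy;
# sign * exp(logabsdet) recovers the determinant.
det = sign * torch.exp(logabsdet)
```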
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49194
Reviewed By: VitalyFedyunin
Differential Revision: D25916959
Pulled By: mruberry
fbshipit-source-id: cf9be8c5c044870200dcce38be48cd0d10e61a48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49502
It broke the OSS CI the last time I landed it, mostly cuda tests and python bindings.
Similar to permute_out, add the out variant of `aten::narrow` (slice in c2) which does an actual copy. `aten::narrow` creates a view; however, a copy is incurred when we call `input.contiguous` in the ops that follow `aten::narrow`, in `concat_add_mul_replacenan_clip`, `casted_batch_one_hot_lengths`, and `batch_box_cox`.
{F351263599}
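For context, a small Python-level sketch of why the copy shows up and what an out variant buys (the c2/static-runtime op itself is not exercised here):
```python
import torch

x = torch.randn(8, 16)

v = x.narrow(1, 4, 8)       # `aten::narrow` returns a view; nothing is copied yet
c = v.contiguous()          # the copy happens here, into freshly allocated memory

# narrow_copy (the copying counterpart) does the copy up front; the out variant
# added in this diff lets that copy land in a preallocated buffer instead.
c2 = x.narrow_copy(1, 4, 8)
assert torch.equal(c, c2)
```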
Test Plan:
Unit test:
```
buck test //caffe2/aten:math_kernel_test
buck test //caffe2/test:sparse -- test_narrow
```
Benchmark with the adindexer model:
```
bs = 1 is neutral
Before:
I1214 21:32:51.919239 3285258 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0886948. Iters per second: 11274.6
After:
I1214 21:32:52.492352 3285277 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.0888019. Iters per second: 11261
bs = 20 shows more gains probably because the tensors are bigger and therefore the cost of copying is higher
Before:
I1214 21:20:19.702445 3227229 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.527563. Iters per second: 1895.51
After:
I1214 21:20:20.370173 3227307 PyTorchPredictorBenchLib.cpp:209] PyTorch run finished. Milliseconds per iter: 0.508734. Iters per second: 1965.67
```
Reviewed By: ajyu
Differential Revision: D25596290
fbshipit-source-id: da2f5a78a763895f2518c6298778ccc4d569462c
Summary:
This PR adds `torch.linalg.pinv`.
Changes compared to the original `torch.pinverse`:
* New kwarg "hermitian": with `hermitian=True` eigendecomposition is used instead of singular value decomposition.
* `rcond` argument can now be a `Tensor` of appropriate shape to apply matrix-wise clipping of singular values.
* Added `out=` variant (allocates temporary and makes a copy for now)
Ref. https://github.com/pytorch/pytorch/issues/42666
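Example:
```python
import torch

a = torch.randn(5, 3, dtype=torch.complex128)
p = torch.linalg.pinv(a)                          # SVD-based, as before

# hermitian=True switches to an eigendecomposition-based path.
h = a @ a.conj().transpose(-2, -1)                # Hermitian matrix
ph = torch.linalg.pinv(h, hermitian=True)

# rcond can now be a Tensor, e.g. one cutoff per matrix in a batch.
batch = torch.randn(4, 5, 3)
pb = torch.linalg.pinv(batch, rcond=torch.full((4,), 1e-10))
```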
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48399
Reviewed By: zhangguanheng66
Differential Revision: D25869572
Pulled By: mruberry
fbshipit-source-id: 0f330a91d24ba4e4375f648a448b27594e00dead
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48965
This PR pulls `__torch_function__` checking entirely into C++, and adds a special `object_has_torch_function` method for ops which only have one arg, as this lets us skip tuple construction and unpacking. We can now also do away with the Python-side fast bailout for `Tensor` (e.g. `if any(type(t) is not Tensor for t in tensors) and has_torch_function(tensors)`) because it is actually slower than checking with the Python C API.
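For reference, this is the Python-side protocol the C++ check dispatches on (a minimal handler, purely illustrative):
```python
import torch

class MyTensorLike:
    def __init__(self, t):
        self.t = t

    # Any argument whose type defines __torch_function__ makes the check fire
    # and hands control to this handler.
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        unwrapped = [a.t if isinstance(a, MyTensorLike) else a for a in args]
        return func(*unwrapped, **kwargs)

torch.add(MyTensorLike(torch.ones(2)), torch.ones(2))  # routed through __torch_function__
```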
Test Plan: Existing unit tests. Benchmarks are in #48966
Reviewed By: ezyang
Differential Revision: D25590732
Pulled By: robieta
fbshipit-source-id: 6bd74788f06cdd673f3a2db898143d18c577eb42
Summary:
This PR adds `torch.linalg.inv` for NumPy compatibility.
`linalg_inv_out` uses in-place operations on the provided `result` tensor.
I modified `apply_inverse` to accept a tensor of ints instead of std::vector; that way we can write a function similar to `linalg_inv_out` but without the error checks and device memory synchronization.
I fixed `lda` (leading dimension parameter which is max(1, n)) in many places to handle 0x0 matrices correctly.
Zero batch dimensions are also working and tested.
Ref https://github.com/pytorch/pytorch/issues/42666
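Example:
```python
import torch

a = torch.randn(4, 3, 3, dtype=torch.float64)     # batch of square matrices
a_inv = torch.linalg.inv(a)
err = (a @ a_inv - torch.eye(3, dtype=torch.float64)).abs().max()   # ~0

# Degenerate shapes work too: 0x0 matrices and empty batch dimensions.
torch.linalg.inv(torch.empty(0, 0, dtype=torch.float64))
torch.linalg.inv(torch.empty(0, 3, 3, dtype=torch.float64))
```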
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48261
Reviewed By: gchanan
Differential Revision: D25849590
Pulled By: mruberry
fbshipit-source-id: cfee6f1daf7daccbe4612ec68f94db328f327651
Summary:
This is related to https://github.com/pytorch/pytorch/issues/42666 .
I am opening this PR to have the opportunity to discuss things.
First, we need to consider the differences between `torch.svd` and `numpy.linalg.svd`:
1. `torch.svd` takes `some=True`, while `numpy.linalg.svd` takes `full_matrices=True`, which is effectively the opposite (and with the opposite default, too!)
2. `torch.svd` returns `(U, S, V)`, while `numpy.linalg.svd` returns `(U, S, VT)` (i.e., V transposed).
3. `torch.svd` always returns a 3-tuple; `numpy.linalg.svd` returns only `S` in case `compute_uv==False`
4. `numpy.linalg.svd` also takes an optional `hermitian=False` argument.
I think that the plan is to eventually deprecate `torch.svd` in favor of `torch.linalg.svd`, so this PR does the following:
1. Rename/adapt the old `svd` C++ functions into `linalg_svd`: in particular, now `linalg_svd` takes `full_matrices` and returns `VT`
2. Re-implement the old C++ interface on top of the new (by negating `full_matrices` and transposing `VT`).
3. The C++ version of `linalg_svd` *always* returns a 3-tuple (we can't do anything else). So, there is a python wrapper which manually calls `torch._C._linalg.linalg_svd` to tweak the return value in case `compute_uv==False`.
Currently, `linalg_svd_backward` is broken because it has not been adapted yet after the `V ==> VT` change, but before continuing and spending more time on it I wanted to make sure that the general approach is fine.
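A sketch of how the two APIs compare after this change:
```python
import torch

a = torch.randn(5, 3, dtype=torch.float64)

# New API: NumPy-style full_matrices flag, V is returned transposed (VT).
u, s, vh = torch.linalg.svd(a, full_matrices=False)
recon = (u * s) @ vh                    # reconstructs a up to numerical error

# Old API, now reimplemented on top of the new one: some flag, V (not VT).
u2, s2, v2 = torch.svd(a, some=True)
```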
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45562
Reviewed By: H-Huang
Differential Revision: D25803557
Pulled By: mruberry
fbshipit-source-id: 4966f314a0ba2ee391bab5cda4563e16275ce91f
Summary:
I am opening this PR early to have a place to discuss design issues.
The biggest difference between `torch.qr` and `numpy.linalg.qr` is that the former takes a boolean parameter `some=True`, while the latter takes a string parameter `mode='reduced'` which can be one of the following:
- `reduced`: completely equivalent to `some=True`; both are the default.
- `complete`: completely equivalent to `some=False`.
- `r`: returns only `r` instead of a tuple `(q, r)`. We have already decided that we don't want different return types depending on the parameters, so I propose to return `(r, empty_tensor)` instead. I **think** that in this mode it will be impossible to implement the backward pass, so we should raise an appropriate error in that case.
- `raw`: returns `(h, tau)` instead of `(q, r)`. Internally, `h` and `tau` are obtained by calling LAPACK's `dgeqrf` and are later used to compute the actual values of `(q, r)`. The NumPy docs suggest that these might be useful for calling other LAPACK functions, but at the moment none of them is exposed by NumPy and I don't know how often it is used in the real world. Implementing the backward pass needs some attention: the most straightforward solution is to use `(h, tau)` to compute `(q, r)` and then use the normal logic for `qr_backward`, but there might be faster alternatives.
- `full`, `f`: alias for `reduced`, deprecated since NumPy 1.8.0.
- `economic`, `e`: similar to `raw` but returns only `h` instead of `(h, tau)`. Deprecated since NumPy 1.8.0.
To summarize:
* `reduced`, `complete` and `r` are straightforward to implement.
* `raw` needs a bit of extra care, but I don't know how high priority it is: since it is used rarely, we might want to not support it right now and maybe implement it in the future?
* I think we should just leave `full` and `economic` out, and possibly add a note to the docs explaining what to use instead.
/cc mruberry
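A sketch of what the proposed interface would look like (the `reduced` and `complete` modes mirror today's `some=True/False`; `mode='r'` follows the proposal above):
```python
import torch

a = torch.randn(5, 3, dtype=torch.float64)

q, r = torch.linalg.qr(a)                      # mode='reduced' (default): q is (5, 3), r is (3, 3)
qc, rc = torch.linalg.qr(a, mode='complete')   # q is (5, 5), r is (5, 3)
# mode='r' would skip computing q and return an empty tensor in its place.
```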
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47764
Reviewed By: ngimel
Differential Revision: D25708870
Pulled By: mruberry
fbshipit-source-id: c25c70a23a02ec4322430d636542041e766ebe1b
Summary:
This PR adds `torch.linalg.inv` for NumPy compatibility.
`linalg_inv_out` uses in-place operations on the provided `result` tensor.
I modified `apply_inverse` to accept a tensor of ints instead of std::vector; that way we can write a function similar to `linalg_inv_out` but without the error checks and device memory synchronization.
I fixed `lda` (leading dimension parameter which is max(1, n)) in many places to handle 0x0 matrices correctly.
Zero batch dimensions are also working and tested.
Ref https://github.com/pytorch/pytorch/issues/42666
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48261
Reviewed By: ngimel
Differential Revision: D25690129
Pulled By: mruberry
fbshipit-source-id: edb2d03721f22168c42ded8458513cb23dfdc712
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49214
**BC-Breaking**
Before this PR, `%=` didn't actually do the operation in-place and returned a new tensor.
After this PR, the `%=` operation is actually in-place and the modified input tensor is returned.
Before PR,
```python
>>> import torch
>>> a = torch.tensor([11,12,13])
>>> id(a)
139627966219328
>>> a %= 10
>>> id(a)
139627966219264
```
After PR,
```python
>>> import torch
>>> a = torch.tensor([11,12,13])
>>> id(a)
139804702425280
>>> a %= 10
>>> id(a)
139804702425280
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49390
Reviewed By: izdeby
Differential Revision: D25560423
Pulled By: zou3519
fbshipit-source-id: 2b92bfda260582aa4ac22c4025376295e51f854e
Summary:
Related https://github.com/pytorch/pytorch/issues/38349
Implement NumPy-like function `torch.broadcast_to` to broadcast the input tensor to a new shape.
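Example:
```python
import torch

x = torch.tensor([1, 2, 3])
y = torch.broadcast_to(x, (3, 3))   # view with shape (3, 3); no data is copied
# tensor([[1, 2, 3],
#         [1, 2, 3],
#         [1, 2, 3]])
```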
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48997
Reviewed By: anjali411, ngimel
Differential Revision: D25663937
Pulled By: mruberry
fbshipit-source-id: 0415c03f92f02684983f412666d0a44515b99373
Summary:
This PR adds `torch.linalg.solve`.
`linalg_solve_out` uses in-place operations on the provided result tensor.
I modified `apply_solve` to accept a tensor of ints instead of std::vector; that way we can write a function similar to `linalg_solve_out` but without the error checks and device memory synchronization.
In comparison to `torch.solve` this routine accepts 1-dimensional tensors and batches of 1-dim tensors for the right-hand-side term. `torch.solve` requires it to be at least 2-dimensional.
Ref. https://github.com/pytorch/pytorch/issues/42666
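Example:
```python
import torch

A = torch.randn(3, 3, dtype=torch.float64)
b = torch.randn(3, dtype=torch.float64)        # 1-D right-hand side is accepted here
x = torch.linalg.solve(A, b)
print((A @ x - b).abs().max())                  # ~0

# Batched coefficients with a batch of 1-D right-hand sides also work.
Ab = torch.randn(4, 3, 3, dtype=torch.float64)
bb = torch.randn(4, 3, dtype=torch.float64)
xb = torch.linalg.solve(Ab, bb)
```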
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48456
Reviewed By: izdeby
Differential Revision: D25562222
Pulled By: mruberry
fbshipit-source-id: a9355c029e2442c2e448b6309511919631f9e43b
Summary:
This PR changes the `aten::native_layer_norm` and `aten::native_layer_norm_backward` signatures to match the `torch.layer_norm` definition. The current signatures don't provide enough information for the PyTorch JIT to fuse layer_norm during training.
`native_layer_norm(X, gamma, beta, M, N, eps)` =>
`native_layer_norm(input, normalized_shape, weight, bias, eps)`
`native_layer_norm_backward(dY, X, mean, rstd, gamma, M, N, grad_input_mask)` =>
`native_layer_norm_backward(dY, input, normalized_shape, mean, rstd, weight, bias, grad_input_mask)`
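For reference, the Python-level call whose argument layout the native op now mirrors:
```python
import torch
import torch.nn.functional as F

x = torch.randn(20, 5, 10)
normalized_shape = (5, 10)
weight = torch.ones(normalized_shape)
bias = torch.zeros(normalized_shape)

y = F.layer_norm(x, normalized_shape, weight, bias, eps=1e-5)
```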
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48971
Reviewed By: izdeby
Differential Revision: D25574070
Pulled By: ngimel
fbshipit-source-id: 23e2804295a95bda3f1ca6b41a1e4c5a3d4d31b4
Summary:
Ref https://github.com/pytorch/pytorch/issues/42175
This removes the 4 deprecated spectral functions: `torch.{fft,rfft,ifft,irfft}`. The `torch.fft` module is also now imported by default.
The actual `at::native` functions are still used in `torch.stft`, so they can't be fully removed yet, but they will be once https://github.com/pytorch/pytorch/issues/47601 has been merged.
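The replacements all live in the `torch.fft` module, e.g.:
```python
import torch
import torch.fft   # now imported with torch by default; the explicit import still works

t = torch.arange(4, dtype=torch.float32)
torch.fft.fft(t)                            # replaces the removed torch.fft(...) function
torch.fft.rfft(t)                           # replaces torch.rfft(...)
torch.fft.irfft(torch.fft.rfft(t), n=4)     # replaces torch.irfft(...)
```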
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48594
Reviewed By: heitorschueroff
Differential Revision: D25298929
Pulled By: mruberry
fbshipit-source-id: e36737fe8192fcd16f7e6310f8b49de478e63bf0